AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

Fairseq dictionary Size #78

Closed harshyadav17 closed 4 months ago

harshyadav17 commented 5 months ago

Hey @PranjalChitale,

I have been facing this issue for quite some time now. I am training the same model from scratch using the BPCC data (230M) and the provided SPM models (SRC 32k, TGT 128k). But when I run prepare_data_joint_training.sh, the last step (fairseq-preprocess) reports an unexpectedly large dictionary size for SRC:

fairseq_cli.preprocess | [SRC] Dictionary: 639088 types

For TGT it is:

fairseq_cli.preprocess | [TGT] Dictionary: 111448 types
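
A quick way to sanity-check this is to count the distinct tokens in the files that go into fairseq-preprocess; the file paths below are placeholders for wherever prepare_data_joint_training.sh writes its final binarization input:

```python
# Count the distinct space-separated tokens on each side of the data fed to
# fairseq-preprocess. If the SPM stage ran, SRC should be at most ~32k types
# and TGT at most ~128k; counts in the hundreds of thousands suggest that raw
# (untokenized) text was binarized.
# NOTE: the file paths are assumptions; adjust them to the actual output of
# prepare_data_joint_training.sh.
def count_types(path):
    types = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            types.update(line.split())
    return len(types)

for side, path in [("SRC", "final_bin_input/train.SRC"),
                   ("TGT", "final_bin_input/train.TGT")]:
    print(side, count_types(path), "types")
```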

Could you please help me figure out where exactly I am going wrong?

Thanks!

PranjalChitale commented 5 months ago

Please check your preprocessing logs, as it is likely that one of the stages failed (I suspect the SPM stage).

The rationale behind this is that we want the vocabulary to be restricted to a predefined size. This is achieved by applying the SPM model to the data, which limits the number of types to at most the SPM vocabulary size.

Therefore, if you follow all the preprocessing steps correctly and binarize data that has been preprocessed with the SPM model, the fairseq dictionary cannot end up with more types than the specified vocabulary size.
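
A minimal sketch of this bound, assuming the provided SRC SPM model is available locally (the model path in the snippet is a placeholder):

```python
import sentencepiece as spm

# Load the provided SPM model; the path here is a placeholder.
sp = spm.SentencePieceProcessor(model_file="vocab/model.SRC")
print("SPM vocab size:", sp.get_piece_size())   # expected to be ~32k for SRC

# Every piece produced by encoding is drawn from this fixed inventory, so the
# number of distinct tokens in SPM-encoded text can never exceed the vocab size.
pieces = sp.encode("This is a sample sentence.", out_type=str)
print(pieces)
print(all(0 <= sp.piece_to_id(p) < sp.get_piece_size() for p in pieces))  # True
```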

Please recheck whether you have modified any of the preprocessing code and whether all the preprocessing steps completed successfully.
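
As a quick way to see which side is affected, you can also count the entries in the dictionaries written by fairseq-preprocess; each line of dict.<lang>.txt is a "token count" pair, so the line count should be close to the SPM vocabulary size. The output directory below is an assumption; use whatever destination directory the script passes to fairseq-preprocess.

```python
# Count the entries in the fairseq dictionaries produced by fairseq-preprocess.
# The directory name is an assumption; point it at the binarized output folder.
for side in ("SRC", "TGT"):
    path = f"final_bin/dict.{side}.txt"
    with open(path, encoding="utf-8") as f:
        print(side, sum(1 for _ in f), "entries")
```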