Please check your preprocessing logs, as it is likely that some stage may have failed (I suspect the SPM stage).
The rationale is that we want the vocabulary to be restricted to a predefined size. This is achieved by applying the SPM model to the data, which limits the number of types to at most the vocabulary size.
Therefore, if you follow all the preprocessing steps correctly and binarize the data that was tokenized with the SPM model, the fairseq dictionary cannot end up with more types than the specified vocabulary size.
Please recheck if you have modified any preprocessing code or if all the preprocessing steps were completed successfully.
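To make this concrete, here is a quick sanity check you could run on the SPM-encoded file before binarization (a minimal sketch, not part of the actual pipeline; the file path and vocabulary size below are placeholders you would need to adjust):

```python
# Sanity check: count the unique types in the SPM-encoded file that
# fairseq-preprocess will binarize. If the SPM stage ran correctly,
# this count should be <= the SPM vocabulary size; a much larger count
# usually means fairseq is building a dictionary over raw (untokenized) text.

from collections import Counter

SPM_ENCODED_FILE = "path/to/train.SRC"  # placeholder: your SPM-encoded source file
VOCAB_SIZE = 32000                      # placeholder: e.g. 32k for the SRC side

counter = Counter()
with open(SPM_ENCODED_FILE, encoding="utf-8") as f:
    for line in f:
        counter.update(line.split())

print(f"unique types: {len(counter)} (expected <= {VOCAB_SIZE})")
if len(counter) > VOCAB_SIZE:
    print("warning: more types than the SPM vocabulary; the SPM stage likely failed or was skipped")
```

If this count is already far above the vocabulary size, the problem is upstream of fairseq-preprocess.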
hey @PranjalChitale
I have been facing this issue for quite some time now. I am training the same model from scratch using the BPCC data (230M) and the provided SPM models (SRC 32k, TGT 128k). But when running prepare_data_joint_training.sh, at the last step, i.e. fairseq-preprocess, I am getting a weird dictionary size for SRC:
fairseq_cli.preprocess | [SRC] Dictionary: 639088 types
For TGT it is:
fairseq_cli.preprocess | [TGT] Dictionary: 111448 types
Can you please help me with this? Where exactly am I going wrong?
Thanks!