Closed: harshyadav17 closed this issue 3 months ago
Please check the Hugging Face repo for the necessary scripts and refer to the commits for detailed steps to make it compatible with AutoTokenizer.
If you have any specific questions, feel free to post them here, and we'll be happy to help.
@PranjalChitale thanks!
How did you get dict.SRC.json and dict.TGT.json? They have different IDs compared to the .txt files in the final_bin folder and the ones you shared here: https://indictrans2-public.objectstore.e2enetworks.net/en-indic-fairseq-dict.zip
It would be great if you could share the thought process behind generating these JSON files.
Thanks!
@harshyadav17, the difference between the "txt" and "json" files arises because Fairseq automatically includes the following special tokens when the dictionary is loaded:
"<s>": 0,
"<pad>": 1,
"</s>": 2,
"<unk>": 3,
You can check this out here.
Therefore, we need to manually add these tokens at the beginning, followed by all the entries from the dictionary file.
The IDs themselves are not different; Fairseq simply adds these tokens automatically, while in this case, we are adding them manually.
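To make the ID offset concrete, here is a minimal sketch of that manual step. It assumes each line of the Fairseq dictionary file has the form `token count`, and it does not handle Fairseq extras such as the `#fairseq:overwrite` flag or padding entries. The function name `fairseq_dict_to_vocab` and the file names in the usage comment are illustrative, not taken from the repo scripts.

```python
import json

# Special tokens in the order Fairseq registers them when a dictionary is loaded.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>"]


def fairseq_dict_to_vocab(lines):
    """Build a token -> id mapping from the lines of a Fairseq dict .txt file.

    Assumes each non-empty line looks like "<token> <count>".
    """
    # IDs 0-3 are reserved for the special tokens.
    vocab = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        # Split on the last space so the count is separated from the token.
        token, _count = line.rsplit(" ", 1)
        if token in vocab:
            raise ValueError(f"duplicate token in dictionary: {token!r}")
        # The first entry of the .txt file therefore gets ID 4, the next 5, ...
        vocab[token] = len(vocab)
    return vocab


# Hypothetical usage (paths are placeholders):
# with open("dict.SRC.txt", encoding="utf-8") as f:
#     vocab = fairseq_dict_to_vocab(f)
# with open("dict.SRC.json", "w", encoding="utf-8") as f:
#     json.dump(vocab, f, ensure_ascii=False, indent=2)
```

With this layout, a token on the n-th line of the .txt file (counting from zero) ends up with ID n + 4 in the JSON, which matches what Fairseq assigns internally.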
Hope this makes it clear.
Hey @prajdabre @PranjalChitale, I have converted my fine-tuned fairseq model to HF format using the following script: https://github.com/AI4Bharat/IndicTrans2/blob/main/huggingface_interface/convert_indictrans_checkpoint_to_pytorch.py
I am currently stuck on how to convert the custom tokenizer (vocab and final_bin) into a form that HF AutoTokenizer can load. It would be great if you could share the script or steps for this.
Thanks!