AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

Convert fairseq tokenizer (vocab and final_bin) to HF Autotokenizer #71

Closed — harshyadav17 closed this issue 3 months ago

harshyadav17 commented 3 months ago

Hey @prajdabre @PranjalChitale, I have converted my fine-tuned fairseq model to HF format using the following script: https://github.com/AI4Bharat/IndicTrans2/blob/main/huggingface_interface/convert_indictrans_checkpoint_to_pytorch.py

At the moment, I am stuck on how to convert the custom tokenizer (vocab and final_bin) into an HF AutoTokenizer. It would be great if you could share the script/steps for this.

Thanks!

PranjalChitale commented 3 months ago

Please check the Hugging Face repo for the necessary scripts and refer to the commits for detailed steps to make it compatible with AutoTokenizer.

If you have any specific questions, feel free to post them here, and we'll be happy to help.

harshyadav17 commented 3 months ago

@PranjalChitale thanks!

How did you generate dict.SRC.json and dict.TGT.json? I see they have different IDs compared to the .txt files present in the final_bin folder, or the ones shared by you here: https://indictrans2-public.objectstore.e2enetworks.net/en-indic-fairseq-dict.zip

It would be great if you could share the thought process behind generating these JSON files.

Thanks!

PranjalChitale commented 3 months ago

@harshyadav17, the difference between the "txt" and "json" files arises because Fairseq automatically includes the following special tokens when the dictionary is loaded:

  "<s>": 0,
  "<pad>": 1,
  "</s>": 2,
  "<unk>": 3,

You can check this out here.
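For reference, this behavior can be observed directly with fairseq's Dictionary class (a minimal sketch, assuming fairseq is installed and the dictionary path below is adjusted to your setup):

```python
from fairseq.data import Dictionary

# A freshly constructed Dictionary already contains the four special tokens.
d = Dictionary()
print(d.bos(), d.pad(), d.eos(), d.unk())  # -> 0 1 2 3

# Loading a dict.txt file appends its entries after these four.
# d = Dictionary.load("final_bin/dict.SRC.txt")
```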

Therefore, we need to manually add these tokens at the beginning, followed by all the entries from the dictionary file.

The IDs themselves are not different; Fairseq simply adds these tokens automatically, while in this case, we are adding them manually.
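To illustrate, here is a minimal sketch of how such a JSON vocabulary could be assembled from a Fairseq dict.txt. This is not the exact script from the HF repo; the file paths are assumptions based on this thread.

```python
import json

# Hypothetical paths; adjust to your own setup.
FAIRSEQ_DICT = "final_bin/dict.SRC.txt"  # lines of "<token> <count>"
OUTPUT_JSON = "dict.SRC.json"

# Fairseq adds these special tokens automatically when it loads a dictionary,
# so we prepend them manually before the entries from dict.txt.
vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}

with open(FAIRSEQ_DICT, encoding="utf-8") as f:
    for line in f:
        token = line.rstrip("\n").rsplit(" ", 1)[0]  # drop the trailing count
        if token not in vocab:
            vocab[token] = len(vocab)

with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
```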

Hope this makes it clear.