AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

How to combine data from different sources for training #17

Closed by look4pritam 1 year ago

look4pritam commented 1 year ago

Thank you for the wonderful work.

I am trying to reproduce the results and have one question.

How do I combine data from different sources for training?

I could not find a script for this.

PranjalChitale commented 1 year ago

You can use the following script to merge and dedup data from different subsets of BPCC.
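
For illustration only, a minimal Python sketch of merging and deduplicating parallel data might look like the following. This is not the script the maintainers refer to. The function names `read_parallel` and `merge_and_dedup`, the one-sentence-per-line file layout, and the choice to deduplicate on exact (whitespace-stripped) sentence pairs are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def read_parallel(src_path: str, tgt_path: str) -> Iterator[Pair]:
    """Yield (source, target) pairs from two line-aligned text files.

    Hypothetical helper: assumes one sentence per line and that line i of
    src_path is the translation of line i of tgt_path.
    """
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        for s, t in zip(fs, ft):
            yield s.rstrip("\n"), t.rstrip("\n")


def merge_and_dedup(sources: Iterable[Iterable[Pair]]) -> List[Pair]:
    """Concatenate several parallel corpora and drop exact duplicate pairs.

    Hypothetical helper: a pair counts as a duplicate only if both sides
    match exactly after stripping surrounding whitespace. Order of first
    occurrence is preserved.
    """
    seen = set()
    merged: List[Pair] = []
    for source in sources:
        for src, tgt in source:
            pair = (src.strip(), tgt.strip())
            # Skip pairs with an empty side and pairs already kept.
            if not pair[0] or not pair[1] or pair in seen:
                continue
            seen.add(pair)
            merged.append(pair)
    return merged
```

Under these assumptions, each subset would be read with `read_parallel` and the resulting iterables passed together to `merge_and_dedup`. Keeping the set of seen pairs in memory is simple but may not scale to very large corpora, where hashing each pair or sorting on disk would be more practical.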

look4pritam commented 1 year ago

Thank you very much for the quick reply. While trying to reproduce the results, I found some errors in the documentation for training the models. If you are interested, I can open a merge request with the corrections.