Hi, how can I get the TED-59 and OPUS-100 datasets?

cordercorder / nmt-multi

Codebase for multilingual neural machine translation

MIT License

Hi, how can I get the TED-59 and OPUS-100 datasets? #1

Open altctrl00 opened 1 year ago

cordercorder commented 1 year ago

Hi, thanks for your attention.

The TED-59 dataset can be downloaded from http://phontron.com/data/ted_talks.tar.gz. The Python script ted_reader.py in this repository can be used to read the parallel corpus.
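
For readers who want to script the download, here is a minimal sketch using only the Python standard library. It is not part of this repository: the helper names `download` and `extract` and the target paths are hypothetical, and the layout of the files inside the archive is not assumed. Only the URL comes from the comment above.

```python
import tarfile
import urllib.request
from pathlib import Path


def download(url, archive_path):
    """Fetch url to archive_path unless the file already exists (hypothetical helper)."""
    archive_path = Path(archive_path)
    archive_path.parent.mkdir(parents=True, exist_ok=True)
    if not archive_path.exists():
        urllib.request.urlretrieve(url, str(archive_path))
    return archive_path


def extract(archive_path, dest_dir):
    """Unpack a .tar.gz into dest_dir and return the extracted file paths."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Refuse members that would be written outside dest_dir.
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file())
```

Something like `extract(download("http://phontron.com/data/ted_talks.tar.gz", "data/ted_talks.tar.gz"), "data/ted")` would then return the list of extracted files, which can be passed to ted_reader.py or inspected by hand.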

The OPUS-100 dataset can be found at https://github.com/EdinburghNLP/opus-100-corpus.

altctrl00 commented 1 year ago

Thanks. By the way, your work is very exciting!

cordercorder commented 1 year ago

Thank you. I am really excited that you like our work.

altctrl00 commented 1 year ago

It seems that the TED dataset you mentioned has 60 languages, but you only use 59?

cordercorder commented 1 year ago

Yes, the raw dataset appears to contain 60 language pairs. However, the column named calv is empty, which means there is no parallel corpus between calv and en. The TED-59 dataset therefore has 59 language pairs.
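
The empty-column check can be reproduced with a short script. The sketch below is only an illustration under stated assumptions: it assumes the raw files are tab-separated with one column per language, and that a missing translation appears as a blank cell or as the placeholder `__NULL__`. The function names and the toy data are hypothetical, not taken from the dataset.

```python
import csv
import io

# Assumption: a blank cell or this placeholder marks a missing translation.
MISSING = {"", "__NULL__"}


def count_filled_cells(tsv_file, missing=MISSING):
    """Return {column name: number of rows whose cell is not missing}."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = {name: 0 for name in (reader.fieldnames or [])}
    for row in reader:
        for name in counts:
            cell = (row.get(name) or "").strip()
            if cell not in missing:
                counts[name] += 1
    return counts


def empty_columns(counts):
    """Columns that have no usable cell at all."""
    return sorted(name for name, n in counts.items() if n == 0)


# Toy stand-in for a raw file; the real column set may differ.
sample = io.StringIO(
    "talk_name\ten\tfr\tcalv\n"
    "talk_a\tHello\tBonjour\t__NULL__\n"
    "talk_b\tThanks\tMerci\t\n"
)
print(empty_columns(count_filled_cells(sample)))  # prints ['calv']
```

On the real data, the same functions could be applied to a file opened with `open(path, encoding="utf-8", newline="")`; a language whose column is reported as empty has no sentence pairs with en.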