Open altctrl00 opened 1 year ago
Thanks! By the way, your work is very exciting!
Thank you. I am really excited that you like our work.
It seems that the TED dataset you mentioned has 60 languages, but you only use 59?
Yes, it seems that there are 60 language pairs in the raw dataset. But the column named `calv` is empty, which indicates that there is no parallel corpus between `calv` and `en`. Therefore, there are 59 language pairs in the TED-59 dataset.
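For anyone who wants to verify this themselves, here is a minimal sketch of how the empty column could be detected. It is not part of `ted_reader.py`. It assumes the raw data is a tab-separated file with a header row, one column per language, an identifier column called `talk_name`, and missing sentences marked as `__NULL__`. Those names, and the helper `empty_language_columns`, are assumptions made for illustration, so adjust them to match the files you actually extract.

```python
import csv
import io


def empty_language_columns(tsv_file, id_column="talk_name", null_marker="__NULL__"):
    """Return the names of language columns that hold no usable sentence.

    `id_column` and `null_marker` are assumed defaults, not confirmed
    properties of the raw TED files.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    # One counter per language column; the identifier column is skipped.
    counts = {name: 0 for name in reader.fieldnames if name != id_column}
    for row in reader:
        for name in counts:
            cell = (row.get(name) or "").strip()
            # Count a cell only if it is non-blank and not the placeholder.
            if cell and cell != null_marker:
                counts[name] += 1
    return sorted(name for name, n in counts.items() if n == 0)


# Tiny in-memory stand-in for a real TSV file (made-up rows).
sample = io.StringIO(
    "talk_name\ten\tfr\tcalv\n"
    "talk_1\tHello\tBonjour\t__NULL__\n"
    "talk_2\tThanks\t__NULL__\t\n"
)
print(empty_language_columns(sample))
```

On a real file, you would pass an open file handle (for example `open(path, encoding="utf-8", newline="")`) instead of the `io.StringIO` sample.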
Hi, thanks for your attention.
The TED-59 dataset can be accessed from http://phontron.com/data/ted_talks.tar.gz. The Python script `ted_reader.py` in this repository can be used to read the parallel corpus. The OPUS-100 dataset can be found at https://github.com/EdinburghNLP/opus-100-corpus.