Closed valentincalomme closed 4 years ago
We use fastText on each sentence to detect the language. Like any statistical approach, it makes errors. However, we only used the highest-scoring language. In the next version, we will also threshold the likelihood of the detected language to exclude cases where LID is not sure, e.g. P("en")=0.34, P("de")=0.33, P("da")=0.32. In that (constructed) case, it would be better not to consider the sentence as English. This is what happened in the example you cite.
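To make the thresholding idea concrete, here is a minimal sketch, not the actual implementation. The function name `is_confident` and the `min_prob` and `min_margin` values are assumptions chosen for illustration. It takes (language, probability) pairs as plain data; with fastText these could be built from the labels and probabilities returned by `model.predict(sentence, k=3)`.

```python
from typing import Sequence, Tuple


def is_confident(
    predictions: Sequence[Tuple[str, float]],
    target: str,
    min_prob: float = 0.5,    # hypothetical threshold, not a value from the maintainers
    min_margin: float = 0.2,  # hypothetical gap required over the runner-up
) -> bool:
    """Return True only if `target` is the top language and LID is clearly sure."""
    if not predictions:
        return False
    # Rank candidate languages from most to least likely.
    ranked = sorted(predictions, key=lambda pair: pair[1], reverse=True)
    top_lang, top_prob = ranked[0]
    runner_up_prob = ranked[1][1] if len(ranked) > 1 else 0.0
    return (
        top_lang == target
        and top_prob >= min_prob
        and (top_prob - runner_up_prob) >= min_margin
    )


# The constructed case from above: "en" wins, but only barely, so it is rejected.
ambiguous = [("en", 0.34), ("de", 0.33), ("da", 0.32)]
clear = [("en", 0.97), ("de", 0.02)]
print(is_confident(ambiguous, "en"), is_confident(clear, "en"))
```

Requiring both a minimum probability and a margin over the runner-up is one possible design; a single probability threshold would also reject the constructed example.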
The WMT 2020 evaluation provides the WikiMatrix bitexts with an additional language identification pass (by the langid tool). You may want to consider this version of the bitexts.
Great, thanks! Would be useful to have this added to the documentation. I reckon you're referring to this dataset: http://data.statmt.org/wmt20/translation-task/WikiMatrix/ ?
After toying around with some of the data, I found that a fair amount of it is in the wrong language. As an example, here are the top 100 lines from the German/English data (de-en.tsv). It seems that there is quite a bit of dialogue data, where people narrate things in other languages, so some sentences on the English side are actually German, and so on.
After some preliminary research using langdetect from Google, it appears that about 7% of the lines contain text in the wrong language.
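For anyone who wants to run a similar check, a rough sketch of such an estimate follows. It is not the exact script behind the 7% figure. The function name `wrong_language_rate` is made up for illustration, and the detector is passed in as a parameter so that any tool can be plugged in, for example `langdetect.detect`. The sketch assumes the sentence pairs have already been read from the TSV file.

```python
from typing import Callable, Iterable, Tuple


def wrong_language_rate(
    pairs: Iterable[Tuple[str, str]],
    src_lang: str,
    tgt_lang: str,
    detect: Callable[[str], str],  # any function mapping text to a language code
) -> float:
    """Fraction of sentence pairs where either side is not in the expected language."""
    total = 0
    wrong = 0
    for src_text, tgt_text in pairs:
        total += 1
        # A pair counts as wrong if at least one side is detected as another language.
        if detect(src_text) != src_lang or detect(tgt_text) != tgt_lang:
            wrong += 1
    return wrong / total if total else 0.0


# Stand-in detector for demonstration only; a real run would use an actual LID tool.
toy_labels = {"Guten Morgen": "de", "Good morning": "en", "Danke": "de", "Thanks": "en"}
toy_pairs = [
    ("Guten Morgen", "Good morning"),
    ("Danke", "Thanks"),
    ("Good morning", "Good morning"),  # source side is English, not German
    ("Danke", "Danke"),                # target side is German, not English
]
print(wrong_language_rate(toy_pairs, "de", "en", toy_labels.get))
```

Short sentences are hard for any language identifier, so a number obtained this way is an estimate rather than an exact error rate.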