bartvm / nmt

Neural machine translation
MIT License

Filtering out incorrect sentence pairs by language detection #62

Open JinseokNam opened 8 years ago

JinseokNam commented 8 years ago

Due to noise in the corpus, especially CommonCrawl, some sentence pairs can be considered very bad for training models.

This script attempts to detect the language of each side of a sentence pair. If the detection probability on either side is lower than 0.9, the pair is filtered out of the corpus. Here are example sentences that get filtered out.

Failed to detect the languages for the following pair
  SRC: ( 1 )
  TRG: ( 1 )
Failed to detect the languages for the following pair
  SRC: .
  TRG: .
Failed to detect the languages for the following pair
  SRC: Frau Präsidentin !
  TRG: .
Failed to detect the languages for the following pair
  SRC: ይህ ገጽ መጨረሻ የተቀየረው እ.ኣ.አ በ12 : 18 ፣ 16 ኦገስት 2009 ዓ.ም.
  TRG: ይህ ገጽ መጨረሻ የተቀየረው እ.ኣ.አ በ14 : 53 ፣ 27 ኦክቶበር 2009 ዓ.ም.

This preprocessing step left us with 3,858,225 sentence pairs out of 4,175,306.
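For reference, here's a minimal sketch of what the filter does (the script mentioned above isn't reproduced here; `keep_pair` and the de/en defaults are just for illustration, and 0.9 is the threshold described above):

```python
# Minimal sketch of the language-based pair filter, using the
# langdetect package (pip install langdetect).
from langdetect import detect_langs, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # langdetect is non-deterministic by default
THRESHOLD = 0.9           # probability cutoff used above

def keep_pair(src, trg, src_lang="de", trg_lang="en"):
    """Keep a pair only if both sides are detected as the expected
    language with probability >= THRESHOLD."""
    try:
        src_probs = {r.lang: r.prob for r in detect_langs(src)}
        trg_probs = {r.lang: r.prob for r in detect_langs(trg)}
    except LangDetectException:
        # Raised for inputs with no usable features, e.g. "( 1 )" or "."
        return False
    return (src_probs.get(src_lang, 0.0) >= THRESHOLD and
            trg_probs.get(trg_lang, 0.0) >= THRESHOLD)
```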

bartvm commented 8 years ago

Any reason to prefer this over the more usual methods in MT, e.g. Axelrod's approach?

JinseokNam commented 8 years ago

The reason is just to remove sentence pairs that are very noisy in terms of language, regardless of their content.

In the CommonCrawl corpus, you will see the following sentence at line 8 of the common-crawl.de-en.de file.

ACDSee 9 Photo Manager Organize your photos. Share your world.

Though the file is supposed to contain only German sentences, sentences in other languages sometimes appear.

Let's see its pair in the common-crawl.de-en.en file.

Translator Internet is a Toolbar for MS Internet Explorer.

It's a completely ridiculous sentence pair, and I want to get rid of this sort of pair.
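Concretely, since the two files are line-aligned, the filtering pass boils down to something like this (a hypothetical driver reusing the `keep_pair` sketch from my first comment; the output file names are made up):

```python
# Walk the line-aligned de/en files in lockstep and keep only the
# pairs that pass the language check on both sides.
with open("common-crawl.de-en.de") as f_de, \
     open("common-crawl.de-en.en") as f_en, \
     open("filtered.de-en.de", "w") as out_de, \
     open("filtered.de-en.en", "w") as out_en:
    for de_line, en_line in zip(f_de, f_en):
        if keep_pair(de_line.strip(), en_line.strip()):
            out_de.write(de_line)
            out_en.write(en_line)
```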

It seems that the approach you've linked also takes the content of sentence pairs into account using language models, which is definitely worth trying out.

bartvm commented 8 years ago

Yeah, I understand the motivation. At the very least it should speed up training. Some articles have actually shown pretty decent improvements in BLEU score from data selection, although I think NMT is likely to be less susceptible to this kind of noise than SMT, so I don't know whether we'll see any difference in the final BLEU scores.

What I was wondering was more whether this particular package, langdetect, is the right way to go. There are methods out there that were developed specifically for data selection in SMT, so perhaps it's better to go down that road? On the other hand, this is easy to implement and maybe does just as well? (I'm not sure what method langdetect uses internally.)
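For context, Axelrod's method (Axelrod et al., 2011) selects data by bilingual cross-entropy difference: each pair (s, t) is scored roughly as

    score(s, t) = [H_in-src(s) − H_gen-src(s)] + [H_in-trg(t) − H_gen-trg(t)]

where H_in and H_gen are per-word cross-entropies under an in-domain and a general-domain language model, and lower scores indicate more in-domain pairs. That targets domain relevance rather than just language mismatch, if I remember the paper correctly.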

JinseokNam commented 8 years ago

I agree with your point that it might be better to use tools developed for that purpose in the SMT community. The current approach might be the simplest in terms of implementation, and we could switch to a better one later.

The author of the package claims its accuracy is around 99% and that it uses naive Bayes. Language detection is a very easy task, so naive Bayes might be a strong enough method :smile:
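The underlying idea is simple enough to sketch; here's a toy character-trigram naive Bayes language identifier (just to illustrate the principle, not langdetect's actual implementation):

```python
# Toy naive Bayes language ID over character trigrams, with add-one
# smoothing and a uniform prior over languages.
import math
from collections import Counter

def trigrams(text):
    return [text[i:i + 3] for i in range(len(text) - 2)]

class ToyLangID:
    def __init__(self):
        self.counts = {}  # lang -> Counter over trigram frequencies

    def train(self, lang, text):
        self.counts.setdefault(lang, Counter()).update(trigrams(text))

    def classify(self, text):
        # Pick the language maximizing the smoothed log-likelihood.
        best_lang, best_score = None, -math.inf
        for lang, c in self.counts.items():
            total, vocab = sum(c.values()), len(c)
            score = sum(math.log((c[g] + 1) / (total + vocab))
                        for g in trigrams(text))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang
```

With enough training text per language, even this crude model separates languages quite well, which is presumably why the package gets away with naive Bayes.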

Anyway, let me look for some alternatives, or let me know if you have better solutions.