Add support to fasttext for language detection

kirianguiller commented 3 years ago

Hello everyone, I would like to contribute to your project by adding fasttext support for language detection.

Indeed, compared to langid, cld, and langdetect, fasttext has two big advantages:

It's much more accurate (cf. this Medium article; I also double-checked on the Tatoeba corpus to be sure).
It's much faster (cf. the same Medium article), by a factor of more than 100.

The speed gain is huge because language detection is the biggest bottleneck among all the filters in OpusFilter. For instance, for 8 million Korean-English sentences, filtering went from many hours to less than 10 minutes on my machine. And the quality is better.
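For illustration, here is a minimal sketch of what fasttext-based language detection for a single sentence could look like. It is not the code from this PR: the helper name `predict_lang` and the `"und"` fallback are hypothetical, and the model path in the usage comment is a placeholder. It relies only on the `predict` method of a loaded fasttext model, which returns a tuple of labels and a sequence of probabilities.

```python
from typing import Tuple

LABEL_PREFIX = "__label__"  # default label prefix used by fasttext


def predict_lang(model, sentence: str) -> Tuple[str, float]:
    """Return (language code, confidence) for the top prediction.

    `model` is anything with a fasttext-style predict(text, k=...) method.
    """
    # fasttext's predict() rejects text containing newlines, so flatten them
    cleaned = sentence.replace("\n", " ").strip()
    labels, probs = model.predict(cleaned, k=1)
    if not labels:
        # Hypothetical fallback for "no prediction"
        return "und", 0.0
    label = labels[0]
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label, float(probs[0])


# Usage with the real library (assumes a language ID model file is available):
#   import fasttext
#   model = fasttext.load_model("path/to/langid_model.bin")  # placeholder path
#   lang, confidence = predict_lang(model, "This is an English sentence.")
```

Loading the model once and reusing it for every sentence is what keeps the per-sentence cost low.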

Is there anything I should add to improve the chances of this PR being accepted? Some tests? Some documentation?

Thanks! Kirian

svirpioj commented 3 years ago

Thanks for contributing! I have had problems (both on accuracy and efficiency) with the langid and cld2 libraries, too, so this sounds great.

First, please rebase the pull request on the develop branch. All future enhancements will be added through it.

Some unit tests would be good, and tests/test_filters.py would be a suitable place for them. Of course, a problem is that this requires a trained model, which is apparently not directly provided by the fasttext library. The preferred way would be to train a tiny model with some dummy data (cf. test_lm_filter and test_wordalign_filter).
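As a rough sketch of the "tiny model with dummy data" idea, assuming the fasttext Python package is installed in the test environment: the helper names, the sample sentences, and the hyperparameter values below are all made up for illustration. The training file uses the `__label__<lang> <text>` line format that `fasttext.train_supervised` expects.

```python
import os
import tempfile
from typing import Iterable, Tuple


def write_training_file(samples: Iterable[Tuple[str, str]], path: str) -> None:
    """Write (language, sentence) pairs in fasttext's supervised format."""
    with open(path, "w", encoding="utf-8") as fobj:
        for lang, sentence in samples:
            fobj.write("__label__{} {}\n".format(lang, sentence))


def train_tiny_model(train_path: str):
    """Train a small throwaway classifier (values chosen only to keep it fast)."""
    import fasttext  # imported lazily so the data helper works without it
    return fasttext.train_supervised(
        input=train_path, dim=8, epoch=25, minCount=1, verbose=0)


# Dummy data, repeated so the model has something to fit
DUMMY_SAMPLES = [
    ("en", "this is an english sentence"),
    ("fi", "tämä on suomenkielinen lause"),
] * 20

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        train_path = os.path.join(tmpdir, "train.txt")
        write_training_file(DUMMY_SAMPLES, train_path)
        model = train_tiny_model(train_path)
        # Save the model so a test can pass its path to the filter
        model.save_model(os.path.join(tmpdir, "model.bin"))
```

A test could do this in `setUpClass` and pass the saved model path to the filter, so no pretrained model has to be downloaded.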

The new options to LanguageIDFilter should be added to README.md.

Suggested code changes:

Please follow PEP 8. (Some whitespace is missing in fasttext_predict_lang.)
For clarity, the fasttext_predict_lang method would be better inside LanguageIDFilter as a private method.
ConfigurationError should be raised if fasttext is selected as a method and fasttext_model_path is not defined (or vice versa).
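The option check in the last point could look roughly like the sketch below. `ConfigurationError` is defined here only as a stand-in so the snippet is self-contained, and the function name `check_langid_options` and the parameter name `id_method` are hypothetical; only `fasttext_model_path` comes from this PR.

```python
from typing import Optional


class ConfigurationError(Exception):
    """Stand-in for the project's own ConfigurationError."""


def check_langid_options(id_method: str,
                         fasttext_model_path: Optional[str]) -> None:
    """Raise ConfigurationError on inconsistent fasttext options."""
    if id_method == "fasttext" and not fasttext_model_path:
        raise ConfigurationError(
            "fasttext selected as method but fasttext_model_path is not set")
    if id_method != "fasttext" and fasttext_model_path:
        raise ConfigurationError(
            "fasttext_model_path is set but the method is not fasttext")
```

Running the check in the filter's constructor would surface a misconfiguration before any data is processed.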

Let me know if you have trouble with these.

svirpioj commented 3 years ago

@kirianguiller, do you plan to continue with this? We can try to assign the work to someone else if you are not able to.

kirianguiller commented 3 years ago

Oops, sorry, I forgot about this PR. I will work on it in the coming days. Thanks for the reminder!

kirianguiller commented 3 years ago

Closing this PR in favor of #14.

Helsinki-NLP / OpusFilter

Add support to fasttext for language detection #12