Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License

Add support to fasttext for language detection #12

Closed kirianguiller closed 3 years ago

kirianguiller commented 3 years ago

Hello everyone, I would like to contribute to your project by adding support for fasttext as a language detection backend.

Compared to langid, cld and langdetect, fasttext has two big advantages:

First, the speed gain is huge, and it matters because language detection is the biggest bottleneck among all the filters in OpusFilter. For instance, for 8 million Korean-English sentences, the filtering went from many hours to less than 10 minutes on my machine. Second, the detection quality is better.
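For readers who have not used fasttext for language identification, the idea can be sketched roughly as follows. This is only an illustration, not the code from this PR: the helper names (`load_lid_model`, `detect_language`, `filter_pairs`), the default threshold, and the model path are all assumptions. It relies only on `fasttext.load_model` and `model.predict`, and on the `__label__` prefix that fasttext puts on predicted labels.

```python
from typing import Iterable, Iterator, Tuple

LABEL_PREFIX = "__label__"  # fasttext prepends this to every predicted label


def load_lid_model(path: str):
    """Load a fasttext model; `path` is a placeholder for a local model file."""
    import fasttext  # imported lazily so the helpers below work without it

    return fasttext.load_model(path)


def detect_language(model, sentence: str) -> Tuple[str, float]:
    """Return (language code, confidence) for the top prediction."""
    # predict() handles one line at a time and rejects newline characters.
    labels, probs = model.predict(sentence.replace("\n", " "), k=1)
    return labels[0][len(LABEL_PREFIX):], float(probs[0])


def filter_pairs(
    model,
    pairs: Iterable[Tuple[str, str]],
    src_lang: str,
    tgt_lang: str,
    threshold: float = 0.5,  # hypothetical default, tune per corpus
) -> Iterator[Tuple[str, str]]:
    """Yield only the pairs whose two sides are detected as the expected languages."""
    for src, tgt in pairs:
        src_pred, src_conf = detect_language(model, src)
        tgt_pred, tgt_conf = detect_language(model, tgt)
        if (
            src_pred == src_lang
            and tgt_pred == tgt_lang
            and min(src_conf, tgt_conf) >= threshold
        ):
            yield src, tgt
```

With a language identification model stored locally (for example the publicly distributed `lid.176.bin`; the path is up to you), usage would look like `model = load_lid_model("lid.176.bin")` followed by `kept = list(filter_pairs(model, pairs, "ko", "en"))`.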

Is there anything I should add to improve the chances of this PR being accepted? Some tests? Some documentation?

Thanks! Kirian

svirpioj commented 3 years ago

Thanks for contributing! I have had problems (both on accuracy and efficiency) with the langid and cld2 libraries, too, so this sounds great.

First, please rebase the pull request on the develop branch. All future enhancements will be added through it.

Some unit tests would be good, and tests/test_filters.py would be a suitable place for them. Of course, a problem is that this requires a trained model, and trained models are apparently not provided directly by the fasttext library. The preferred way would be to train a tiny model on some dummy data (cf. test_lm_filter and test_wordalign_filter).
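A rough sketch of what training such a tiny model could look like is given here. It is not taken from the OpusFilter test suite: the helper names, the dummy sentences, and the hyperparameter values are assumptions chosen for illustration. It uses `fasttext.train_supervised`, which reads a text file with one `__label__<lang> <text>` example per line.

```python
from pathlib import Path
from typing import Iterable, Tuple

# Hypothetical dummy data: a few sentences per language is enough for a smoke test.
DUMMY_DATA = [
    ("en", "this is a short english sentence"),
    ("en", "the cat sat on the mat"),
    ("fi", "tämä on lyhyt suomenkielinen lause"),
    ("fi", "kissa istui matolla"),
]


def write_training_file(path: Path, examples: Iterable[Tuple[str, str]]) -> None:
    """Write (language, text) pairs in fasttext's supervised training format."""
    with open(path, "w", encoding="utf-8") as handle:
        for lang, text in examples:
            handle.write(f"__label__{lang} {text}\n")


def train_tiny_model(train_path: Path):
    """Train a very small classifier; the values are illustrative, not tuned."""
    import fasttext  # imported lazily; only needed when actually training

    return fasttext.train_supervised(
        input=str(train_path),
        epoch=25,    # extra passes because the data set is tiny
        dim=8,       # small vectors keep training fast
        minCount=1,  # keep every word, since each occurs only once or twice
        verbose=0,
    )
```

In a unit test, the training file would typically be written to a temporary directory, the model trained in the test setup, and the filter then checked against sentences similar to the dummy data. A model this small is only good for exercising the code path, not for judging detection quality.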

The new options to LanguageIDFilter should be added to README.md.

Suggested code changes:

Let me know if you have trouble with these.

svirpioj commented 3 years ago

@kirianguiller, do you plan to continue with this? We can try to assign the work to someone else if you are not able to.

kirianguiller commented 3 years ago

Oops, sorry, I forgot about this PR. I will work on it in the coming days. Thanks for the reminder!

kirianguiller commented 3 years ago

Closing this PR in favor of #14.