Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License
102 stars 18 forks source link

Add support to fasttext for language detection (Develop branch) #14

Closed kirianguiller closed 3 years ago

kirianguiller commented 3 years ago

Hi everyone, here below my original PR for adding implementation of fasttext language detection (here link to original PR #12 ). I rebased it on the develop branch and apply some changes requested by @svirpioj .

However, I still didn't implement the tests, as there are no documentation on how to run them properly. @svirpioj , what is the procedure to run tests ?

Thanks !

Original PR message (#12 )


Hello everyone, I would like to contribute to your project by adding the support of fasttext for language detection.

Indeed, compares to langid, cld and langdetect, fasttext has 2 big advantages :

The speed gain is huge because the language detection filtering is the biggest bottleneck between all filters you have in OpusFilter. For instance, for 8 millions of korean_english sentences, the filtering went from many hours to less than 10 minutes on my machine. And the quality is better.

Is there any things I should add to have more chance to have this PR accepted ? Some test ? Some documentation ?

Thanks ! Kirian

svirpioj commented 3 years ago

However, I still didn't implement the tests, as there are no documentation on how to run them properly. @svirpioj , what is the procedure to run tests ?

I'm currently using pytest (e.g. python -m pytest tests/ in the project root). Also nosetests should work, if you have VariKN and eflomal set up as instructed in the README - pytest is able to skip the respective tests if not.

(By the way, do not care about the last GitHub checks failing - it's something that shouldn't be run for PRs, but I haven't yet had time to fix that.)

svirpioj commented 3 years ago

(By the way, do not care about the last GitHub checks failing - it's something that shouldn't be run for PRs, but I haven't yet had time to fix that.)

You can now rebase to develop to fix this.

kirianguiller commented 3 years ago

Hey @svirpioj , I've added a simple test for fasttext prediction, as well as a model folder (in tests/model) to store a "compressed version of a fasttext model" of only 1mb.

Edit : I just added a new test that is failing for fasttext. I will fix this this afternoon. <= It was the new test that had a problem, I fixed it and then commited it.

svirpioj commented 3 years ago

I finished this up in https://github.com/Helsinki-NLP/OpusFilter/pull/20