Closed kirianguiller closed 3 years ago
Thanks for contributing! I have had problems (both on accuracy and efficiency) with the langid and cld2 libraries, too, so this sounds great.
First, please rebase the pull request on the develop branch. All future enhancements will be added through it.
Some unit tests would be good, and tests/test_filters.py would be a suitable place for them. Of course, a problem is that this requires a trained model, and such models are apparently not provided directly by the fasttext library. The preferred way would be to train a tiny model with some dummy data (cf. test_lm_filter and test_wordalign_filter).
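One way the tiny-model idea could look is sketched below. This is only an illustration, not the project's test code: the sentences, the helper names (write_training_file, train_tiny_model, build_test_model) and the hyperparameters are all made up for the example. It assumes the fasttext Python package, whose supervised training reads one example per line in the form `__label__<lang> <text>`.

```python
import os
import tempfile

# Hypothetical dummy data: (language label, sentence) pairs.
DUMMY_SENTENCES = [
    ("en", "this is a short english sentence"),
    ("en", "the cat sat on the mat"),
    ("fr", "ceci est une courte phrase"),
    ("fr", "le chat est sur le tapis"),
]


def write_training_file(path, examples=DUMMY_SENTENCES, repeats=20):
    """Write examples in fastText's supervised format: '__label__<lang> <text>'."""
    with open(path, "w", encoding="utf-8") as fobj:
        for _ in range(repeats):
            for lang, text in examples:
                fobj.write(f"__label__{lang} {text}\n")


def train_tiny_model(train_path, model_path):
    """Train and save a very small classifier (needs the fasttext package)."""
    import fasttext  # imported here so the data helper works without it

    # Small dimension and few epochs: the model only has to exist, not be good.
    model = fasttext.train_supervised(
        input=train_path, dim=10, epoch=20, minCount=1, verbose=0
    )
    model.save_model(model_path)
    return model_path


def build_test_model():
    """Create a temporary directory holding dummy data and a trained model."""
    tmpdir = tempfile.mkdtemp()
    train_path = os.path.join(tmpdir, "train.txt")
    write_training_file(train_path)
    return train_tiny_model(train_path, os.path.join(tmpdir, "tiny.bin"))
```

A test case could call build_test_model() in its setup and pass the returned path as fasttext_model_path. A model this small is only meant to exercise the code path, so assertions on its predictions should be kept loose.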
The new options to LanguageIDFilter should be added to README.md.
Suggested code changes:
- […] fasttext_predict_lang.)
- The fasttext_predict_lang method would be better inside LanguageIDFilter as a private method.
- ConfigurationError should be raised if fasttext is selected as a method and fasttext_model_path is not defined (or vice versa).
Let me know if you have trouble with these.
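To make the last two suggestions concrete, here is a rough sketch. It is a simplified stand-in, not the actual OpusFilter class: the ConfigurationError defined here and the option name id_method are assumptions for illustration, and only fasttext_model_path and the two class/method names come from the discussion above. The fasttext calls used (load_model, predict) are from the fasttext Python package.

```python
class ConfigurationError(Exception):
    """Stand-in for the project's configuration error (defined here for the sketch)."""


class LanguageIDFilter:
    """Simplified stand-in; the real filter takes more options."""

    def __init__(self, id_method="langid", fasttext_model_path=None):
        # Reject inconsistent settings in both directions.
        if id_method == "fasttext" and not fasttext_model_path:
            raise ConfigurationError(
                "id_method 'fasttext' requires fasttext_model_path")
        if id_method != "fasttext" and fasttext_model_path:
            raise ConfigurationError(
                "fasttext_model_path is only valid with id_method 'fasttext'")
        self.id_method = id_method
        self.fasttext_model = None
        if id_method == "fasttext":
            import fasttext  # lazy import: other methods do not need it
            self.fasttext_model = fasttext.load_model(fasttext_model_path)

    def _fasttext_predict_lang(self, text):
        """Return (language_code, confidence) for a single sentence."""
        # predict() handles one line at a time, so newlines are replaced.
        labels, probs = self.fasttext_model.predict(
            text.replace("\n", " "), k=1)
        # Labels come back as e.g. "__label__en"; strip the prefix.
        return labels[0][len("__label__"):], float(probs[0])
```

Doing the check in the constructor means a bad configuration fails before any data is read, rather than on the first sentence.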
@kirianguiller, do you plan to continue with this? We can try to assign the work to someone else if you are not able to.
Oops, sorry, I forgot about this PR. I will work on it in the coming days. Thanks for the reminder!
closing PR for the sake of #14
Hello everyone, I would like to contribute to your project by adding fasttext support for language detection.
Indeed, compared to langid, cld and langdetect, fasttext has two big advantages: it is much faster and more accurate.
The speed gain is huge because language detection filtering is the biggest bottleneck among all the filters you have in OpusFilter. For instance, for 8 million korean_english sentences, the filtering went from many hours to less than 10 minutes on my machine. And the quality is better.
Is there anything I should add to improve the chances of this PR being accepted? Some tests? Some documentation?
Thanks! Kirian