refactor language identification functionality into a subpackage named lang_id, and remove the old top-level module lang_utils; the user-facing API is largely unchanged, but the internals are all new
replace the simple language identification pipeline built with sklearn with a more complex and more accurate model built with thinc, which now much more closely resembles Google's CLD3
add quick-and-dirty training functionality for the new thinc-based model
Motivation and Context
Models/pipelines built and trained in scikit-learn don't necessarily work in the next version, and users rightfully worry about the warning messages saying as much. Since sklearn releases quite regularly, having to keep releasing updated language identification pipelines is a significant maintenance burden. Furthermore, sklearn doesn't offer much in the way of neural network models, so the lang id pipeline was necessarily old-school (bag-of-words!) and limited.
On the other hand, thinc is relatively stable, already bundled tightly with spaCy, and allows for the construction of advanced neural networks. It proved to be A LOT of work to get everything working, but the end result is a definite improvement!
How Has This Been Tested?
all tests pass, and the lang id results look sensible, too
Screenshots (if appropriate):
Types of changes
[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[x] Breaking change (fix or feature that would cause existing functionality to change)
Checklist:
[x] My code follows the code style of this project.
[x] My change requires a change to the documentation, and I have updated it accordingly.