chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Replace sklearn lang id with thinc version #326

Closed bdewilde closed 3 years ago

bdewilde commented 3 years ago

Description

Motivation and Context

Models and pipelines built and trained in scikit-learn aren't guaranteed to work in the next version, and users rightfully worry about the warning messages saying as much. Since sklearn releases quite regularly, keeping the language identification pipeline usable means re-releasing it again and again, which is a significant maintenance burden. Furthermore, sklearn doesn't offer much in the way of neural network models, so the lang id pipeline was necessarily old-school (bag-of-words!) and limited.

On the other hand, thinc is relatively stable, already bundled tightly with spaCy, and allows for the construction of advanced neural networks. It proved to be A LOT of work to get everything working, but the end result is a definite improvement!

How Has This Been Tested?

All tests pass, and the lang id results look sensible, too.
