Description
Added a built-in language identification classifier, inspired by Google's CLD3 and implemented with scikit-learn
added a lang_utils.LangIdentifier() class that automatically downloads model data from the textacy-data repo; its core method, .identify_lang(), is also exposed as a module-level function for general usage, and the underlying pipeline is available to power users who want to take advantage of scikit-learn features directly
added the lang identifier as a resource downloadable via textacy's CLI
added a script to fetch language-specific snippets from Wikipedia APIs for building up a high-quality training dataset
dropped the dependency on cld2-cffi and all associated functionality / documentation
bumped the minimum scikit-learn version from 0.17 to 0.18 out of necessity
Added a function for normalizing unicode, preprocessing.normalize_unicode(), as a less powerful, stdlib-only replacement for the ftfy-powered fix_bad_unicode(), which is now properly flagged as not implemented
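For reviewers unfamiliar with stdlib-only normalization, here is a minimal sketch of what it can look like. It is not the code in this PR: the function name normalize_unicode_sketch and its form parameter are assumptions made for illustration, and the only library call used is the stdlib's unicodedata.normalize().

```python
import unicodedata

# Hypothetical helper for illustration only; NOT textacy's implementation.
VALID_FORMS = {"NFC", "NFD", "NFKC", "NFKD"}


def normalize_unicode_sketch(text: str, form: str = "NFC") -> str:
    """Return ``text`` normalized to the given Unicode normalization form."""
    if form not in VALID_FORMS:
        raise ValueError(f"unknown normalization form: {form!r}")
    return unicodedata.normalize(form, text)


# "e" followed by a combining acute accent: 5 code points in total
decomposed = "cafe\u0301"
# NFC composes the pair into a single precomposed character: 4 code points
composed = normalize_unicode_sketch(decomposed)
print(len(decomposed), len(composed))
```

Normalization of this kind only unifies equivalent code point sequences. It does not repair mis-decoded text (mojibake), which is the sense in which it is less powerful than an ftfy-based fix.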
Test dataset performance is pretty darn good:
lang precision recall f1-score support
ab 0.92 1.00 0.96 47
af 0.95 0.98 0.96 829
am 1.00 1.00 1.00 402
an 0.95 0.96 0.96 402
ar 1.00 1.00 1.00 3229
av 0.97 0.97 0.97 400
ay 0.94 0.97 0.95 400
az 0.98 0.97 0.97 847
ba 0.95 0.99 0.97 427
be 0.98 0.99 0.98 1630
bg 0.97 0.97 0.97 3424
bn 0.98 0.98 0.98 367
br 1.00 0.97 0.98 1047
bs 0.62 0.74 0.68 439
ca 0.97 0.93 0.95 1284
ce 1.00 0.99 1.00 224
ch 0.93 0.83 0.88 30
co 1.00 0.99 0.99 400
cs 0.97 0.99 0.98 3524
cv 0.99 0.98 0.98 550
cy 0.99 0.99 0.99 479
da 0.93 0.94 0.94 3731
de 0.99 0.99 0.99 3731
el 1.00 1.00 1.00 3731
en 0.99 0.99 0.99 3731
eo 0.98 0.99 0.98 3731
es 0.95 0.97 0.96 3731
et 0.97 0.96 0.96 707
eu 0.98 1.00 0.99 1209
fa 0.98 0.99 0.99 400
fi 1.00 0.99 1.00 3731
fo 0.98 0.96 0.97 425
fr 0.97 0.99 0.98 3731
fy 0.99 0.95 0.97 418
ga 1.00 0.97 0.98 621
gd 0.98 0.99 0.99 532
gl 0.91 0.88 0.90 1025
gn 1.00 0.97 0.98 495
gv 1.00 0.99 1.00 277
ha 0.99 1.00 1.00 347
he 1.00 1.00 1.00 3731
hi 1.00 1.00 1.00 1264
hr 0.72 0.49 0.58 995
ht 0.94 0.95 0.95 217
hu 1.00 0.99 1.00 3731
hy 1.00 0.99 1.00 538
ia 0.98 0.96 0.97 3731
id 0.97 0.97 0.97 2308
ie 0.92 0.93 0.93 763
ig 1.00 0.99 0.99 289
io 0.88 0.98 0.93 1041
is 0.99 0.99 0.99 1997
it 0.99 0.98 0.98 3731
ja 1.00 1.00 1.00 3731
jv 0.97 0.97 0.97 437
ka 1.00 1.00 1.00 474
kg 1.00 0.97 0.98 88
ki 1.00 1.00 1.00 125
kk 1.00 0.99 0.99 792
kl 1.00 0.99 1.00 227
km 0.99 1.00 1.00 117
kn 1.00 0.96 0.98 28
ko 1.00 1.00 1.00 504
ku 1.00 0.98 0.99 462
kv 0.99 0.96 0.98 400
kw 1.00 0.97 0.99 462
ky 1.00 0.98 0.99 414
la 0.96 0.98 0.97 3731
lb 0.98 0.97 0.97 410
lg 1.00 0.99 0.99 400
li 1.00 0.98 0.99 400
ln 0.98 0.90 0.94 243
lt 1.00 0.99 0.99 3731
lv 0.99 0.99 0.99 400
mg 1.00 1.00 1.00 403
mi 1.00 0.99 0.99 457
mk 0.96 0.98 0.97 3731
ml 1.00 0.98 0.99 128
mn 1.00 0.99 0.99 475
mr 1.00 1.00 1.00 3731
ms 0.83 0.85 0.84 400
mt 0.99 1.00 0.99 417
nb 0.88 0.84 0.86 1937
nl 0.98 0.99 0.98 3731
nn 0.92 0.89 0.91 549
no 0.90 0.94 0.92 400
nv 1.00 1.00 1.00 408
oc 0.91 0.94 0.93 1112
os 0.99 0.98 0.99 388
pa 1.00 0.84 0.91 25
pl 1.00 1.00 1.00 3731
ps 1.00 0.99 0.99 404
pt 0.98 0.96 0.97 3731
qu 1.00 0.94 0.97 385
rm 0.93 0.97 0.95 403
rn 0.93 0.87 0.90 87
ro 0.99 0.99 0.99 2958
ru 0.98 0.98 0.98 3731
rw 0.98 0.98 0.98 244
sc 0.99 0.97 0.98 401
sd 1.00 1.00 1.00 400
se 0.99 0.98 0.98 208
sk 0.99 0.85 0.91 731
sl 0.92 0.93 0.92 500
sn 1.00 0.99 0.99 402
so 1.00 1.00 1.00 401
sq 1.00 0.98 0.99 550
sr 0.89 0.93 0.91 3731
su 1.00 0.98 0.99 403
sv 0.98 0.99 0.98 3731
sw 1.00 0.99 0.99 400
ta 1.00 0.96 0.98 50
te 1.00 0.97 0.98 29
tg 0.98 0.98 0.98 178
th 1.00 0.96 0.98 92
tk 0.96 0.97 0.97 638
tl 0.99 1.00 1.00 2408
to 1.00 1.00 1.00 401
tr 0.99 1.00 0.99 3731
tt 0.99 0.98 0.99 2376
ty 1.00 0.95 0.98 22
ug 1.00 1.00 1.00 1376
uk 0.99 0.99 0.99 3731
ur 1.00 0.99 0.99 652
uz 0.96 0.93 0.95 236
vi 1.00 1.00 1.00 2033
vo 0.99 0.98 0.99 594
wa 1.00 0.99 1.00 403
wo 0.99 0.99 0.99 402
yi 1.00 1.00 1.00 529
yo 1.00 0.91 0.95 45
za 1.00 0.95 0.97 60
zh 0.99 0.99 0.99 400
zu 1.00 1.00 1.00 318
micro avg 0.98 0.98 0.98 165125
macro avg 0.97 0.97 0.97 165125
weighted avg 0.98 0.98 0.98 165125
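For anyone reading the table cold, the sketch below shows how the per-language columns and the macro / weighted averages are conventionally defined. It is a pure-Python illustration on made-up toy labels, not the evaluation code used for this PR; the function names and the toy data are assumptions.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Compute (precision, recall, f1, support) for each label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1  # predicted label gets a false positive
            fn[true] += 1  # true label gets a false negative
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        support = tp[label] + fn[label]  # number of true examples of label
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1, support)
    return report


def macro_and_weighted_f1(report):
    """Macro avg weights every label equally; weighted avg uses support."""
    total = sum(support for *_, support in report.values())
    macro = sum(f1 for _, _, f1, _ in report.values()) / len(report)
    weighted = sum(f1 * s for _, _, f1, s in report.values()) / total
    return macro, weighted


# Toy labels, invented for illustration (not taken from the test dataset).
y_true = ["hr", "hr", "hr", "bs", "sr", "sr"]
y_pred = ["hr", "bs", "sr", "bs", "sr", "sr"]
report = per_class_report(y_true, y_pred)
macro_f1, weighted_f1 = macro_and_weighted_f1(report)
```

Because the macro average counts every language equally, the few weak classes in the table (bs and hr in particular) pull it down more than they pull down the support-weighted average.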
Motivation and Context
Python's OSS language classification packages are weirdly, surprisingly... less-than-awesome. The best package I could find way back when was cld2-cffi, but it's largely unmaintained, the methodology is a bit dated, and it's problematic to install in certain dev environments.
I wanted something that's easy to install, actively maintained, and gives good performance in terms of classification accuracy and speed. Turns out I had to make it myself.
How Has This Been Tested?
Types of changes
[x] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)
Checklist:
[x] My code follows the code style of this project.
[x] My change requires a change to the documentation, and I have updated it accordingly.