chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Include built-in language identification functionality #247

Closed by bdewilde 5 years ago

bdewilde commented 5 years ago

Description

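Test dataset performance is pretty darn good overall (0.98 micro-averaged F1 over 165,125 test examples), though noticeably weaker on a few languages such as bs (0.68 F1) and hr (0.58 F1):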

              precision    recall  f1-score   support

          ab       0.92      1.00      0.96        47
          af       0.95      0.98      0.96       829
          am       1.00      1.00      1.00       402
          an       0.95      0.96      0.96       402
          ar       1.00      1.00      1.00      3229
          av       0.97      0.97      0.97       400
          ay       0.94      0.97      0.95       400
          az       0.98      0.97      0.97       847
          ba       0.95      0.99      0.97       427
          be       0.98      0.99      0.98      1630
          bg       0.97      0.97      0.97      3424
          bn       0.98      0.98      0.98       367
          br       1.00      0.97      0.98      1047
          bs       0.62      0.74      0.68       439
          ca       0.97      0.93      0.95      1284
          ce       1.00      0.99      1.00       224
          ch       0.93      0.83      0.88        30
          co       1.00      0.99      0.99       400
          cs       0.97      0.99      0.98      3524
          cv       0.99      0.98      0.98       550
          cy       0.99      0.99      0.99       479
          da       0.93      0.94      0.94      3731
          de       0.99      0.99      0.99      3731
          el       1.00      1.00      1.00      3731
          en       0.99      0.99      0.99      3731
          eo       0.98      0.99      0.98      3731
          es       0.95      0.97      0.96      3731
          et       0.97      0.96      0.96       707
          eu       0.98      1.00      0.99      1209
          fa       0.98      0.99      0.99       400
          fi       1.00      0.99      1.00      3731
          fo       0.98      0.96      0.97       425
          fr       0.97      0.99      0.98      3731
          fy       0.99      0.95      0.97       418
          ga       1.00      0.97      0.98       621
          gd       0.98      0.99      0.99       532
          gl       0.91      0.88      0.90      1025
          gn       1.00      0.97      0.98       495
          gv       1.00      0.99      1.00       277
          ha       0.99      1.00      1.00       347
          he       1.00      1.00      1.00      3731
          hi       1.00      1.00      1.00      1264
          hr       0.72      0.49      0.58       995
          ht       0.94      0.95      0.95       217
          hu       1.00      0.99      1.00      3731
          hy       1.00      0.99      1.00       538
          ia       0.98      0.96      0.97      3731
          id       0.97      0.97      0.97      2308
          ie       0.92      0.93      0.93       763
          ig       1.00      0.99      0.99       289
          io       0.88      0.98      0.93      1041
          is       0.99      0.99      0.99      1997
          it       0.99      0.98      0.98      3731
          ja       1.00      1.00      1.00      3731
          jv       0.97      0.97      0.97       437
          ka       1.00      1.00      1.00       474
          kg       1.00      0.97      0.98        88
          ki       1.00      1.00      1.00       125
          kk       1.00      0.99      0.99       792
          kl       1.00      0.99      1.00       227
          km       0.99      1.00      1.00       117
          kn       1.00      0.96      0.98        28
          ko       1.00      1.00      1.00       504
          ku       1.00      0.98      0.99       462
          kv       0.99      0.96      0.98       400
          kw       1.00      0.97      0.99       462
          ky       1.00      0.98      0.99       414
          la       0.96      0.98      0.97      3731
          lb       0.98      0.97      0.97       410
          lg       1.00      0.99      0.99       400
          li       1.00      0.98      0.99       400
          ln       0.98      0.90      0.94       243
          lt       1.00      0.99      0.99      3731
          lv       0.99      0.99      0.99       400
          mg       1.00      1.00      1.00       403
          mi       1.00      0.99      0.99       457
          mk       0.96      0.98      0.97      3731
          ml       1.00      0.98      0.99       128
          mn       1.00      0.99      0.99       475
          mr       1.00      1.00      1.00      3731
          ms       0.83      0.85      0.84       400
          mt       0.99      1.00      0.99       417
          nb       0.88      0.84      0.86      1937
          nl       0.98      0.99      0.98      3731
          nn       0.92      0.89      0.91       549
          no       0.90      0.94      0.92       400
          nv       1.00      1.00      1.00       408
          oc       0.91      0.94      0.93      1112
          os       0.99      0.98      0.99       388
          pa       1.00      0.84      0.91        25
          pl       1.00      1.00      1.00      3731
          ps       1.00      0.99      0.99       404
          pt       0.98      0.96      0.97      3731
          qu       1.00      0.94      0.97       385
          rm       0.93      0.97      0.95       403
          rn       0.93      0.87      0.90        87
          ro       0.99      0.99      0.99      2958
          ru       0.98      0.98      0.98      3731
          rw       0.98      0.98      0.98       244
          sc       0.99      0.97      0.98       401
          sd       1.00      1.00      1.00       400
          se       0.99      0.98      0.98       208
          sk       0.99      0.85      0.91       731
          sl       0.92      0.93      0.92       500
          sn       1.00      0.99      0.99       402
          so       1.00      1.00      1.00       401
          sq       1.00      0.98      0.99       550
          sr       0.89      0.93      0.91      3731
          su       1.00      0.98      0.99       403
          sv       0.98      0.99      0.98      3731
          sw       1.00      0.99      0.99       400
          ta       1.00      0.96      0.98        50
          te       1.00      0.97      0.98        29
          tg       0.98      0.98      0.98       178
          th       1.00      0.96      0.98        92
          tk       0.96      0.97      0.97       638
          tl       0.99      1.00      1.00      2408
          to       1.00      1.00      1.00       401
          tr       0.99      1.00      0.99      3731
          tt       0.99      0.98      0.99      2376
          ty       1.00      0.95      0.98        22
          ug       1.00      1.00      1.00      1376
          uk       0.99      0.99      0.99      3731
          ur       1.00      0.99      0.99       652
          uz       0.96      0.93      0.95       236
          vi       1.00      1.00      1.00      2033
          vo       0.99      0.98      0.99       594
          wa       1.00      0.99      1.00       403
          wo       0.99      0.99      0.99       402
          yi       1.00      1.00      1.00       529
          yo       1.00      0.91      0.95        45
          za       1.00      0.95      0.97        60
          zh       0.99      0.99      0.99       400
          zu       1.00      1.00      1.00       318

   micro avg       0.98      0.98      0.98    165125
   macro avg       0.97      0.97      0.97    165125
weighted avg       0.98      0.98      0.98    165125
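
For anyone unfamiliar with the columns: the table above looks like the output format of scikit-learn's `classification_report`, though the description doesn't say how it was produced. Below is a minimal, stdlib-only sketch of how those per-language columns and the macro/weighted averages are defined. The labels are made-up toy data, not the actual test set, and the helper names `per_class_report` and `averages` are hypothetical.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each true label."""
    support = Counter(y_true)      # number of true examples per label
    pred_count = Counter(y_pred)   # number of predictions per label
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    report = {}
    for label in sorted(support):
        # precision: of everything predicted as `label`, how much was right
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        # recall: of everything that truly is `label`, how much was found
        recall = true_pos[label] / support[label]
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1, support[label])
    return report


def averages(report):
    """Return (macro F1, support-weighted F1) over all labels."""
    total = sum(s for _, _, _, s in report.values())
    macro = sum(f1 for _, _, f1, _ in report.values()) / len(report)
    weighted = sum(f1 * s for _, _, f1, s in report.values()) / total
    return macro, weighted


# Toy data (an assumption for illustration): two easily confused labels and one easy one.
y_true = ["hr", "hr", "hr", "hr", "bs", "bs", "en", "en", "en", "en"]
y_pred = ["hr", "hr", "bs", "bs", "bs", "hr", "en", "en", "en", "en"]

report = per_class_report(y_true, y_pred)
macro_f1, weighted_f1 = averages(report)

for label, (p, r, f1, n) in report.items():
    print(f"{label:>12} {p:10.2f} {r:10.2f} {f1:10.2f} {n:10d}")
print(f"{'macro avg':>12} {macro_f1:32.2f}")
print(f"{'weighted avg':>12} {weighted_f1:32.2f}")
```

The macro average treats every language equally, so a few weak classes pull it down more than they pull down the weighted average, which is dominated by the high-support languages.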

Motivation and Context

Python's OSS language classification packages are weirdly, surprisingly... less-than-awesome. The best package I could find way back when was cld2-cffi, but it's largely unmaintained, the methodology is a bit dated, and it's problematic to install in certain dev environments.

I wanted something that's easy to install, actively maintained, and gives good performance in terms of classification accuracy and speed. Turns out I had to make it myself.
