apertium / apertium-apy

📦 Apertium HTTP Server in Python
https://wiki.apertium.org/wiki/Apertium-apy
GNU General Public License v3.0

Option to use fasttext for language identification #207

Closed – unhammer closed this 1 year ago

unhammer commented 1 year ago

The current language identification on the web site is slow and often wrong. Fasttext is pretty good (and could be made even better if we train on the languages we serve – we just need to gather some text in each language) – and it's fast.
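For reference, prediction with the fasttext Python bindings looks roughly like this (a sketch only – the model filename is a placeholder for whatever model APy would be started with):

import fasttext

# Load a quantized language-identification model (filename is a placeholder).
model = fasttext.load_model('lid.prod.ftz')

# predict() returns the top-k labels and their confidence scores.
labels, scores = model.predict('Jeg vet ikke hva dette er.', k=3)
print(list(zip(labels, scores)))  # labels look like '__label__nob' etc., scores are probabilities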

Further things we could do:

coveralls commented 1 year ago

Coverage Status

Coverage: 50.657% (-0.8%) from 51.427% when pulling dac06c9bf5be4f673997b908e9ba7b5b0be93fc8 on fasttext into 5e4f5f9b7aded0571ed06068d169d3bb2d21e235 on master.

unhammer commented 1 year ago

As a default for lang_detect/full replacing cld, or as a new option? When trying now, I'm not even able to install cld with pip – it gives lots of build failures compiling the cbits (cld-errors.txt) – and on checking torro, it's not even using CLD, which I guess means it's on the old coverage-based version 😅

I guess if we can't even run CLD, that's all the more reason to switch the default, though maybe I'm just missing something obvious?

In any case, the server currently spends about 3s on language detection of a fairly short paragraph, and with the coverage method that time grows with the number of languages installed (on my laptop, coverage with 8 monolinguals takes 1s per request). Coverage might be more accurate on longer texts (especially since it's restricted to the pairs we have), but on shorter texts it often gets the language family completely wrong (whereas fasttext errors tend to at least be neighbouring languages). If we switch to a faster method, we could even consider suggesting different source languages (without the button being clicked) the way Google etc. do.

unhammer commented 1 year ago

@TinoDidriksen can we include the lid.prod.ftz file in /usr/share/apertium/ in the .deb?

TinoDidriksen commented 1 year ago

can we include the lid.prod.ftz file in /usr/share/apertium/ in the .deb?

Add it to https://github.com/apertium/apertium-apy/blob/master/setup.py#L70
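Something like the following should do it (a sketch only – the exact entry and target path in apertium-apy's setup.py may differ):

from setuptools import setup

setup(
    name='apertium-apy',   # the real setup.py has more metadata; elided here
    version='0.0.0',       # placeholder
    # Install the model under <prefix>/share/apertium/, so the .deb ends up
    # shipping it as /usr/share/apertium/lid.prod.ftz.
    data_files=[('share/apertium', ['lid.prod.ftz'])],
)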

sushain97 commented 1 year ago

As a default for lang_detect/full replacing cld or new option? When trying now, I'm not able to even install cld with pip, it's giving lots of build failures compiling the cbits (

I was thinking as a new option but perhaps at this point it's worth ditching cld2 entirely and bumping a major version?

unhammer commented 1 year ago

I wouldn't mind :)

Though it would be nice to first flesh out the corpora with at least the release languages. I've trained on up to 100k lines from the Opus-100 set, but some languages weren't covered there. @ftyers @jonorthwash is there a good collection covering these?:

(I'd prefer something with a URL I can wget in the script if possible.)

flammie commented 1 year ago

I made a list of corpus sizes on my GitHub when researching for an article: https://github.com/flammie/flammie.github.io/blob/main/languages-by-resources.markdown – this is all scripted with the OPUS API and Wikimedia dumps, so it might have a few more than opus-100? Here's a script; it just counts sizes from HTTP headers, but changing it to fetch the data would be easy.
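(In outline, counting sizes from HTTP headers is something like this – a sketch with a hypothetical URL; the real script handles more cases:)

import requests

def corpus_size_bytes(url: str) -> int:
    # Ask the server for Content-Length without downloading the file.
    response = requests.head(url, allow_redirects=True, timeout=30)
    response.raise_for_status()
    return int(response.headers.get('Content-Length', 0))

print(corpus_size_bytes('https://example.org/some-corpus.txt.gz'))  # hypothetical URL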

For sme there's a large corpus somewhere in the SVN :-)

jonorthwash commented 1 year ago

For crh and kaz, there are wikipedia corpora.

As part of this PR, we need documentation for how to add new languages.

I also want to be sure it'll work out of the box with no to minimal training—i.e., if someone's just setting something up for internal purposes using a non-public language module and doesn't need language detection, do they still have to do something special?

unhammer commented 1 year ago

Training is really simple, you just need a big file like

$ head train 
__label__mlt formalitajiet qabel il - ħruġ ta '  merkanzija
__label__zho 我只想知道你叫什么
__label__zho 9 .  秘书长报告 ( a / 56 / 800第35至43段 ) 讨论了联合国行政法庭规约与劳工组织行政法庭规约之间的差异 。 
__label__afr databasise
__label__bul климент охридски  " в скопие ,  македония . 
__label__zho 反对 :  无 。 
__label__spa adam llega a la presión . 
__label__isl annađ hvort hagar hann sér vel eđa ég mala hann mélinu smærra . 
__label__pol nie wcześniej . 
__label__dan -  turen er kun for mænd . 

and run fasttext supervised on it. The scripts I included turn opus-100 into this format (restricted to the langs from APy) and train on that, but I can try to make a more general script – yeah, that'd be nice to have.
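Roughly, training and quantizing with the Python bindings looks like this (a sketch – the hyperparameters the included scripts actually use may differ):

import fasttext

# 'train' is a file of __label__xxx <text> lines as shown above.
model = fasttext.train_supervised(input='train', epoch=25, lr=0.5, minn=2, maxn=4)

# Quantize to get the small .ftz model, then save it.
model.quantize(input='train', retrain=True)
model.save_model('lid.prod.ftz')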

jonorthwash commented 1 year ago

But can APy still be used without training?

unhammer commented 1 year ago

APy can still be used without training. If you don't pass a fasttext model when starting, it will fall back to the coverage-based method.

unhammer commented 1 year ago

Merging for now, since it's more usable with this than without and I don't know when I'll have time to do the rest of the things I'd like to include; I'll make issues for the remaining ones.

(I don't know what that 3.6.15 test is – I can't find a reference to that number in the code – but the regular tests passed, at least …)

unhammer commented 1 year ago

Enabled on beta.apertium.org – where curl beta.apertium.org/identifyLang previously took 7s on a short snippet, it now returns in <0.3s.
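For reference, the endpoint can also be queried from Python (a sketch – this assumes the usual APy q parameter and a JSON response for /identifyLang):

import requests

# Ask the beta server which language a short snippet is in.
resp = requests.get('https://beta.apertium.org/identifyLang',
                    params={'q': 'Dette er en kort tekst.'})
print(resp.json())  # expected: a mapping of language codes to confidence scores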