apertium / apertium-apy

📦 Apertium HTTP Server in Python
https://wiki.apertium.org/wiki/Apertium-apy
GNU General Public License v3.0

Option to use fasttext for language identification #207

Closed – unhammer closed this 1 year ago

unhammer commented 1 year ago

The current language identification on the web site is slow and often wrong. Fasttext is pretty good (and could be made even better if we train on the languages we serve – we just need to gather some text in each language) – and it's fast.
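For reference, prediction with the fasttext Python bindings looks roughly like this (a sketch only – the model filename is a placeholder for whatever model APy would be started with):

import fasttext

# Load a quantized language-identification model (filename is a placeholder).
model = fasttext.load_model('lid.prod.ftz')

# predict() returns the top-k labels and their confidence scores.
labels, scores = model.predict('Jeg vet ikke hva dette er.', k=3)
print(list(zip(labels, scores)))  # labels look like '__label__nob' etc., scores are probabilities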

Further things we could do:

coveralls commented 1 year ago

Coverage Status

Coverage: 50.657% (-0.8%) from 51.427% when pulling dac06c9bf5be4f673997b908e9ba7b5b0be93fc8 on fasttext into 5e4f5f9b7aded0571ed06068d169d3bb2d21e235 on master.

unhammer commented 1 year ago

As a default for lang_detect/full replacing cld, or as a new option? When trying now, I'm not even able to install cld with pip – it gives lots of build failures compiling the cbits (cld-errors.txt) – and on checking torro, it's not even using CLD, which I guess means it's on the old coverage-based version 😅

I guess if we can't even run CLD, that's all the more reason to switch the default, though maybe I'm just missing something obvious?

In any case, the server currently spends about 3s on language detection of a fairly short paragraph, and with the coverage method that time grows with the number of languages installed (on my laptop, coverage with 8 monolinguals takes 1s per request). Coverage might be more accurate on longer texts (especially since it's restricted to the pairs we have), but on shorter texts it often gets the language family completely wrong (whereas fasttext errors tend to at least be neighbouring languages). If we switch to a faster method, we could even consider suggesting different source languages (without the button being clicked) the way Google etc. do.

unhammer commented 1 year ago

@TinoDidriksen can we include the lid.prod.ftz file in /usr/share/apertium/ in the .deb?

TinoDidriksen commented 1 year ago

can we include the lid.prod.ftz file in /usr/share/apertium/ in the .deb?

Add it to https://github.com/apertium/apertium-apy/blob/master/setup.py#L70
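Something like the following should do it (a sketch only – the exact entry and target path in apertium-apy's setup.py may differ):

from setuptools import setup

setup(
    name='apertium-apy',   # the real setup.py has more metadata; elided here
    version='0.0.0',       # placeholder
    # Install the model under <prefix>/share/apertium/, so the .deb ends up
    # shipping it as /usr/share/apertium/lid.prod.ftz.
    data_files=[('share/apertium', ['lid.prod.ftz'])],
)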

sushain97 commented 1 year ago

As a default for lang_detect/full replacing cld or new option? When trying now, I'm not able to even install cld with pip, it's giving lots of build failures compiling the cbits (

I was thinking as a new option but perhaps at this point it's worth ditching cld2 entirely and bumping a major version?

unhammer commented 1 year ago

I wouldn't mind :)

Though it would be nice to first flesh out the corpora with at least the release languages. I've trained on up to 100k lines from the Opus-100 set, but some languages weren't covered there. @ftyers @jonorthwash is there a good collection covering these?:

(I'd prefer something with a URL I can wget in the script if possible.)

flammie commented 1 year ago

I made a list of corpus sizes on my GitHub when researching for an article: https://github.com/flammie/flammie.github.io/blob/main/languages-by-resources.markdown – this is all scripted with the OPUS API and Wikimedia dumps, so it might have a few more than opus-100? Here's a script; it just counts sizes from HTTP headers, but changing it to fetch the data would be easy.
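(In outline, counting sizes from HTTP headers is something like this – a sketch with a hypothetical URL; the real script handles more cases:)

import requests

def corpus_size_bytes(url: str) -> int:
    # Ask the server for Content-Length without downloading the file.
    response = requests.head(url, allow_redirects=True, timeout=30)
    response.raise_for_status()
    return int(response.headers.get('Content-Length', 0))

print(corpus_size_bytes('https://example.org/some-corpus.txt.gz'))  # hypothetical URL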

For sme there's a large corpus somewhere in the SVN :-)

jonorthwash commented 1 year ago

For crh and kaz, there are wikipedia corpora.

As part of this PR, we need documentation for how to add new languages.

I also want to be sure it'll work out of the box with no to minimal training—i.e., if someone's just setting something up for internal purposes using a non-public language module and doesn't need language detection, do they still have to do something special?

unhammer commented 1 year ago

Training is really simple, you just need a big file like

$ head train 
__label__mlt formalitajiet qabel il - ħruġ ta '  merkanzija
__label__zho 我只想知道你叫什么
__label__zho 9 .  秘书长报告 ( a / 56 / 800第35至43段 ) 讨论了联合国行政法庭规约与劳工组织行政法庭规约之间的差异 。 
__label__afr databasise
__label__bul климент охридски  " в скопие ,  македония . 
__label__zho 反对 :  无 。 
__label__spa adam llega a la presión . 
__label__isl annađ hvort hagar hann sér vel eđa ég mala hann mélinu smærra . 
__label__pol nie wcześniej . 
__label__dan -  turen er kun for mænd . 

and run fasttext supervised on it. The scripts I included turn opus-100 into this format (restricted to the langs from APy) and train on that, but I can try to make a more general script – yeah, that'd be nice to have.
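Roughly, training and quantizing with the Python bindings looks like this (a sketch – the hyperparameters the included scripts actually use may differ):

import fasttext

# 'train' is a file of __label__xxx <text> lines as shown above.
model = fasttext.train_supervised(input='train', epoch=25, lr=0.5, minn=2, maxn=4)

# Quantize to get the small .ftz model, then save it.
model.quantize(input='train', retrain=True)
model.save_model('lid.prod.ftz')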

jonorthwash commented 1 year ago

But can APy still be used without training?

unhammer commented 1 year ago

APy can still be used without training. If you don't pass a fasttext model when starting, it will fall back to the coverage-based method.

unhammer commented 1 year ago

Merging for now, since it's more usable with this than without and I don't know when I'll have time to do the rest of the things I'd like to include; I'll make issues for the remaining ones.

(I don't know what that 3.6.15 test is – I can't find a reference to that number in the code – but the regular tests passed, at least …)

unhammer commented 1 year ago

Enabled on beta.apertium.org – where curl beta.apertium.org/identifyLang previously took 7s on a short snippet, it now returns in <0.3s.
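For reference, the endpoint can also be queried from Python (a sketch – this assumes the usual APy q parameter and a JSON response for /identifyLang):

import requests

# Ask the beta server which language a short snippet is in.
resp = requests.get('https://beta.apertium.org/identifyLang',
                    params={'q': 'Dette er en kort tekst.'})
print(resp.json())  # expected: a mapping of language codes to confidence scores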