Closed unhammer closed 1 year ago
As a default for lang_detect/full, replacing cld or a new option? When trying now, I'm not even able to install cld with pip; it gives lots of build failures compiling the cbits (cld-errors.txt) – and on checking torro, it's not even using CLD, so I guess that means it's on the old coverage-based version 😅
I guess if we can't even run CLD, that's all the more reason to switch the default, though maybe I'm just missing something obvious?
In any case, the server currently spends about 3s on language detection of a fairly short paragraph, and with the coverage method that time grows with the number of languages installed (on my laptop, coverage with 8 monolinguals takes 1s per request). Coverage might be more accurate on longer texts (especially since it's restricted to the pairs we have), but it often picks a completely wrong language family on shorter texts (whereas fasttext errors tend to at least be neighbouring languages). If we switch to a faster method, we could even consider suggesting different source languages (without the button being clicked), the way Google etc. do.
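For reference, the coverage-based method is roughly the following (a simplified sketch, not APy's actual code — the real thing runs each installed Apertium analyser over the text, here stubbed as a set of known words). It makes clear why the cost grows linearly with the number of installed languages: every analyser runs on every request.

```python
# Simplified sketch of coverage-based language detection (not APy's actual code).
# Each "analyser" is stubbed as a set of known words; the real implementation
# runs a morphological analyser and counts tokens that get an analysis.

def coverage(text, known_words):
    """Fraction of tokens the analyser recognises."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in known_words for t in tokens) / len(tokens)

def detect(text, analysers):
    """Pick the language whose analyser covers the most tokens.

    Every analyser runs once per request, so time grows with the
    number of installed languages.
    """
    return max(analysers, key=lambda lang: coverage(text, analysers[lang]))

analysers = {
    "eng": {"the", "cat", "sat", "on", "mat"},
    "spa": {"el", "gato", "se", "sentó", "en", "la", "alfombra"},
}
print(detect("el gato en la alfombra", analysers))  # → spa
```

A fastText model, by contrast, does a single forward pass regardless of how many languages it was trained on.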
@TinoDidriksen can we include the lid.prod.ftz file in /usr/share/apertium/ in the .deb?
> can we include the lid.prod.ftz file in /usr/share/apertium/ in the .deb?
Add it to https://github.com/apertium/apertium-apy/blob/master/setup.py#L70
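Something along these lines in the `data_files` list should pick the model up at install time (a sketch only; the surrounding `setup()` call and the existing entries in apertium-apy's setup.py are not reproduced here, and the exact install path is an assumption):

```python
# Hypothetical fragment for apertium-apy's setup.py;
# the install path and surrounding entries are assumptions.
data_files=[
    ('share/apertium', ['lid.prod.ftz']),
],
```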
> As a default for lang_detect/full, replacing cld or a new option?
I was thinking as a new option but perhaps at this point it's worth ditching cld2 entirely and bumping a major version?
I wouldn't mind :)
Though it would be nice to first flesh out the corpora with at least the release languages. I've trained on up to 100k lines from the Opus-100 set, but some languages weren't covered there. @ftyers @jonorthwash is there a good collection covering these?:
(I'd prefer something with a URL I can wget in the script, if possible.)
I made a list of corpus sizes on my GitHub when researching for an article: https://github.com/flammie/flammie.github.io/blob/main/languages-by-resources.markdown – this is all scripted with the OPUS API and Wikimedia dumps, so it might have a few more than Opus-100? Here's a script; it just counts sizes from HTTP headers, but changing it to fetch the data would be easy.
for sme there's a large corpus somewhere in the SVN :-)
For crh and kaz, there are wikipedia corpora.
As part of this PR, we need documentation for how to add new languages.
I also want to be sure it'll work out of the box with no to minimal training—i.e., if someone's just setting something up for internal purposes using a non-public language module and doesn't need language detection, do they still have to do something special?
Training is really simple; you just need a big file like

```
$ head train
__label__mlt formalitajiet qabel il - ħruġ ta ' merkanzija
__label__zho 我只想知道你叫什么
__label__zho 9 . 秘书长报告 ( a / 56 / 800第35至43段 ) 讨论了联合国行政法庭规约与劳工组织行政法庭规约之间的差异 。
__label__afr databasise
__label__bul климент охридски " в скопие , македония .
__label__zho 反对 : 无 。
__label__spa adam llega a la presión .
__label__isl annađ hvort hagar hann sér vel eđa ég mala hann mélinu smærra .
__label__pol nie wcześniej .
__label__dan - turen er kun for mænd .
```

and run `fasttext supervised` on it. The scripts I included turn Opus-100 into this format (restricted to the langs from APy) and train on that, but I can try to make a more general script – yeah, that'd be nice to have.
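A more general script could start from per-language plain-text files and just prepend the labels. A minimal sketch (the filename mapping is an assumption; the scripts in the PR derive this from Opus-100 instead):

```python
# Sketch: turn per-language text files into fastText supervised format.
# lang_files maps an ISO 639-3 code to a one-sentence-per-line text file;
# these names are illustrative, not from the PR's scripts.

def to_fasttext_format(lang_files, out_path):
    """Write '__label__<lang> <sentence>' lines for every input file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for lang, path in lang_files.items():
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:  # skip blank lines
                        out.write(f"__label__{lang} {line}\n")
```

From there, `fasttext supervised -input train -output lid` trains the model, and `fasttext quantize` compresses it into the small `.ftz` form that gets shipped.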
But can APy still be used without training?
APy can still be used without training. If you don't pass a fasttext model when starting, it will fall back to the coverage-based method.
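That fallback amounts to something like this (a hypothetical sketch; the names are assumptions, not APy's actual API):

```python
# Hypothetical sketch of the startup fallback (names are assumptions,
# not taken from apertium-apy's code).

def make_detector(fasttext_model=None, coverage_detect=None):
    """Prefer the fastText model when one was passed at startup;
    otherwise fall back to the coverage-based method."""
    if fasttext_model is not None:
        return lambda text: fasttext_model.predict(text)
    return coverage_detect

detect = make_detector(coverage_detect=lambda text: "eng")
print(detect("hello"))  # no model passed, so the coverage fallback answers
```

So a setup with a non-public language module and no trained model needs nothing special; it just keeps the old behaviour.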
Merging for now since it's more usable than without this, and I don't know when I'll have time to do the rest of the things I'd like to include; I'll make issues for the remaining ones.
(I don't know what that 3.6.15 test is, can't find a reference to that number in the code, but the regular tests passed at least …)
Enabled on beta.apertium.org – where curl beta.apertium.org/identifyLang previously took 7s on a short snippet, it now returns in <0.3s
The current language id thing on the web site is slow and often wrong. Fasttext is pretty good (and could be made even better if we train with the languages we serve, just need to gather some text in each language) – and it's fast.
Further things we could do: