Open unhammer opened 1 year ago
How many lines of text do we need per language?
The current script only uses the first 100,000 lines of text for each corpus. That number came from experiments with Scandinavian languages, which can have very similar spelling, and was then increased a bit. If you have a language that is quite different from the rest of the set, then I think you can get away with quite a bit less. As the above comment shows, we "just" have 35k lines for sme (whereas e.g. deu has 100k), but https://beta.apertium.org/apy/identifyLang?q=ja+leat still gets it right.
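To make the per-corpus cap concrete, here is a minimal Python sketch of taking the first N lines of a corpus and labelling them for training. It is not the actual script in ./ft-train. The function names, the choice to skip blank lines, and the `__label__<lang>` prefix (fastText's default label format, assuming the .ftz files are fastText models) are all assumptions made for illustration.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator

MAX_LINES = 100_000  # the per-corpus cap mentioned above


def head_lines(path: Path, limit: int = MAX_LINES) -> Iterator[str]:
    """Yield at most `limit` non-empty lines from the start of a corpus file.

    Skipping blank lines is an assumption of this sketch, not something
    the real script is known to do.
    """
    with path.open(encoding="utf-8") as f:
        nonempty = (line.strip() for line in f if line.strip())
        yield from islice(nonempty, limit)


def labelled(lang: str, lines: Iterable[str]) -> Iterator[str]:
    """Prefix each line with a fastText-style label (assumed format)."""
    for line in lines:
        yield f"__label__{lang} {line}"
```

A corpus with fewer lines than the cap (like the 35k for sme) simply yields everything it has, so trying a smaller cap for a language only means passing a lower `limit`.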
We have trained model files lid.beta.ftz and lid.release.ftz in the repo for languages that were in the opus-100 corpus. We should get corpora for the languages that weren't there and retrain (preferably in a fairly reproducible way, see scripts in ./ft-train).
Corpus suggestions: https://github.com/apertium/apertium-apy/pull/207#issuecomment-1398455482
Missing in release:
Full missing-list for beta and release: https://github.com/apertium/apertium-apy/blob/master/ft-train/download-extract-corpus#L56