Closed by mapmeld 3 years ago
Thanks, this kind of information is great to have!
In a bit of a rush right before the release, we went through as many languages as we could so the quickstart would have at least some recommendation for each, but since we weren't familiar with a lot of the resources, we didn't have much to go on beyond download numbers. We'll have to test whether all the models load and run correctly with the current spacy-transformers, but as long as they do, we're very happy to update the recommendations.
All the recommended models seem to work fine in our basic tests, so we'll update everything except for dv, since we don't have basic language support for it at this point.
(In general, you can combine lang = "xx" with a transformer model for languages with English-ish tokenization (whitespace, similar punctuation symbols), but xx is configured as ltr, so displacy and some other options would probably be wonky.)
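As a rough illustration of the lang = "xx" suggestion, here is a minimal sketch of a partial training config, written out from Python. It is not taken from this thread: the model name (bert-base-multilingual-uncased), the architecture version string, the span-getter settings, and the helper names are all assumptions, and the right values depend on the installed spacy-transformers release and on the model you want to try.

```python
import configparser
from pathlib import Path

# Partial spaCy v3 config: the multi-language class "xx" plus a transformer.
# ASSUMPTIONS (not from the thread): the model name, the architecture
# version, and the span-getter values are illustrative placeholders.
BASE_CONFIG = """\
[nlp]
lang = "xx"
pipeline = ["transformer"]

[components]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "bert-base-multilingual-uncased"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
"""


def parse_base_config(text):
    """Parse the INI-style text with the stdlib as a basic structure check.

    Values are returned as raw strings, including their quotes.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


def write_base_config(path="base_config.cfg"):
    """Write the partial config to disk and return the path (hypothetical helper)."""
    target = Path(path)
    target.write_text(BASE_CONFIG, encoding="utf-8")
    return target
```

After writing the file, something like `python -m spacy init fill-config base_config.cfg config.cfg` should fill in the remaining defaults. The ltr caveat above still applies to anything that depends on writing direction.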
Hi - I was happy to see spaCy included one of my models (hindi-tpu-electra) in the quickstart_training_recommendations.yml file - but to be honest there are better models available - there's even a paper on it!
The file: https://github.com/explosion/spaCy/blob/master/spacy/cli/templates/quickstart_training_recommendations.yml
Hostility Detection in Hindi paper - the best model the authors found was https://huggingface.co/ai4bharat/indic-bert - can you load an ALBERT-architecture model here? Otherwise, there are MuRIL and LaBSE models, which some users have uploaded to HuggingFace but which are not officially hosted there yet.
If I can make other recommendations for South Asian languages which I've been benchmarking: