explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Updating recommended models for CLI Quickstart #7044

Closed: mapmeld closed this issue 3 years ago

mapmeld commented 3 years ago

Hi, I was happy to see that spaCy included one of my models (hindi-tpu-electra) in the quickstart_training_recommendations.yml file, but to be honest there are better models available. There's even a paper on it!

The file: https://github.com/explosion/spaCy/blob/master/spacy/cli/templates/quickstart_training_recommendations.yml

In the Hostility Detection in Hindi paper, the best model the authors found was https://huggingface.co/ai4bharat/indic-bert. Can you load an ALBERT-architecture model here? Otherwise, there are MuRIL and LaBSE models, which some users have uploaded to HuggingFace but which are not officially hosted there yet.

If I can make other recommendations for South Asian languages which I've been benchmarking:

adrianeboyd commented 3 years ago

Thanks, this kind of information is great to have!

In a bit of a rush right before the release, we went through as many languages as we could so that the quickstart had at least some recommendation for each. Since we weren't familiar with a lot of the resources, we didn't have much to go on beyond download numbers. We'll have to test whether all the models load and run correctly with the current spacy-transformers, but as long as they do, we're very happy to update the recommendations.

adrianeboyd commented 3 years ago

All the recommended models seem to work fine in our basic tests, so we'll update everything except for dv, since we don't have basic language support for it at this point.

(In general, you can combine lang = "xx" with a transformer model for languages with English-ish tokenization (whitespace, similar punctuation symbols), but xx is configured as ltr so displacy and some other options would probably be wonky.)
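A minimal sketch of the `xx` fallback, assuming spaCy v3 is installed. It only builds the blank multi-language pipeline and inspects its tokenization and writing direction; the transformer itself is left out, and the sample text is a placeholder rather than real data.

```python
import spacy

# "xx" is spaCy's generic multi-language class; it needs no language-specific data.
nlp = spacy.blank("xx")

# Placeholder text: whitespace-separated words with ordinary punctuation.
doc = nlp("one two three.")
tokens = [token.text for token in doc]

# The writing direction configured for this language class.
direction = nlp.vocab.writing_system["direction"]

print(tokens)
print(direction)
```

For training, the equivalent would be setting `lang = "xx"` in the `[nlp]` block of the training config and pointing the transformer component at the model you want to use.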

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.