Change default language models

foxik commented 4 years ago

Hi,

the "default" models in the languages.json file are suboptimal. For example, for Czech the model cs is cs_cltt, which is just 860 sentences of legal texts, whereas cs_pdt contains 68.5k sentences of general Czech text. The same holds for English, where en is en_partut with 1781 sentences, compared to 12.5k sentences in en_ewt.

https://github.com/TakeLab/spacy-udpipe/blob/e12287fce111335b2bbe2f3e7c8429e6b1f62385/spacy_udpipe/languages.json#L14

The UDPipe service http://lindat.mff.cuni.cz/services/udpipe/ actually has a "best" model for every language (mostly the largest one, or the second-largest if the largest does not contain all annotations). You can find the "selected" language models in https://github.com/ufal/udpipe/blob/master/releases/models.txt, where the first model listed for a language is the selected one (for Czech, cs_pdt is the first cs_* line, so it is the default model).
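The "first listed model wins" rule could be sketched roughly as follows. This is only an illustration: it assumes the model identifiers have already been extracted from models.txt into an ordered list (the actual file layout is not parsed here), the sample list is made up for the example, and `pick_defaults` is a hypothetical helper, not part of spacy-udpipe or UDPipe.

```python
def pick_defaults(model_ids):
    """Return {language: first model id seen for that language}.

    Assumes ids look like "<lang>_<treebank>" (e.g. "cs_pdt") and that
    the input order matches the order in which models are listed.
    """
    defaults = {}
    for model_id in model_ids:
        lang = model_id.split("_", 1)[0]
        # setdefault keeps the first model seen for each language
        defaults.setdefault(lang, model_id)
    return defaults


# Illustrative sample only, not the real contents of models.txt.
ordered_ids = ["cs_pdt", "cs_cac", "cs_cltt", "en_ewt", "en_partut"]
defaults = pick_defaults(ordered_ids)
print(defaults)  # {'cs': 'cs_pdt', 'en': 'en_ewt'}
```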

Also, even if cs is, for example, a link to the cs-pdt model, you should also include an explicit cs-pdt entry, so that users can request cs-pdt explicitly even when it is the default Czech model.
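To make the alias idea concrete, here is a minimal sketch of a lookup table that holds both the short key and the explicit key. The function name `build_language_map` and the file names are hypothetical placeholders, not the actual layout of languages.json; the sketch also assumes the first entry listed for a language is the one meant to be its default.

```python
def build_language_map(model_files):
    """Map every explicit key plus a per-language alias to a model file.

    model_files: dict of "<lang>-<treebank>" -> file name, ordered so
    that the preferred model of each language comes first.
    """
    mapping = {}
    for key, filename in model_files.items():
        lang = key.split("-", 1)[0]
        # The bare language code points at the first model listed for it.
        mapping.setdefault(lang, filename)
        # The explicit key is always kept, including for the default model.
        mapping[key] = filename
    return mapping


# Placeholder file names, for illustration only.
language_map = build_language_map({
    "cs-pdt": "cs-pdt.udpipe",
    "cs-cltt": "cs-cltt.udpipe",
    "en-ewt": "en-ewt.udpipe",
})
print(language_map["cs"] == language_map["cs-pdt"])  # True
```

With a table like this, asking for "cs" and asking for "cs-pdt" resolve to the same file, while "cs-cltt" stays reachable for users who want that treebank specifically.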

foxik commented 4 years ago

Cc @arahusky

asajatovic commented 4 years ago

@foxik @arahusky thanks for the suggestion. You can see the updates in pull request #7; please verify that they are OK. If you have any other suggestions, feel free to open a pull request. :smiley:

foxik commented 4 years ago

Great, thank you very much!

TakeLab / spacy-udpipe

Change default language models #6