Closed foxik closed 4 years ago
Cc @arahusky
@foxik @arahusky thanks for the suggestion. You can see the updates in pull request #7; please verify that they are OK. If you have any other suggestions, feel free to open a pull request. :smiley:
Great, thank you very much!
Hi,
the "default" models in the `languages.json` file are suboptimal. For example, for Czech, the model `cs` is `cs_cltt`, which is just 860 sentences of legal texts, compared to `cs_pdt` containing 68.5k sentences of general Czech text; the same holds for English, where `en` is `en_partut` with 1781 sentences, compared to 12.5k sentences of `en_ewt`.
https://github.com/TakeLab/spacy-udpipe/blob/e12287fce111335b2bbe2f3e7c8429e6b1f62385/spacy_udpipe/languages.json#L14
The UDPipe service http://lindat.mff.cuni.cz/services/udpipe/ actually has a "best" model for every language (mostly the largest one, or the second-largest if the largest does not contain all annotations). You can find the "selected" language models in https://github.com/ufal/udpipe/blob/master/releases/models.txt, where the first appearing model is the selected one (for Czech, `cs_pdt` is the first `cs_*` line, so it is the default model).

Also, even if you have, for example, `cs` as a link to the `cs-pdt` model, you should also include an explicit `cs-pdt` model (so that users can explicitly use `cs-pdt` even if it is the default Czech model).
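
The two suggestions above (first listed model per language becomes the default, and every model also keeps an explicit key) could be sketched roughly as follows. This is only an illustration under assumptions: `build_model_table` is a hypothetical helper, its input is an already-ordered list of short identifiers such as `cs_pdt` rather than the real `models.txt` format, and the resulting dictionary is not the actual layout of `languages.json`.

```python
def build_model_table(ordered_models):
    """Map language codes and explicit model names to model identifiers.

    `ordered_models` is a hypothetical, already-parsed list of identifiers
    of the form "<lang>_<treebank>", in the order in which they are listed.
    """
    table = {}
    for model in ordered_models:
        lang, _, treebank = model.partition("_")
        # The first model seen for a language becomes that language's default.
        table.setdefault(lang, model)
        # Every model also stays reachable under an explicit key.
        table[f"{lang}-{treebank}"] = model
    return table


# Illustrative order only: the preferred model of each language comes first.
models = ["cs_pdt", "cs_cltt", "en_ewt", "en_partut"]
table = build_model_table(models)

print(table["cs"])       # default Czech model
print(table["cs-pdt"])   # same model, requested explicitly
print(table["cs-cltt"])  # non-default model, still available
```

With this shape, a user who asks for `cs` and a user who asks for `cs-pdt` get the same model, while smaller treebanks remain selectable by their explicit names.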