UniversalDependencies / UD_Czech-CAC

Data from the Czech Academic Corpus.

Spaces in column MISC #1

Open michmech opened 1 year ago

michmech commented 1 year ago

When I attempt to train a UDPipe model from this treebank, using UDPipe 1.2.0:

$ udpipe --train mymodel.udpipe UD_Czech-CAC-master/cs_cac-ud-train.conllu

I get the following error message:

Loading training data:
Cannot load training data from file 'UD_Czech-CAC-master/cs_cac-ud-train.conllu':
The CoNLL-U line
'39 vytvrditelné    vytvrditelný    ADJ AAFP1----1A---- Case=Nom|Degree=Pos|Gender=Fem|Number=Plur|Polarity=Pos 27  acl:relcl   27:acl:relcl    SpaceAfter=No|LDeriv=vytvrdit { přidat k tvrdit }'
contains spaces in column MISC!

Does this mean the treebank is broken? Or is there an option in UDPipe that I could use to get over this?

Thank you, Michal

dan-zeman commented 1 year ago

This line is surprising and I think the part { přidat k tvrdit } should not be there; nothing similar occurs anywhere else in the treebank.

However, spaces in MISC are not an error in general, so UDPipe should not die on them @foxik. (I think leading or trailing whitespace would trigger a validation error, but there can be a space in the middle of a value, for example, if there is a Latin transliteration of a FORM or LEMMA that contains a space.)
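
To check whether any other line has a space in MISC, a short scan of the file is enough. The following is only a minimal sketch: the function name `find_misc_spaces` and the sample lines are illustrative, and it assumes the standard CoNLL-U layout of 10 tab-separated columns with MISC last.

```python
def find_misc_spaces(lines):
    """Yield (line_number, misc) for token lines whose MISC column contains a space."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip comment lines and the blank lines that separate sentences.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        # A CoNLL-U token line has 10 tab-separated columns; MISC is the last one.
        if len(columns) == 10 and " " in columns[9]:
            yield number, columns[9]


# Illustrative sample (not copied from the treebank file).
sample = [
    "# text = example\n",
    "1\tvytvrditelné\tvytvrditelný\tADJ\t_\t_\t0\troot\t_\t"
    "SpaceAfter=No|LDeriv=vytvrdit { přidat k tvrdit }\n",
    "2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n",
    "\n",
]
hits = list(find_misc_spaces(sample))
for number, misc in hits:
    print(number, misc)
```

On a real file, the same function can be fed an open file object, for example `find_misc_spaces(open("cs_cac-ud-train.conllu", encoding="utf-8"))`.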

foxik commented 1 year ago

If I recall correctly, spaces in MISC were not originally allowed in CoNLL-U v2 (maybe in the proposed version), so the implementation in UDPipe 1 did not originally allow them; it allowed spaces only in FORM and LEMMA. Spaces in MISC have been allowed since https://github.com/ufal/udpipe/commit/9df115a6e8c0e71c94819f9007a6cebcbb363150, but we have not made a release since then (yes, it has long been planned...). Once the release is made, this will work; in the meantime, it is possible to compile UDPipe manually.

Note that this also affects UDPipe 2 (which uses UDPipe 1 for tokenization).
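
For anyone who cannot compile from source, one possible stopgap is to preprocess a private copy of the training file so that MISC no longer contains spaces. This is a rough sketch and not an official fix: the helper name `strip_misc_spaces` and the underscore replacement are assumptions, it changes the MISC values, and whether a released UDPipe then accepts the file has not been verified here.

```python
def strip_misc_spaces(line, replacement="_"):
    """Return the line with spaces in the MISC column replaced.

    Comment lines, blank lines, and lines without exactly 10 tab-separated
    columns are returned unchanged.
    """
    body = line.rstrip("\n")
    if not body or body.startswith("#"):
        return line
    columns = body.split("\t")
    if len(columns) != 10:
        return line
    # MISC is the tenth column; FORM and LEMMA are left untouched.
    columns[9] = columns[9].replace(" ", replacement)
    # Re-attach whatever line ending the original line had.
    return "\t".join(columns) + line[len(body):]


# Illustrative token line (hypothetical, not copied from the treebank).
token = "1\ta b\ta b\tX\t_\t_\t0\troot\t_\tLDeriv=x { y }\n"
print(strip_misc_spaces(token), end="")
```

Applying the function to every line of the input and writing the results to a new file keeps the original treebank untouched.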