Open wellington36 opened 2 years ago
All these empty (underscore) UPOS tags are in multiword tokens, so it is OK. MWTs must have empty UPOS according to the guidelines:
"They have a FORM value – the string that occurs in the sentence – but have an underscore in all the remaining fields except MISC."
Empty UPOS in words would not be allowed by the validator.
If you want to search in words only (excluding MWTs), you can use, for example, Udapi:
cat *.conllu | udapy -TM util.Mark node='node.upos=="_"' | less -R
(which shows no matches in the case of UD_Portuguese-GSD).
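For readers without Udapi installed, the same check can be approximated in plain Python. This is only a sketch, not what util.Mark does internally: the function name words_with_empty_upos and the sample sentence are made up for illustration, and it assumes tab-separated, 10-column CoNLL-U in which multiword-token lines carry a range ID such as 2-3. Skipping empty nodes (IDs such as 5.1) is also an assumption of the sketch. The last token of the sample deliberately has an empty UPOS so that there is something to find.

```python
def words_with_empty_upos(conllu_text):
    """Return (line_number, id, form) for syntactic words whose UPOS is '_'.

    Comments, blank lines, multiword-token ranges (ID like '2-3') and
    empty nodes (ID like '5.1') are skipped.
    """
    hits = []
    for lineno, line in enumerate(conllu_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # malformed line; leave format checks to the validator
        tok_id, form, upos = cols[0], cols[1], cols[3]
        if "-" in tok_id or "." in tok_id:
            continue  # MWT range or empty node, not a syntactic word
        if upos == "_":
            hits.append((lineno, tok_id, form))
    return hits


# Hypothetical sample: "ao" is an MWT (a + o); "mar" has a wrongly empty UPOS.
SAMPLE = "\n".join([
    "# text = Vamos ao mar",
    "1\tVamos\tir\tVERB\t_\t_\t0\troot\t_\t_",
    "2-3\tao\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\ta\ta\tADP\t_\t_\t4\tcase\t_\t_",
    "3\to\to\tDET\t_\t_\t4\tdet\t_\t_",
    "4\tmar\tmar\t_\t_\t_\t1\tobl\t_\t_",
    "",
])

print(words_with_empty_upos(SAMPLE))
```

The MWT line 2-3 also has an underscore in the UPOS column, but it is not reported, which is the distinction made above.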
@martinpopel, thanks for your comment. I corrected the initial comment.
Oh, I see. I haven't read the awk command carefully. I still think the MWTs should be filtered out from the query as these are OK (either by $1 !~ /#|-/ or $4 !~ /PUNCT|_/).
FEATS annotation is optional in UD in general. There used to be rules/suggestions that every VERB has VerbForm, every NUM has NumType and every DET and PRON has PronType, but I cannot find these rules in the official UD documentation now. They are still included in ud.MarkBugs. Of course, there may be additional language-specific rules requiring FEATS also for other parts of speech.
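To make the three suggestions concrete, here is a minimal sketch in plain Python. It is not the ud.MarkBugs implementation and may differ from it in details; the names REQUIRED, parse_feats and missing_expected_feats are hypothetical, the "no-<Feature>" labels merely imitate the style of labels such as no-VerbForm, and the sample sentence is invented. It assumes tab-separated, 10-column CoNLL-U.

```python
from collections import Counter

# UPOS -> feature that the old rule/suggestion expected (as described above).
REQUIRED = {"VERB": "VerbForm", "NUM": "NumType", "DET": "PronType", "PRON": "PronType"}


def parse_feats(feats):
    """Turn a FEATS string such as 'Mood=Ind|VerbForm=Fin' into a dict."""
    if feats == "_":
        return {}
    parsed = {}
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        parsed[name] = value
    return parsed


def missing_expected_feats(conllu_text):
    """Count words whose UPOS is in REQUIRED but which lack the expected feature."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or "-" in cols[0] or "." in cols[0]:
            continue  # skip malformed lines, MWT ranges and empty nodes
        expected = REQUIRED.get(cols[3])
        if expected and expected not in parse_feats(cols[5]):
            counts["no-" + expected] += 1
    return counts


# Hypothetical sample: the PRON has no FEATS, the VERB lacks VerbForm.
SAMPLE = "\n".join([
    "# text = Ele comprou dois livros",
    "1\tEle\tele\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tcomprou\tcomprar\tVERB\t_\tMood=Ind|Tense=Past\t0\troot\t_\t_",
    "3\tdois\tdois\tNUM\t_\tNumType=Card\t4\tnummod\t_\t_",
    "4\tlivros\tlivro\tNOUN\t_\tGender=Masc|Number=Plur\t2\tobj\t_\t_",
    "",
])

for label, n in sorted(missing_expected_feats(SAMPLE).items()):
    print(label, n)
```

NOUN is not in REQUIRED, so a noun without features would not be counted here, in line with FEATS being optional in general.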
BTW: regarding general validation errors, the official validator gives me many L3 and L5 errors, when applied on the files in the master and dev branch, although the validation report currently does not show any errors for UD_Portuguese-GSD.
The Udapi validation git checkout dev; cat *.conllu | udapy -TMA ud.MarkBugs | less -R also shows many errors (with no-VerbForm and no-PronType being the most frequent ones):
bugs = ud.MarkBugs Error Overview:
aux-child 1
punct-deprel 4
appos-rightheaded 6
punct-child 7
multi-subj 14
cc-upos 37
cop-upos 39
mark-upos 40
multi-obj 45
mark-child 66
punct-nonproj-gap 77
punct-nonproj 201
punct-alpha 250
det-upos 286
cop-many-lemmas 414
case-child 456
no-VerbForm 27579
no-PronType 33697
TOTAL 63219
I didn't know about this Udapi command (git checkout dev; cat *.conllu | udapy -TMA ud.MarkBugs | less -R), interesting. I really need to run this on https://github.com/UniversalDependencies/UD_Portuguese-Bosque.
@martinpopel Indeed, this awk command captures unnecessary cases; I can improve it later. And indeed, the features are not mandatory. Still, as in the corpus https://github.com/UniversalDependencies/UD_Portuguese-Bosque, such information helps in the analysis of certain components of Portuguese, so it is interesting to have it here.
Yes, features are optional. But once the corpus has features, it would make sense that all PRON/DET have PronType, all NUM have NumType, and all VERB have VerbForm. The validator does not check it at present but it may issue warnings in the future.
BTW: regarding general validation errors, the official validator gives me many L3 and L5 errors, when applied on the files in the master and dev branch, although the validation report currently does not show any errors for UD_Portuguese-GSD.
I'm not sure what in the linked report you are looking at, but there are plenty of errors; this is copied directly from the report:
TOTAL 1734; L3 Syntax leaf-aux-cop 2; L3 Syntax leaf-cc 1; L3 Syntax leaf-mark-case 521; L3 Syntax leaf-punct 2; L3 Syntax punct-causes-nonproj 79; L3 Syntax punct-is-nonproj 201; L3 Syntax rel-upos-advmod 60; L3 Syntax rel-upos-aux 27; L3 Syntax rel-upos-case 83; L3 Syntax rel-upos-cc 26; L3 Syntax rel-upos-cop 39; L3 Syntax rel-upos-det 286; L3 Syntax rel-upos-expl 4; L3 Syntax rel-upos-mark 67; L3 Syntax rel-upos-nummod 37; L3 Syntax right-to-left-appos 6; L3 Syntax upos-rel-punct 4; L5 Morpho aux-lemma 134; L5 Syntax cop-lemma 155
Portuguese-GSD is a legacy treebank, meaning that it was allowed to release it even with these errors, but it is not a valid treebank. Spoiler: it is likely that the legacy status will not be granted forever. There are proposals to limit it to a certain number of years after the validator started reporting the error.
I'm not sure what in the linked report you are looking at, but there are plenty of errors
Oh, I see. I was in a hurry and perhaps confused UD_Portuguese-PUD with UD_Portuguese-GSD (unfortunately, not for the first time).
So while I agree that FEATS are useful and it would be nice to annotate them in GSD in the same way as in Bosque, it seems that fixing the errors reported by the validator should have a higher priority.
So while I agree that FEATS are useful and it would be nice to annotate them in GSD in the same way as in Bosque, it seems that fixing the errors reported by the validator should have a higher priority.
Indeed, given the number of errors, I agree to work on these cases soon.
Portuguese-GSD is a legacy treebank, meaning that it was allowed to release it even with these errors, but it is not a valid treebank. Spoiler: it is likely that the legacy status will not be granted forever. There are proposals to limit it to a certain number of years after the validator started reporting the error.
Yes @martinpopel, I also don't want to keep GSD as a legacy treebank. We are trying to invest some time here to make it more compatible with https://github.com/UniversalDependencies/UD_Portuguese-Bosque. I also agree that the errors reported by validate.py are the most important ones. BTW, how do you compare the ud.MarkBugs block with the validate.py script?
how do you compare ud.MarkBugs block with the validate.py script?
They have different goals but they should be complementary.
All UD treebanks (except for the legacy ones) must pass validate.py.
In ud.MarkBugs, I tried to focus on phenomena which are frequently annotation errors, but there may be rare exceptions which are OK, or rather questionable but not explicitly forbidden by the UD rules.
ud.MarkBugs intentionally does not check low-level CoNLL-U format requirements - I did not want to duplicate the code of validate.py. Also the whole Udapi tries to be very forgiving (and fast) when reading CoNLL-U (because many of the problems can be fixed using Udapi) - it fails only on unrecoverable errors such as cycles in the basic dependencies.
I wrote ud.MarkBugs in 2017 and there have been almost no updates since then. In 2017, I think there was no overlap, but meanwhile some of the checks were included in validate.py as well.
We can extend ud.MarkBugs with new checks if needed. It is being used for computing the stars indicating treebank quality, but that's just an informal visual aid (mostly for users who want to select the best treebank for a given language).
We have several cases of empty features by upos:
In other words, we have few ADJ, NOUN, PROPN, VERB... with features:
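A per-UPOS tally of this kind can be produced with a short script. The sketch below is not the awk command used in this thread; the function name feats_coverage_by_upos and the sample sentence are hypothetical, and it assumes tab-separated, 10-column CoNLL-U. Following the suggestion above, comment lines and MWT ranges are excluded so that the underscores on MWT lines are not counted as missing features.

```python
from collections import Counter


def feats_coverage_by_upos(conllu_text):
    """Map each UPOS to (words_with_empty_FEATS, total_words)."""
    empty, total = Counter(), Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or "-" in cols[0] or "." in cols[0]:
            continue  # skip malformed lines, MWT ranges and empty nodes
        upos, feats = cols[3], cols[5]
        total[upos] += 1
        if feats == "_":
            empty[upos] += 1
    return {upos: (empty[upos], total[upos]) for upos in sorted(total)}


# Hypothetical sample: "do" is an MWT (de + o) and must not be counted.
SAMPLE = "\n".join([
    "# text = Ele gosta do livro",
    "1\tEle\tele\tPRON\t_\tPronType=Prs\t2\tnsubj\t_\t_",
    "2\tgosta\tgostar\tVERB\t_\t_\t0\troot\t_\t_",
    "3-4\tdo\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tde\tde\tADP\t_\t_\t5\tcase\t_\t_",
    "4\to\to\tDET\t_\tDefinite=Def|PronType=Art\t5\tdet\t_\t_",
    "5\tlivro\tlivro\tNOUN\t_\t_\t2\tobl\t_\t_",
    "",
])

for upos, (n_empty, n_total) in feats_coverage_by_upos(SAMPLE).items():
    print(f"{upos}\t{n_empty}/{n_total} without FEATS")
```

Note that an empty FEATS column is expected for some parts of speech (ADP, for instance, often has no features), so the counts are a starting point for inspection rather than an error list.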