UniversalDependencies / UD_Portuguese-GSD

Brazilian Portuguese data from the Google Universal Dependency Treebanks 2.0.

Missing features #23

Open wellington36 opened 2 years ago

wellington36 commented 2 years ago

We have several cases of empty features by upos:

% awk '$6 ~ /_/ && $4 !~ /PUNCT/ && $1 !~ /#/ {print $4}' *.conllu | sort | uniq -c | sort -n -r

  56311 NOUN
  51935 ADP
  32920 PROPN
  27579 VERB
  26306 DET
  21914 _
  15111 ADJ
  10986 CCONJ
   8286 ADV
   7389 PRON
   7363 AUX
   1009 SYM
    750 PART
    534 X
      2 SCONJ

In other words, we have few ADJ, NOUN, PROPN, VERB... with features:

% awk '$6 !~ /_/  && $4 !~ /PUNCT/ && $1 !~ /#/ {print $4}' *.conllu | sort | uniq -c | sort -n -r

  21350 DET
  12078 
   8491 NUM
   1489 ADV
      8 NOUN
      4 PROPN
      3 VERB
      2 ADJ
      1 PRON
martinpopel commented 2 years ago

All these empty (underscore) UPOS tags are in multiword tokens, so it is OK. MWTs must have empty UPOS according to the guidelines:

They have a FORM value – the string that occurs in the sentence – but have an underscore in all the remaining fields except MISC

Empty UPOS in words would not be allowed by the validator.

If you want to search in words only (excluding MWTs), you can use e.g. Udapi and cat *.conllu | udapy -TM util.Mark node='node.upos=="_"' | less -R (which shows no matches in case of UD_Portuguese-GSD).

wellington36 commented 2 years ago

@martinpopel, thanks for your comment. I corrected the initial comment.

martinpopel commented 2 years ago

Oh, I see. I hadn't read the awk command carefully. I still think the MWTs should be filtered out from the query as these are OK (either by $1 !~ /#|-/ or $4 !~ /PUNCT|_/). FEATS annotation is optional in UD in general. There used to be rules/suggestions that every VERB has VerbForm, every NUM has NumType and every DET and PRON has PronType, but I cannot find these rules in the official UD documentation now. They are still included in ud.MarkBugs. Of course, there may be additional language-specific rules requiring FEATS for other parts of speech as well.
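For anyone who prefers Python over awk, the filtering described above could be sketched roughly as follows. This is only an illustration using the standard library, not Udapi or validate.py; the function name `empty_feats_by_upos` and the sample sentence are made up for the example. It treats a line as a syntactic word only if its ID column is a plain integer, which skips comments, MWT ranges such as 1-2, and empty nodes such as 1.1.

```python
from collections import Counter


def empty_feats_by_upos(lines):
    """Count syntactic words whose FEATS column is exactly "_", grouped by UPOS."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a 10-column CoNLL-U token line
        if not cols[0].isdigit():
            continue  # MWT ranges ("1-2") and empty nodes ("1.1")
        upos, feats = cols[3], cols[5]
        if feats == "_" and upos != "PUNCT":
            counts[upos] += 1
    return counts


# Hypothetical sample: the MWT "Do" is split into the words "De" + "o".
sample = [
    "# text = Do livro.",
    "1-2\tDo\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tDe\tde\tADP\t_\t_\t3\tcase\t_\t_",
    "2\to\to\tDET\t_\tDefinite=Def|Gender=Masc|Number=Sing|PronType=Art\t3\tdet\t_\t_",
    "3\tlivro\tlivro\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
]
result = empty_feats_by_upos(sample)
print(result)
```

Unlike `$6 ~ /_/`, the comparison `feats == "_"` requires an exact match, so the MWT line never shows up as a `_` row in the counts. To run it on a treebank file, pass an open file object instead of the sample list.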

BTW: regarding general validation errors, the official validator gives me many L3 and L5 errors, when applied on the files in the master and dev branch, although the validation report currently does not show any errors for UD_Portuguese-GSD.

The Udapi validation git checkout dev; cat *.conllu | udapy -TMA ud.MarkBugs | less -R also shows many errors (with no-VerbForm and no-PronType being the most frequent ones):

bugs = ud.MarkBugs Error Overview:
           aux-child          1
        punct-deprel          4
   appos-rightheaded          6
         punct-child          7
          multi-subj         14
             cc-upos         37
            cop-upos         39
           mark-upos         40
           multi-obj         45
          mark-child         66
   punct-nonproj-gap         77
       punct-nonproj        201
         punct-alpha        250
            det-upos        286
     cop-many-lemmas        414
          case-child        456
         no-VerbForm      27579
         no-PronType      33697
               TOTAL      63219
wellington36 commented 2 years ago

I didn't know about this Udapi command, git checkout dev; cat *.conllu | udapy -TMA ud.MarkBugs | less -R. Interesting. I really need to run this on https://github.com/UniversalDependencies/UD_Portuguese-Bosque.

wellington36 commented 2 years ago

@martinpopel Indeed, this awk command captures unnecessary cases; I can improve it later. And indeed, features are not mandatory. Still, as the corpus https://github.com/UniversalDependencies/UD_Portuguese-Bosque shows, such information helps in the analysis of certain components of Portuguese, so it would be interesting to have it here.

dan-zeman commented 2 years ago

Yes, features are optional. But once the corpus has features, it would make sense that all PRON/DET have PronType, all NUM have NumType, and all VERB have VerbForm. The validator does not check it at present but it may issue warnings in the future.
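A rough Python sketch of such a consistency check is shown below. It is not the code of validate.py or ud.MarkBugs; the mapping `EXPECTED`, the function name and the sample lines are assumptions made for illustration, and the `no-...` labels only imitate the names in the ud.MarkBugs overview above.

```python
from collections import Counter

# UPOS -> feature expected once a treebank annotates FEATS
# (taken from the comment above; not an official UD rule set).
EXPECTED = {
    "VERB": "VerbForm",
    "NUM": "NumType",
    "DET": "PronType",
    "PRON": "PronType",
}


def missing_expected_feats(lines):
    """Count words that lack the feature expected for their UPOS."""
    missing = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank lines and comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip malformed lines, MWT ranges and empty nodes
        upos, feats = cols[3], cols[5]
        wanted = EXPECTED.get(upos)
        if wanted is None:
            continue  # no expectation for this part of speech
        # FEATS is "_" or a "|"-separated list of Name=Value pairs.
        names = set() if feats == "_" else {f.split("=", 1)[0] for f in feats.split("|")}
        if wanted not in names:
            missing["no-" + wanted] += 1
    return missing


# Hypothetical token lines, only to exercise the four cases.
sample = [
    "1\tEle\tele\tPRON\t_\tCase=Nom\t2\tnsubj\t_\t_",
    "2\tcomprou\tcomprar\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tdois\tdois\tNUM\t_\tNumType=Card\t4\tnummod\t_\t_",
    "4\tlivros\tlivro\tNOUN\t_\t_\t2\tobj\t_\t_",
    "5\tdo\tde\tADP\t_\t_\t6\tcase\t_\t_",
    "6\to\to\tDET\t_\tPronType=Art\t4\tdet\t_\t_",
]
report = missing_expected_feats(sample)
print(report)
```

In the sample, the PRON has features but no PronType and the VERB has no features at all, so each is counted once; the NUM and DET pass.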

BTW: regarding general validation errors, the official validator gives me many L3 and L5 errors, when applied on the files in the master and dev branch, although the validation report currently does not show any errors for UD_Portuguese-GSD.

I'm not sure what in the linked report you are looking at, but there are plenty of errors; this is copied directly from the report:

TOTAL 1734
L3 Syntax leaf-aux-cop 2
L3 Syntax leaf-cc 1
L3 Syntax leaf-mark-case 521
L3 Syntax leaf-punct 2
L3 Syntax punct-causes-nonproj 79
L3 Syntax punct-is-nonproj 201
L3 Syntax rel-upos-advmod 60
L3 Syntax rel-upos-aux 27
L3 Syntax rel-upos-case 83
L3 Syntax rel-upos-cc 26
L3 Syntax rel-upos-cop 39
L3 Syntax rel-upos-det 286
L3 Syntax rel-upos-expl 4
L3 Syntax rel-upos-mark 67
L3 Syntax rel-upos-nummod 37
L3 Syntax right-to-left-appos 6
L3 Syntax upos-rel-punct 4
L5 Morpho aux-lemma 134
L5 Syntax cop-lemma 155

Portuguese-GSD is a legacy treebank, meaning that it was allowed to release it even with these errors, but it is not a valid treebank. Spoiler: it is likely that the legacy status will not be granted forever. There are proposals to limit it to a certain number of years after the validator started reporting the error.

martinpopel commented 2 years ago

I'm not sure what in the linked report you are looking at, but there are plenty of errors

Oh, I see. I was in a hurry and perhaps confused UD_Portuguese-PUD with UD_Portuguese-GSD (unfortunately, not for the first time).

So while I agree that FEATS are useful and it would be nice to annotate them in GSD in the same way as in Bosque, it seems that fixing the errors reported by the validator should have a higher priority.

wellington36 commented 2 years ago

So while I agree that FEATS are useful and it would be nice to annotate them in GSD in the same way as in Bosque, it seems that fixing the errors reported by the validator should have a higher priority.

Indeed, given the number of errors, I agree to work on these cases soon.

arademaker commented 2 years ago

Portuguese-GSD is a legacy treebank, meaning that it was allowed to release it even with these errors, but it is not a valid treebank. Spoiler: it is likely that the legacy status will not be granted forever. There are proposals to limit it to a certain number of years after the validator started reporting the error.

Yes @martinpopel, I also don't want to keep GSD as a legacy treebank. We are trying to put some time into making it more compatible with https://github.com/UniversalDependencies/UD_Portuguese-Bosque. I also agree that the errors reported by validate.py are the most important ones. BTW, how do you compare ud.MarkBugs block with the validate.py script?

martinpopel commented 2 years ago

how do you compare ud.MarkBugs block with the validate.py script?

They have different goals but they should be complementary.

All UD treebanks (except for the legacy ones) must pass validate.py. In ud.MarkBugs, I tried to focus on phenomena which are frequently annotation errors, but there may be rare exceptions which are OK, or rather questionable but not explicitly forbidden by the UD rules. ud.MarkBugs intentionally does not check low-level CoNLL-U format requirements - I did not want to duplicate the code of validate.py. Also the whole Udapi tries to be very forgiving (and fast) when reading CoNLL-U (because many of the problems can be fixed using Udapi) - it fails only on unrecoverable errors such as cycles in the basic dependencies.

I wrote ud.MarkBugs in 2017 and there have been almost no updates since then. In 2017, I think there was no overlap, but meanwhile some of the checks were included in validate.py as well.

We can extend ud.MarkBugs with new checks if needed. It is being used for computing the stars indicating treebank quality, but that's just an informal visual aid (mostly for users who want to select the best treebank for a given language).