UniversalDependencies / UD_Portuguese-Bosque

This is a Universal Dependencies (UD) Portuguese treebank.

Issues with roots #41

Closed fcbr closed 7 years ago

fcbr commented 7 years ago

Related to #29.

Two issues reported, but they are related so I'm combining them here.

Example (problematic lines are 10 and 16).

1   Muito   muito   ADV ADV_@>A _   2   advmod  _   _
2   mais    mais    ADV ADV_@ADVL>  _   16  advmod  _   _
3   do_que  do_que  SCONJ   KS_@COM _   4   dep _   _
4   em  em  ADP PRP_@KOMP<  _   6   case    _   _
5   os  o   DET ART_M_P_@>N Gender=Masc|Number=Plur|PronType=Art    6   det _   _
6   tempos  tempo   NOUN    N_M_P_@P<   Gender=Masc|Number=Plur 2   nmod    _   _
7   em  em  ADP PRP_@N< _   9   case    _   _
8   a   o   DET ART_F_S_@>N Gender=Fem|Number=Sing|PronType=Art 9   det _   _
9   ditadura    ditadura    NOUN    N_F_S_@P<   Gender=Fem|Number=Sing  6   nmod    _   _
10  ,   ,   PUNCT   PU_@PU  _   0   punct   _   _
11  a   o   DET ART_F_S_@>N Gender=Fem|Number=Sing|PronType=Art 12  det _   _
12  solidez solidez NOUN    N_F_S_@SUBJ>    Gender=Fem|Number=Sing  16  nsubj   _   _
13  de  de  ADP PRP_@N< _   15  case    _   _
14  o   o   DET ART_M_S_@>N Gender=Masc|Number=Sing|PronType=Art    15  det _   _
15  PT  PT  PROPN   PROP_M_S_@P<    Gender=Masc|Number=Sing 12  nmod    _   _
16  está   estar   VERB    V_PR_3S_IND_@FS-STA Mood=Ind|Number=Sing|Person=3|Tense=Pres    0   root    _   _
17  ,   ,   PUNCT   PU_@PU  _   16  punct   _   _
18  agora   agora   ADV ADV_@<ADVL  _   16  advmod  _   _
19  ,   ,   PUNCT   PU_@PU  _   16  punct   _   _
20  ameaçada   ameaçar    VERB    V_PCP_F_S_@ICL-<SC  Gender=Fem|Number=Sing|VerbForm=Part    16  xcomp   _   _
21  .   .   PUNCT   PU_@PU  _   16  punct   _   _
EckhardBick commented 7 years ago

16 does have 'root' - just scroll the GitHub window to the right. Punctuation attachment was not part of Bosque, but I doubt it would make linguistic sense to call a comma root :) Anyway, the problem is that this is one of the areas, like copula+conjunct, where the converter is not just a converter but actually adds information, so it can make errors. I just fixed the particular comma case with a default rule that applies if no other rule did, which should hopefully prevent unattached commas. I'll send a new CoNLL Bosque once I have looked at the other open issues.

fcbr commented 7 years ago

You're right, 16 is the correct root; the issue is indeed with 10 only.

fcbr commented 7 years ago

Please note that PUNCT is just one of the cases where the HEAD is 0 but the DEPREL is not root, although it is indeed the most common one.

$ cat *.conll | awk '{print $7,$8}'| grep ^0 | sort | uniq -c | sort -nr
   8583 0 root
   1108 0 punct
    399 0 dep
    190 0 parataxis
     59 0 nmod
     24 0 advcl
     22 0 conj
     16 0 advmod
     12 0 nsubj
     12 0 cop
     11 0 xcomp
     11 0 aux
      8 0 cc
      6 0 acl:relcl
      3 0 neg
      2 0 mark
      1 0 ccomp
      1 0 acl
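
The same tally can be sketched in Python. This is a minimal sketch, not part of the original report: it assumes tab-separated files with HEAD in column 7 and DEPREL in column 8 (as in the example above), and the function name `count_zero_head_labels` is made up for illustration. Unlike the awk pipeline, it skips blank separator lines and `#` comment lines explicitly.

```python
from collections import Counter


def count_zero_head_labels(lines):
    """Count DEPREL values among tokens whose HEAD is 0."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # malformed line; ignored by this tally
        head, deprel = cols[6], cols[7]
        if head == "0":
            counts[deprel] += 1
    return counts


# Three tokens taken from the example sentence (tabs restored by hand).
sample = [
    "10\t,\t,\tPUNCT\tPU_@PU\t_\t0\tpunct\t_\t_",
    "16\testá\testar\tVERB\tV_PR_3S_IND_@FS-STA\t_\t0\troot\t_\t_",
    "17\t,\t,\tPUNCT\tPU_@PU\t_\t16\tpunct\t_\t_",
]
for deprel, n in count_zero_head_labels(sample).most_common():
    print(n, 0, deprel)
```

Any DEPREL other than `root` in the resulting counter points at a token that needs attention.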
EckhardBick commented 7 years ago

This is NOT trivial; I've been working on it all night. Some cases can be fixed with rules (e.g. I forgot to cover ';', because I made rules mostly for commas and bracket pairing), but many cases are symptoms of errors in the original treebank that were never fixed manually, or that were introduced during our efforts to add further layers of information. I'll see what I can fix over the weekend. In any case, your consistency check is a nice, indirect way of flagging treebank errors!

arademaker commented 7 years ago

@EckhardBick I suspect that at some point we will have to start manually editing the files. Just let us know when you think the right time for that would be.

EckhardBick commented 7 years ago

I am already manually editing, but on the input files. The inconsistencies left are now almost all treebank errors, not conversion problems. Ideally, the transition point for starting editing on the UD files would be when there are only errors left that at least do not cause formal inconsistencies.

EckhardBick commented 7 years ago

I used the weekend to edit the treebank input files manually and to fix the last converter rule inconsistencies I could find. As a result, all 0-attachments now have the edge label "root", and there are now no sentences without a root (I think that hadn't been checked yet). Of course, I could just add a rule that forces the "root" label for 0-attachments, but that would make the true causes (other errors somewhere else in the tree) invisible, and also lead to more than 1 root per sentence.

We could add such a rule for live Palavras runs from text all the way up to UD, but for debugging/manual editing it's maybe best not to, and to keep some inconsistencies to flag "editables". If it's too much to edit, you can always run a simple replacement script forcing "root" for 0-attachments and default "dep" for any later "root" in the same sentence.
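
One possible reading of that replacement script, as a rough Python sketch rather than anything from PALAVRAS: the first token attached to 0 is labelled `root`, and any later root candidate in the same sentence falls back to `dep`. The function name `force_single_root` is hypothetical, and re-attaching later 0-attached tokens to the first root is an added assumption (leaving them on HEAD 0 with `dep` would just recreate the inconsistency).

```python
def force_single_root(sentence):
    """Return a copy of `sentence` with exactly one HEAD=0 token, labelled root.

    `sentence` is a list of token rows, each a list of 10 column strings
    (ID in column 1, HEAD in column 7, DEPREL in column 8).
    """
    fixed = [list(tok) for tok in sentence]  # do not modify the input
    root_id = None
    for tok in fixed:
        head, deprel = tok[6], tok[7]
        if head == "0" and root_id is None:
            # First 0-attachment in the sentence: force the root label.
            root_id = tok[0]
            tok[7] = "root"
        elif head == "0":
            # Later 0-attachment: ASSUMPTION, re-attach to the first root.
            tok[6] = root_id
            tok[7] = "dep"
        elif deprel == "root":
            # Labelled root but attached elsewhere: default to dep.
            tok[7] = "dep"
    return fixed


sentence = [
    ["10", ",", ",", "PUNCT", "PU_@PU", "_", "0", "punct", "_", "_"],
    ["16", "está", "estar", "VERB", "V_PR_3S_IND_@FS-STA", "_", "0", "root", "_", "_"],
    ["17", ",", ",", "PUNCT", "PU_@PU", "_", "16", "punct", "_", "_"],
]
for tok in force_single_root(sentence):
    print("\t".join(tok))
```

On these three tokens from the example sentence, the comma at 10 becomes the root and the verb at 16 is demoted to `dep`, which illustrates why such a blanket rule hides the real error instead of fixing it.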

I'll send a new UD bosque via mail.

arademaker commented 7 years ago

@EckhardBick I am assuming that this script is part of PALAVRAS now, something that takes the PALAVRAS dependencies and produces the UD dependencies. So it makes sense to keep improving this script.

I really didn't understand your point above:

Ideally, the transition point for starting editing on the UD files would be when there are only errors left that at least do not cause formal inconsistencies.

Could you give examples of errors that cause formal inconsistencies and errors that do not cause formal inconsistencies?

If you are now manually editing the input files of this script, maybe it makes sense to have these input files in the repo so we can help you edit them. If so, we will take the UD files as output files and never edit them directly. Are these files the CGDE files?

EckhardBick commented 7 years ago

Formal inconsistencies are e.g. having functions other than root as dependents of 0, or a wrong number of tabs, or empty fields that are obligatory, i.e. much of what was discussed here on GitHub. Ideally, these things should not happen in the conversion process with otherwise correct input. Examples of other errors are labelling subjects as objects, attaching conjuncts to the wrong heads, tripping over ellipsis etc. These are things that need to be found and addressed manually if you want to avoid such errors. And UD has its own error risks, such as the complexity of combined copula, auxiliary and preposition raising.
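
Two of the formal inconsistencies named above, a wrong number of tabs and empty fields, can be detected line by line. The following is a minimal Python sketch under stated assumptions: every token line is expected to have 10 tab-separated columns, as in the example at the top of the thread, and any empty string counts as a missing field (which fields are truly obligatory is not specified here). The function name `formal_problems` is invented for this illustration.

```python
EXPECTED_COLUMNS = 10  # assumption: 10 columns, as in the example sentence


def formal_problems(line):
    """Return a list of formal problems found in one token line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != EXPECTED_COLUMNS:
        # A wrong number of tabs makes the column positions unreliable,
        # so no further checks are attempted on this line.
        return [f"expected {EXPECTED_COLUMNS} tab-separated columns, found {len(cols)}"]
    problems = []
    for index, value in enumerate(cols, start=1):
        if value == "":
            problems.append(f"column {index} is empty")
    return problems


good = "\t".join(["1", "Muito", "muito", "ADV", "ADV_@>A", "_", "2", "advmod", "_", "_"])
short = "\t".join(["1", "Muito", "muito", "ADV", "ADV_@>A", "_", "2", "advmod", "_"])
empty = "\t".join(["1", "Muito", "", "ADV", "ADV_@>A", "_", "2", "advmod", "_", "_"])

for line in (good, short, empty):
    print(formal_problems(line))
```

Errors of the second kind (wrong labels, wrong heads) are invisible to a check like this, which is the distinction being drawn.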

I think I'm - hopefully - done with tuning the conversion now, and if it's UD you are going to use, I think it should be the UD version that is edited manually. Using my input files for this would create confusion: first, because the traditional linguistic dependencies and tags are so different from the UD ones; second, because manual changes might be inconsistent with PALAVRAS' category set and conventions, and hence with the conversion grammar; and third, because there are various helper tags in the automatic output of PALAVRAS, which help this conversion (and others I've made over the years) but are not meant for end users. So yes, the conversion grammar plus the script are part of the PALAVRAS chain now for producing UD, but if UD is what you want to use, I think manual effort is better spent on UD than on its input. It will also force editors to come to terms with, take decisions about and document the UD-specific complex constructions.

EckhardBick commented 7 years ago

On 10/17/2016 03:23 PM, Valeria de Paiva wrote:

great work @EckhardBick https://github.com/EckhardBick !

Thanks :)

@fcbr https://github.com/fcbr do you have any "super diff" that can tell us the extent of the differences between Zeman's conversion and the new one? number of sentences, e.g?

No, sorry.

or maybe I got the wrong end of the stick and this (finding the differences) is not important?

Not for me; my main interest is live rule-based annotation, and in this context creating a robust conversion grammar for turning Palavras output into UD. Since the Floresta treebanks were made by editing Palavras output, it makes sense for me to optimize this Bosque-UD conversion, from which both the treebank and Palavras will benefit. But I wasn't involved in Zeman's conversion of the older Bosque. I did some inspection last year, when I wrote the conversion grammar, and thought at the time that it had issues with the UD guidelines.

One example: 33% of all estar/ser tokens in the conll conversion of Floresta are neither cop nor aux but carry other functions (edge labels), whereas it's only 6% in my own conversion. The ideal in UD should be as low as possible, because ser and estar should either be aux dependents of main verbs or cop dependents of predicatives/complements. There may be some conjunctions, or the copula-on-copula problem I wrote about on GitHub, but nothing like 33%. Even my 6% is almost certainly too high and warrants manual editing. The reason why there is 33% non-cop-aux estar/ser in the conll UD is most likely that the UD conversion for copulas is VERY complex and needs a grammar, not just a script, so Zeman's conversion in many cases just didn't make any structural changes, leaving ser/estar as syntactic heads as in ordinary dependency grammar, and only translated the function tags (edge labels) from Floresta, not its structure.
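
A figure like the 33% versus 6% above could be computed roughly as follows. This is only a sketch of one way to measure it, not the procedure actually used: it assumes tab-separated lines with the lemma in column 3 and the edge label in column 8, and the function name `non_cop_aux_share` and its parameters are hypothetical. The accepted label set defaults to `cop` and `aux`; labels such as `auxpass` or `aux:pass` can be added if the treebank uses them.

```python
def non_cop_aux_share(lines, lemmas=("ser", "estar"), accepted=("cop", "aux")):
    """Share of tokens with one of `lemmas` whose DEPREL is not in `accepted`."""
    total = 0
    other = 0
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Blank, comment and malformed lines have fewer than 8 columns here.
        if len(cols) < 8 or cols[2] not in lemmas:
            continue
        total += 1
        if cols[7] not in accepted:
            other += 1
    return other / total if total else 0.0


# Hand-made toy rows (not treebank data): one root, one cop, one aux.
toy = [
    "1\testá\testar\tVERB\t_\t_\t0\troot\t_\t_",
    "2\té\tser\tVERB\t_\t_\t3\tcop\t_\t_",
    "3\tfoi\tser\tVERB\t_\t_\t4\taux\t_\t_",
    "4\tcasa\tcasa\tNOUN\t_\t_\t1\tnsubj\t_\t_",
]
print(f"{non_cop_aux_share(toy):.0%} of ser/estar tokens are neither cop nor aux")
```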

-- Eckhard

Eckhard Bick, cand.med., dr.phil. University of Southern Denmark e-mail: eckhard.bick@gmail.com web: http://beta.visl.sdu.dk

arademaker commented 7 years ago

@EckhardBick, In the first comment of this thread, the "DEPREL must be 'root' if HEAD is 0" should be "DEPREL must be 'root' if and only if HEAD is 0", right?

$ cat *.conll | awk '{print $7,$8}'| grep ^0 | sort | uniq -c | sort -nr
9212 0 root

$ cat *.conll | awk '{print $7,$8}'| grep root | sort | uniq -c | sort -nr
9212 0 root
   2 1 root
   1 5 root
   1 2 root
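
The two awk queries check the two directions of the "if and only if" separately. A small Python sketch can check both at once per sentence, together with the one-root-per-sentence condition. The function name `root_violations` and the `(id, head, deprel)` tuple layout are assumptions made for this illustration.

```python
def root_violations(sentence):
    """Check 'DEPREL is root if and only if HEAD is 0' and the root count.

    `sentence` is a list of (token_id, head, deprel) string tuples.
    """
    violations = []
    for tok_id, head, deprel in sentence:
        if head == "0" and deprel != "root":
            violations.append(f"{tok_id}: HEAD is 0 but DEPREL is '{deprel}'")
        if deprel == "root" and head != "0":
            violations.append(f"{tok_id}: DEPREL is 'root' but HEAD is {head}")
    n_roots = sum(1 for _, head, deprel in sentence if head == "0" and deprel == "root")
    if n_roots != 1:
        violations.append(f"expected exactly 1 root, found {n_roots}")
    return violations


# Tokens 10, 16 and 17 of the example sentence at the top of the thread.
example = [("10", "0", "punct"), ("16", "0", "root"), ("17", "16", "punct")]
for message in root_violations(example):
    print(message)
```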
vcvpaiva commented 7 years ago

@EckhardBick thanks for the explanation!

Since the Floresta treebanks were made by editing Palavras output, it makes sense for me to optimize this Bosque-UD conversion, from which both the treebank and Palavras will benefit.

Indeed! As for the ser/estar copula issue, I have talked to Dan about it; he had a bug in his conversion that he corrected recently, see https://github.com/UniversalDependencies/UD_Portuguese/issues/3. But yes, there might still be lots of problems: the copula is a really difficult issue and extremely pervasive. So making these syntactic compliances work is indeed a great advance.

EckhardBick commented 7 years ago

I only checked for 1 root per sentence and no non-root dependents of 0, but it's true: the roots that don't have 0 as head should probably be parataxis or something. I'll check.

EckhardBick commented 7 years ago

Ok, fixed. New version on its way.

fcbr commented 7 years ago

Latest version fixes this.