UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

English PTB to UD 2.0 #717

Open arademaker opened 3 years ago

arademaker commented 3 years ago

Does anyone know what the best approach is to convert a treebank in PTB format to UD 2.0? I found the page https://nlp.stanford.edu/software/stanford-dependencies.html, but it is not clear if the code supports UD 2.0. Suggestions are welcome.

amir-zeldes commented 3 years ago

You can use CoreNLP to convert PTB brackets for English to UD v1 (more or less, I think it represents a particular moment in time before 2.0 was released, but fairly close to v1 still), like this:

java -cp "*;" -Dfile.encoding=UTF-8 edu.stanford.nlp.trees.ud.UniversalDependenciesConverter -encoding UTF-8 -treeFile FILENAME

If you have a good conversion to Stanford Dependencies, you can also use DepEdit to convert the data to the current UD standard, more or less accurately depending on whether you have some additional annotations (e.g. entities to resolve flat/compound better, etc.). This process is described and evaluated in this paper:

https://www.aclweb.org/anthology/W18-4918/

Finally you can also use a quick and dirty UD1>UD2 DepEdit script to transform the CoreNLP output from the command above to the current guidelines, but there are certain to be errors if you don't have the additional annotations from the paper. This basically just renames the labels that were changed in V2, rewires cc+conj, etc.:

pos=/VERB/;func=/nmod/  #1>#2   #2:func=obl
func=/.*/;func=/conj/;func=/cc/ #1>#2;#1>#3;#1.*#2  #2>#3
func=/dobj/ none    #1:func=obj
func=/mwe/  none    #1:func=fixed
func=/name|foreign/ none    #1:func=flat
func=/neg/  none    #1:func=advmod
func=/nsubjpass/    none    #1:func=nsubj:pass
func=/auxpass/  none    #1:func=aux:pass

If you want the code from the paper, let me know, but it is probably not 100% runnable out of the box (hardwired paths etc.)
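
For readers without DepEdit at hand, the pure relabeling rules in the script above can be approximated in a few lines of Python. This is only a sketch: `RENAMES` and `rename_deprel` are hypothetical names, the input is assumed to be tab-separated CoNLL-U, and the two structural rules (nmod to obl under a verb, and the cc re-attachment) are deliberately left out.

```python
# Sketch only: covers the plain label renames from the DepEdit script above.
# The structural rules (nmod -> obl under a VERB, cc re-attachment) are NOT handled.
RENAMES = {
    "dobj": "obj",
    "mwe": "fixed",
    "name": "flat",
    "foreign": "flat",
    "neg": "advmod",
    "nsubjpass": "nsubj:pass",
    "auxpass": "aux:pass",
}


def rename_deprel(conllu_line: str) -> str:
    """Return the line with its DEPREL (column 8) mapped to the UD v2 label."""
    line = conllu_line.rstrip("\n")
    cols = line.split("\t")
    if line.startswith("#") or len(cols) != 10:
        return line  # comment, blank line, or malformed row: leave untouched
    cols[7] = RENAMES.get(cols[7], cols[7])
    return "\t".join(cols)
```

Applying `rename_deprel` to every line of the converter output gives the label changes only, so the same caveat about errors applies.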

sebschu commented 3 years ago

Since CoreNLP v4.0.0, the converter actually outputs UDv2!

You can run it, as suggested by Amir, using the command:

java -cp "*;" -Dfile.encoding=UTF-8 edu.stanford.nlp.trees.ud.UniversalDependenciesConverter -encoding UTF-8 -treeFile FILENAME

arademaker commented 3 years ago

Just to let people know... I got some errors when I ran the UD validation script on the output data produced by CoreNLP 4.0 over the https://catalog.ldc.upenn.edu/LDC2013T19 dataset. The top 15 most frequent errors are:

41505  [L3 Syntax rel-upos-cop] 'cop' should be 'AUX' or 'PRON'/'DET' but it is 'VERB'
1780  [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'ADP'
 780  [L3 Syntax right-to-left-conj] Relation 'conj' must go left-to-right.
 568  [L3 Syntax rel-upos-aux] 'aux' should be 'AUX' but it is 'VERB'
 489  [L3 Syntax rel-upos-punct] 'punct' must be 'PUNCT' but it is 'SYM'
 320  [L3 Syntax rel-upos-nummod] 'nummod' should be 'NUM' but it is 'DET'
 304  [L3 Syntax rel-upos-nummod] 'nummod' should be 'NUM' but it is 'ADJ'
 234  [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'X'
 208  [L3 Syntax rel-upos-cc] 'cc' should not be 'DET'
 175  [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'NOUN'
 136  [L3 Syntax upos-rel-punct] 'PUNCT' must be 'punct' but it is 'conj'
  63  [L3 Syntax rel-upos-case] 'case' should not be 'ADJ'
  61  [L3 Syntax rel-upos-nummod] 'nummod' should be 'NUM' but it is 'ADV'
  48  [L3 Syntax rel-upos-mark] 'mark' should not be 'DET'
  46  [L3 Syntax right-to-left-appos] Relation 'appos' must go left-to-right.

nschneid commented 10 months ago

Any update on CoreNLP's PTB->UD conversion producing invalid UD? @sebschu @manning @AngledLuffa

AngledLuffa commented 10 months ago

that looks like a project! i will find time this year to start chipping away at that, but there's some work i simply can't put off any longer as i promised it for an upcoming industry event

AngledLuffa commented 10 months ago

actually, one way to speed this up would be to suggest a few command lines for doing the validation

nschneid commented 10 months ago

I think this should run validation for EWT:

$ cd UD_English-EWT
$ git clone https://github.com/UniversalDependencies/tools/
$ tools/validate.py --lang en en_ewt-ud-{dev,test,train}.conllu

AngledLuffa commented 4 months ago

Drilling down a bit into the most common error, that of a cop being tagged VERB instead of AUX, here is a concrete example, using this EWT tree:

( (S
    (NP-SBJ (DT The) (JJ actual) (NN vote))
    (VP (VBZ is)
      (ADJP-PRD
        (NP (DT a) (JJ little))
        (JJ confusing)))
    (. .)))

Our POS tag converter code has a comment:

https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/trees/UniversalPOSMapper.java
https://github.com/stanfordnlp/CoreNLP/blob/3499d27e615c35702f23948e886a7389b5695c33/data/edu/stanford/nlp/upos/ENUniversalPOS.tsurgeon#L45

% Don't do this, we are now treating these as copular constructions

and that part of the conversion being commented out results in the tag VERB instead of AUX

1       The     the     DET     DT      _       3       det     _       _
2       actual  actual  ADJ     JJ      _       3       amod    _       _
3       vote    vote    NOUN    NN      _       7       nsubj   _       _
4       is      be      VERB    VBZ     _       7       cop     _       _
5       a       a       DET     DT      _       6       det     _       _
6       little  little  ADJ     JJ      _       7       obl:npmod       _       _
7       confusing       confusing       ADJ     JJ      _       0       root    _       _
8       .       .       PUNCT   .       _       7       punct   _       _

whereas the UD version of that sentence is

# sent_id = weblog-blogspot.com_aggressivevoicedaily_20060629164800_ENG_20060629_164800-0002
# text = The actual vote is a little confusing.
1       The     the     DET     DT      Definite=Def|PronType=Art       3       det     3:det   _
2       actual  actual  ADJ     JJ      Degree=Pos      3       amod    3:amod  _
3       vote    vote    NOUN    NN      Number=Sing     7       nsubj   7:nsubj _
4       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   7       cop     7:cop   _
5       a       a       DET     DT      Definite=Ind|PronType=Art       6       det     6:det   _
6       little  little  ADJ     JJ      Degree=Pos      7       obl:npmod       7:obl:npmod     _
7       confusing       confusing       ADJ     JJ      Degree=Pos      0       root    0:root  SpaceAfter=No
8       .       .       PUNCT   .       _       7       punct   7:punct _
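
Until the tsurgeon rule is sorted out, the mismatch can at least be counted or patched after the fact. Here is a minimal Python sketch, assuming each token has already been split into the ten CoNLL-U columns. `retag_copulas` is a hypothetical helper, not part of CoreNLP or the UD tools, and it is a post-processing workaround rather than a converter fix.

```python
def retag_copulas(rows):
    """Retag tokens attached as `cop` from VERB to AUX, in place.

    rows: list of 10-item lists, one per CoNLL-U token line.
    Returns the number of tokens that were changed.
    """
    changed = 0
    for cols in rows:
        # Column 4 (index 3) is UPOS, column 8 (index 7) is DEPREL.
        if cols[7] == "cop" and cols[3] == "VERB":
            cols[3] = "AUX"
            changed += 1
    return changed
```

Run over the converter output for the sentence above, this would change only token 4 ("is"), which brings the UPOS column in line with the EWT version.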

First, there's a somewhat unfortunate DRY violation here, in that the same rules are repeated in the tsurgeon file and in the constituency -> dependency converter rules:

https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/trees/UniversalEnglishGrammaticalRelations.java

So I'll need to figure out how extensive that problem is and how best to resolve it. There have been a few dependency converter fixes over the years which I assume are not reflected in any way in the POS converter. I also need to figure out how or why this particular rule about cop is being ignored and what to do to fix it.

The other errors probably have similar origins when it comes to UPOS tags being flagged by the validator. They'll each require some individual attention regarding what kind of tree is causing the error and how to fix it.

AngledLuffa commented 4 months ago

for my own reference, i've been doing this to check a single tree:

java edu.stanford.nlp.trees.ud.UniversalDependenciesConverter -encoding UTF-8 -treeFile foo.mrg

or this for an entire slice of PTB:

java edu.stanford.nlp.trees.ud.UniversalDependenciesConverter -encoding UTF-8 -treeFile path/to/en_ptb3_test.mrg > en_ptb_test.conll
tools/validate.py --lang en en_ptb_test.conll --no-tree-text --max-err lots
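
To see which error classes dominate a run like this, the validator messages can be tallied by test id. The sketch below assumes the messages have been captured to a file or list of lines (they may be written to stderr rather than stdout) and that they follow the `[L3 Syntax rel-upos-advmod]` shape quoted in this thread. `tally_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the "[L3 Syntax rel-upos-advmod]" part of a validator message.
# The leading "[Line 542 Sent 17 Node 20]" block does not match because
# "L" must be followed directly by digits.
ERROR_RE = re.compile(r"\[L(\d+) (\w+) ([\w-]+)\]")


def tally_errors(lines):
    """Count validator messages by test id, e.g. 'rel-upos-advmod'."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[match.group(3)] += 1
    return counts
```

`tally_errors(open("errors.txt")).most_common(15)` would then give a table similar to the one posted earlier, though grouped by test id only rather than by full message.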

So here's the next phrase in the dev set which isn't a cop AUX error

                (SBAR
                  (WHNP-1 (WDT which) )
                  (S
                    (NP-SBJ (-NONE- *T*-1) )
                    (VP (VBZ seems)
                      (PP (TO to)
                        (NP (PRP me) ))
                      (ADJP-PRD
                        (ADVP (NN sort) (IN of) )
                        (JJ draconian) ))))))))))

Our converter turns this into

16      which   which   PRON    WDT     _       17      nsubj   _       _
17      seems   seem    VERB    VBZ     _       10      acl:relcl       _       _
18      to      to      ADP     TO      _       19      case    _       _
19      me      I       PRON    PRP     _       17      obl     _       _
20      sort    sort    NOUN    NN      _       22      advmod  _       _
21      of      of      ADP     IN      _       20      case    _       _
22      draconian       draconian       ADJ     JJ      _       17      xcomp   _       _

The error given is

[Line 542 Sent 17 Node 20]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'NOUN'

however, I can find this sentence in EWT which has a similar structure

# sent_id = answers-20111107080027AA9zCIG_ans-0005
# text = its kind of expensive though
1-2     its     _       _       _       _       _       _       _       _
1       it      it      PRON    PRP     Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  5       nsubj   5:nsubj _
2       s       be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|Typo=Yes|VerbForm=Fin  5       cop     5:cop   CorrectForm='s
3       kind    kind    NOUN    NN      ExtPos=ADV|Number=Sing  5       advmod  5:advmod        _
4       of      of      ADP     IN      _       3       fixed   3:fixed _
5       expensive       expensive       ADJ     JJ      Degree=Pos      0       root    0:root  _
6       though  though  ADV     RB      _       5       advmod  5:advmod        _

so that's, quoting the French treebanks this time, kind of BS

although I do notice one difference, that "kind of" is fixed, as opposed to our converter, which turned "sort of" into case(sort, of)

Editing the dependencies to make that a fixed does in fact change that. So apparently that's the fix needed here... the converter needs to turn sort of, kind of, and whatever else matches into fixed instead of case

Continuing to dig into this, the converter has another component which breaks out fixed expressions prior to the tregex expressions run in UniversalEnglishGrammaticalRelations: https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/trees/CoordinationTransformer.java

hey, as it turns out, there's already a thing which does kind of:

https://github.com/stanfordnlp/CoreNLP/blob/3499d27e615c35702f23948e886a7389b5695c33/src/edu/stanford/nlp/trees/CoordinationTransformer.java#L677

    TregexPattern.compile("@ADVP < ((RB|NN=node1 < /^(?i)kind$/) $+ (IN|RB=node2 < /^(?i)of$/))"), //kind of

So this fix is actually rather simple, aside from all the spelunking needed. Just need to turn that kind into kind|sort and make sure that doesn't make a hash of everything else. Looking over the changes it makes to the PTB train set, it's all perfectly reasonable, such as "this project is sort of annoying" and other examples. And hey, not only has this fixed the error in the dev set I was looking at, it also fixes 5 of the 13,633 errors in the train set.
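
The hand edit described above (turning case into fixed for "sort of" / "kind of") can be sketched at the dependency level in Python. This is not the CoordinationTransformer change itself, only an illustration of the target structure. `mark_fixed` and `HEAD_RE` are hypothetical names, and the rows are assumed to be the ten CoNLL-U columns for one sentence.

```python
import re

# Same idea as widening /^(?i)kind$/ to kind|sort in the Tregex pattern,
# written with a flag argument instead of an inline (?i).
HEAD_RE = re.compile(r"^(?:kind|sort)$", re.IGNORECASE)


def mark_fixed(rows):
    """Relabel case(kind|sort, of) as fixed, in place.

    Only fires when 'of' directly follows its head and the head is advmod.
    Returns the number of relabeled tokens.
    """
    by_id = {cols[0]: cols for cols in rows}
    changed = 0
    for cols in rows:
        head = by_id.get(cols[6])
        if head is None or not (cols[0].isdigit() and head[0].isdigit()):
            continue
        if (cols[1].lower() == "of" and cols[7] == "case"
                and HEAD_RE.match(head[1]) and head[7] == "advmod"
                and int(cols[0]) == int(head[0]) + 1):
            cols[7] = "fixed"
            changed += 1
    return changed
```

On the "sort of draconian" fragment this changes only token 21, giving the same fixed(sort, of) shape that EWT uses for "kind of".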

AngledLuffa commented 4 months ago

"This time around_ADP, they're moving even faster" was converted to advmod(time, around). "the last time around" received a similar treatment.

( (S
    (NP-TMP
      (NP (DT This) (NN time) )
      (ADVP (RP around) ))
    (, ,)
    (NP-SBJ (PRP they) )
    (VP (VBP 're)
      (VP (VBG moving)
        (ADVP (RB even) (RBR faster) )))
    (. .) ))

Here are some similar examples in EWT:

17      sometime        sometime        ADV     RB      _       15      advmod  15:advmod       _
18      around  around  ADP     IN      _       19      case    19:case _
19      mid-August      mid-August      PROPN   NNP     Number=Sing     17      obl     17:obl:around   SpaceAfter=No
# sent_id = email-enronsent40_01-0086
# text = - Arrv. Nice around noon?
1       -       -       PUNCT   NFP     _       2       punct   2:punct _
2       Arrv.   arrive  VERB    VB      Abbr=Yes|VerbForm=Inf   0       root    0:root  _
3       Nice    Nice    PROPN   NNP     Number=Sing     2       obl:npmod       2:obl:npmod     _
4       around  around  ADP     IN      _       5       case    5:case  _
5       noon    noon    NOUN    NN      Number=Sing     2       obl     2:obl:around    SpaceAfter=No
6       ?       ?       PUNCT   .       _       2       punct   2:punct _
# sent_id = email-enronsent40_01-0099
11      around  around  ADP     IN      _       12      case    12:case _
12      noon    noon    NOUN    NN      Number=Sing     10      obl     10:obl:around   SpaceAfter=No
23      actions action  NOUN    NNS     Number=Plur     20      conj    20:conj:and|28:nsubj    _
24      around  around  ADP     IN      _       27      case    27:case _
25      the     the     DET     DT      Definite=Def|PronType=Art       27      det     27:det  _
26      same    same    ADJ     JJ      Degree=Pos      27      amod    27:amod _
27      time    time    NOUN    NN      Number=Sing     23      nmod    23:nmod:around  _

Also looking through GUM a bit, it looks like this should be case? But I'm not 100% convinced that's correct. Any suggestions on what to do would be welcome.

AngledLuffa commented 4 months ago

"double" written out is being transformed by our converter into a nummod

[Line 940 Sent 35 Node 23]: [L3 Syntax rel-upos-nummod] 'nummod' should be 'NUM' but it is 'ADV'
21      received        receive VERB    VBN     _       4       ccomp   _       _
22      about   about   ADV     RB      _       23      advmod  _       _
23      double  double  ADV     RB      _       26      nummod  _       _
24      the     the     DET     DT      _       26      det     _       _
25      usual   usual   ADJ     JJ      _       26      amod    _       _
26      volume  volume  NOUN    NN      _       21      obj     _       _

This is because the converter gets a QP and thinks, ah, QP, that's obviously a nummod:

          (VP (VBN received)
            (NP
              (NP
                (QP (RB about) (RB double) )
                (DT the) (JJ usual) (NN volume) )
              (PP (IN of)
                (NP (NNS calls) )))
            (PP-TMP (IN over)
              (NP (DT the) (NN weekend) ))))))

If I look around for possibly similar usages of double in GUM and EWT, it would appear they are typically labeled as amod

# sent_id = GUM_conversation_blacksmithing-85
# text = We — that was kind of a double thing that, we had in — in another class, so it was kinda review for us.
7       a       a       DET     DT      Definite=Ind|PronType=Art       9       det     9:det   _
8       double  double  ADJ     JJ      Degree=Pos      9       amod    9:amod  _
9       thing   thing   NOUN    NN      Number=Sing     0       root    0:root|13:obj   _
# sent_id = answers-20111108083754AAEw5Xc_ans-0016
# text = Travelling on your own you would have to pay double as cabins are sold on the basis of double occupancy.
18      of      of      ADP     IN      _       20      case    20:case _
19      double  double  ADJ     JJ      Degree=Pos      20      amod    20:amod _
20      occupancy       occupancy       NOUN    NN      Number=Sing     17      nmod    17:nmod:of      SpaceAfter=No

However, I'm not sure this is 100% indicative, as those usages of "double" are a bit different. Closer is "twice", such as

# sent_id = newsgroup-groups.google.com_alt.animals_0e65f540816d780c_ENG_20041116_124800-0040
25      twice   twice   ADV     RB      NumForm=Word|NumType=Mult       27      advmod  27:advmod       _
26      that    that    ADV     RB      _       27      advmod  27:advmod       _
27      much    much    ADV     RB      _       22      advmod  22:advmod       _
# sent_id = answers-20111108105629AAiZUDY_ans-0049
3       twice   twice   ADV     RB      NumForm=Word|NumType=Mult       5       advmod  5:advmod        _
4       my      my      PRON    PRP$    Case=Gen|Number=Sing|Person=1|Poss=Yes|PronType=Prs     5       nmod:poss       5:nmod:poss     _
5       size    size    NOUN    NN      Number=Sing     0       root    0:root  _

I like those examples more, and they seem to suggest advmod. It is worth pointing out those are not in QPs in the original EWT trees.

Digging deeper and looking at "half" in the original EWT trees, "half opened" is not in a QP, whereas "half of the furniture" is. "half of what A&E charges" and "half the price" are not. "less than half of the price" IS. "about half the time quoted" is. "half" in this case is tagged DT/PDT as opposed to ADV/RB from "double the usual volume". So that makes me wonder if that "double" was supposed to be a DT, or at least would be in the EWT paradigm? But then there's this usage of "half", which also looks like a weird tagging to me:

# sent_id = weblog-blogspot.com_alaindewitt_20060924104100_ENG_20060924_104100-0028
# text = These 22 countries, with all their oil and natural resources, have a combined GDP smaller than that of Netherlands plus Belgium and equal to half of the GDP of California alone.
26      to      to      ADP     IN      _       27      case    27:case _
27      half    half    NOUN    NN      Number=Sing|NumForm=Word|NumType=Frac   25      obl     25:obl:to       _
28      of      of      ADP     IN      _       30      case    30:case _
29      the     the     DET     DT      Definite=Def|PronType=Art       30      det     30:det  _
30      GDP     GDP     PROPN   NNP     Number=Sing     27      nmod    27:nmod:of      _
31      of      of      ADP     IN      _       32      case    32:case _
32      California      California      PROPN   NNP     Number=Sing     30      nmod    30:nmod:of      _
33      alone   alone   ADV     RB      _       32      advmod  32:advmod       SpaceAfter=No

Effectively, once again, I have no idea what the ultimate resolution of this structure should be.

Hopefully this is somewhat illustrative as to why there is very little movement over time for this issue: there are probably zero people in the world in the center of the Venn diagram of "understands the converter", "feels comfortable making authoritative decisions about dependencies", and "has the time to make these changes".

nschneid commented 4 months ago

I am happy to weigh in to clarify the UD annotation policies. :) It is not surprising that this will be a nontrivial change as in the last couple of years there have been some notable general guidelines changes, some major revisions of English-specific policies (like relative clauses, pronouns, and passives), and hundreds of smaller corrections and policy changes. Some will be reflected in the main UD validator, and others are checked in English-specific validation scripts.

You are quite right that fixed expressions trigger exceptions to the validator rules. Almost all of these fixed expressions are documented here.

I've responded to your question about "this time around" in UniversalDependencies/UD_English-GUM#81.

My gut feeling for "double the price" is advmod. nummod should be limited to actual numbers. Is it possible to change the QP rule to check for a number (tagged NUM)? (An exception: currently, ordinal dates, e.g. "February 28th", have NOUN/nummod to attach the date to the month, but this needs to be changed.)
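
As a rough illustration of the suggestion above, here is a Python sketch that demotes nummod to advmod when the modifier is tagged ADV, as in "about double the usual volume". It is a post-hoc patch on CoNLL-U rows, not the QP rule in the converter. `demote_adv_nummod` is a hypothetical name, and the advmod choice reflects the gut feeling expressed here, which is not settled in this thread.

```python
def demote_adv_nummod(rows):
    """Change nummod to advmod for ADV-tagged modifiers, in place.

    rows: list of 10-item lists, one per CoNLL-U token line.
    Other non-NUM nummod cases (DET, ADJ, the NOUN ordinal-date exception)
    are left alone, since advmod would not obviously be right for them.
    Returns the number of relabeled tokens.
    """
    changed = 0
    for cols in rows:
        if cols[7] == "nummod" and cols[3] == "ADV":
            cols[7] = "advmod"
            changed += 1
    return changed
```

A check inside the converter would presumably test the tag before emitting nummod at all; this version only shows the intended outcome on the output side.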

amir-zeldes commented 4 months ago

See my response on "around" in UniversalDependencies/UD_English-GUM#81

I think in "received double the price", "double" is obj, and "the price" is a modifier of some kind, perhaps nmod:npmod is the best option. My reasoning is that you can drop "the price" and reconstruct it contextually with no change in meaning, but if you drop "double" you get a totally different reading:

Interrogative test:

amir-zeldes commented 4 months ago

> zero people in the world in the center of the Venn diagram

That's probably true, but there are perhaps more grad students with ML skills who might be persuaded to work on postediting the converter output based on trying to match the final UD product in a corpus like EWT... I would actually think that an ML step might be needed anyway for really good results, since UD trees express some things that PTB trees just don't distinguish.

AngledLuffa commented 4 months ago

> I would actually think that an ML step might be needed anyway for really good results, since UD trees express some things that PTB trees just don't distinguish.

I think part of the appeal of this converter is that it is fast, whereas using an ML step to convert the trees would be orders of magnitude slower. Certainly I would expect it to be more accurate, though.

AngledLuffa commented 4 months ago

IN vs RB vs RP in PTB is also giving me headaches for various short phrases. For example, close down_RB, drive down_IN, walk up_IN, laid out_RP, peer out_IN ....

This leads to an error

[Line 1340 Sent 52 Node 28]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'ADP'

in the phrase

                              (VP
                                (ADVP (RB just) )
                                (VBZ drives)
                                (NP (DT the) (NNS prices) )
                                (ADVP-DIR (IN down) )
                                (ADVP (RBR further) )))))))))))))))

23      which   which   PRON    WDT     _       25      nsubj   _       _
24      just    just    ADV     RB      _       25      advmod  _       _
25      drives  drive   VERB    VBZ     _       17      ccomp   _       _
26      the     the     DET     DT      _       27      det     _       _
27      prices  price   NOUN    NNS     _       25      obj     _       _
28      down    down    ADP     IN      _       25      advmod  _       _
29      further far     ADV     RBR     _       25      advmod  _       _

nschneid commented 4 months ago

Is (ADVP-DIR (IN down) ) an error in the Penn tree? I would have expected RB since it's an adverb phrase.

AngledLuffa commented 4 months ago

> Is (ADVP-DIR (IN down) ) an error in the Penn tree?

I think so, but I don't think the converter is the right place to editorialize PTB tags. Perhaps there's some room to apply some heuristics such as a singleton ADVP is treated as a particle in the "go down", "take down", "drive down" senses... I do wonder how easy it will be to distinguish servers and coal miners going down, though, or the sentence "If you're not busy, why not drive down this weekend?"

nschneid commented 4 months ago

Yeah this is why I don't like the idiomaticity criterion. Probably best to trust the Penn tree and live with the occasional stray validator error caused by a Penn error.

AngledLuffa commented 4 months ago

In terms of fixed expressions, how about en masse? That occurs a couple times in PTB

23      with    with    ADP     IN      _       26      case    _       _
24      high    high    ADJ     JJ      _       26      amod    _       _
25      debt    debt    NOUN    NN      _       26      compound        _       _
26      ratios  ratio   NOUN    NNS     _       22      nmod    _       _
27      will    will    AUX     MD      _       29      aux     _       _
28      be      be      AUX     VB      _       29      aux:pass        _       _
29      dumped  dump    VERB    VBN     _       6       ccomp   _       _
30      en      en      ADP     IN      _       31      case    _       _
31      masse   masse   NOUN    NN      _       29      advmod  _       _

32      to      to      PART    TO      _       33      mark    _       _
33      discuss discuss VERB    VB      _       20      advcl   _       _
34      ,       ,       PUNCT   ,       _       33      punct   _       _
35      en      en      X       FW      _       36      compound        _       _
36      masse   masse   X       FW      _       33      obj     _       _
37      ,       ,       PUNCT   ,       _       33      punct   _       _
38      certain certain ADJ     JJ      _       40      amod    _       _
39      controversial   controversial   ADJ     JJ      _       40      amod    _       _
40      proposals       proposal        NOUN    NNS     _       33      obj     _       _

17      individuals     individual      NOUN    NNS     _       18      nsubj   _       _
18      ran     run     VERB    VBD     _       0       root    _       _
19      from    from    ADP     IN      _       21      case    _       _
20      the     the     DET     DT      _       21      det     _       _
21      market  market  NOUN    NN      _       18      obl     _       _
22      en      en      X       FW      _       23      compound        _       _
23      masse   masse   X       FW      _       18      advmod  _       _

Note the inconsistent tagging. I'd like to throw the PTB into space... but I do like fixing trivial errors in large projects

nschneid commented 4 months ago

"en masse" is a good one. Not fixed (that's limited to grammatical expressions) but it falls under our newly articulated policy on foreign expressions. My inclination would be to say the whole thing is a borrowed adverb-expression, so flat(en/ADV masse/ADV).

AngledLuffa commented 4 months ago

> grammatical expressions

Whatever heuristics I have developed to understand these things, they are failing me in this interpretation of "en masse" as not being a grammatical expression. Would you clarify that a little bit?

Also, to be clear, en is the head here, right? advmod attachment in each of the three cases I posted above?

nschneid commented 4 months ago

fixed is for expressions that act like function words. "en masse" basically means 'on a large scale', so it contributes content beyond connecting together pieces of content.

Yes, "en" would be the technical head of "masse" because flat is always left to right.
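
The left-to-right convention for flat can be written down as a tiny Python sketch: the first token takes whatever attachment the whole expression has, and every later token hangs off the first with flat. `flat_attach` is a hypothetical helper, and the token ids in the usage note are taken from the first "en masse" example above.

```python
def flat_attach(token_ids, outer_head, outer_rel):
    """Return {token_id: (head, deprel)} for a flat expression.

    token_ids: ids of the expression's tokens in surface order.
    outer_head, outer_rel: how the expression as a whole attaches.
    """
    first, *rest = token_ids
    arcs = {first: (outer_head, outer_rel)}
    for tid in rest:
        # flat always points from the first token to the later ones
        arcs[tid] = (first, "flat")
    return arcs
```

For "dumped en masse" this would be `flat_attach([30, 31], 29, "advmod")`, which yields advmod(dumped, en) and flat(en, masse).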

amir-zeldes commented 4 months ago

> I think part of the appeal of this converter is that it is fast, whereas using an ML step to convert the trees would be orders of magnitude slower. Certainly I would expect it to be more accurate, though.

Feels like something you could maybe do with a non-neural model, maybe even just a single decision tree, then it wouldn't be slow... but who knows?

AngledLuffa commented 4 months ago

In terms of appositives, here is an example where the converter does something the validator doesn't like:

((S
    (NP-SBJ
      (NP (NNP Edward) (NNP Eskandarian) )
      (, ,)
      (NP
        (NP (JJ former) (NN chairman) )
        (PP (IN of)
          (NP (NNP Della) (NNP Femina)
            (, ,)
            (NNP McNamee) (NNP WCRS\/Boston) )))
      (, ,) )

1       Edward  Edward  PROPN   NNP     _       2       compound        _       _
2       Eskandarian     Eskandarian     PROPN   NNP     _       13      nsubj   _       _
3       ,       ,       PUNCT   ,       _       2       punct   _       _
4       former  former  ADJ     JJ      _       5       amod    _       _
5       chairman        chairman        NOUN    NN      _       2       appos   _       _
6       of      of      ADP     IN      _       11      case    _       _
7       Della   Della   PROPN   NNP     _       11      compound        _       _
8       Femina  Femina  PROPN   NNP     _       11      compound        _       _
9       ,       ,       PUNCT   ,       _       11      punct   _       _
10      McNamee McNamee PROPN   NNP     _       11      appos   _       _
11      WCRS/Boston     WCRS/Boston     PROPN   NNP     _       5       nmod    _       _

The error given is

[Line 1681 Sent 68 Node 10]: [L3 Syntax right-to-left-appos] Relation 'appos' must go left-to-right.

Judging from examples such as this one, I take it the head is meant to be Della or Femina, not WCRS/Boston? Not sure it's the correct analysis either way, but I suppose the heads should be correct regardless

1       In      in      ADP     IN      _       2       case    2:case  _
2       Suwayrah        Suwayrah        PROPN   NNP     Number=Sing     11      obl     11:obl:in       SpaceAfter=No
3       ,       ,       PUNCT   ,       _       2       punct   2:punct _
4       Kut     Kut     PROPN   NNP     Number=Sing     5       compound        5:compound      _
5       Province        Province        PROPN   NNP     Number=Sing     2       appos   2:appos SpaceAfter=No

Similar errors happen for

( (S
    (NP-SBJ-1 (DT That) (NN account) )
    (VP (VBD had)
      (VP (VBN been)
        (VP (VBN handled)
          (NP (-NONE- *-1) )
          (PP (IN by)
            (NP-LGS (NNP Della) (NNP Femina)
              (, ,)
              (NNP McNamee) (NNP WCRS) )))))
    (. .) ))

Then there's the same error in this phrase:

17      Drexel  Drexel  PROPN   NNP     _       23      compound        _       _
18      Burnham Burnham PROPN   NNP     _       23      compound        _       _
19      Lambert Lambert PROPN   NNP     _       23      compound        _       _
20      (       (       PUNCT   -LRB-   _       21      punct   _       _
21      HK      HK      PROPN   NNP     _       23      appos   _       _
22      )       )       PUNCT   -RRB-   _       21      punct   _       _
23      Ltd.    Ltd.    PROPN   NNP     _       15      nmod    _       _

        (PP (IN for)
          (NP
            (NP (NNP Drexel) (NNP Burnham) (NNP Lambert)
              (PRN
                (-LRB- -LRB-)
                (NP-LOC (NNP HK) )
                (-RRB- -RRB-) )
              (NNP Ltd.) )
            (PP-LOC (IN in)
              (NP (NNP Hong) (NNP Kong) ))))))

Has there been some shift in the way noun phrases of names are headed? Our UniversalSemanticHeadFinder.java very clearly wants the rightmost NN / NNP to be the head, such as here

https://github.com/stanfordnlp/CoreNLP/blob/3499d27e615c35702f23948e886a7389b5695c33/src/edu/stanford/nlp/trees/UniversalSemanticHeadFinder.java#L141

I don't see how it's possible to have the appositive go in a right-to-left direction if Ltd is the head of Drexel Burnham Lambert Ltd and the appositive is in the middle of the phrase.
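
For spotting these cases without running the full validator, the direction constraint can be checked directly on the converter output. This is a minimal Python sketch assuming ten-column CoNLL-U rows. `right_to_left_violations` is a hypothetical helper, and the relation set only includes the relations named in this thread (conj, appos, flat), so it may not cover everything the validator checks.

```python
# Relations this thread says must point left-to-right.
LEFT_TO_RIGHT = {"conj", "appos", "flat"}


def right_to_left_violations(rows):
    """Return (token_id, deprel) for arcs whose head comes after the dependent."""
    bad = []
    for cols in rows:
        if not (cols[0].isdigit() and cols[6].isdigit()):
            continue  # skip multiword-token ranges and unset heads
        base_rel = cols[7].split(":")[0]  # ignore subtypes
        if base_rel in LEFT_TO_RIGHT and int(cols[6]) > int(cols[0]):
            bad.append((int(cols[0]), cols[7]))
    return bad
```

On the Eskandarian sentence this flags token 10 (McNamee, appos with head 11) and leaves token 5 (chairman, appos with head 2) alone.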

AngledLuffa commented 4 months ago

either/or just caused a minor bout of swearing which I think may have offended our babysitter for the night

I got this error:

[Line 2735 Sent 104 Node 7]: [L3 Syntax rel-upos-cc] 'cc' should not be 'DET'

from this sentence:

( (S
    (NP-SBJ (DT The) (JJ above) )
    (VP (VBZ represents)
      (NP
        (NP (DT a) (NN triumph) )
        (PP (IN of)
          (NP (DT either) (NN apathy) (CC or) (NN civility) ))))
    (. .) ))

# sent_id = 104
1       The     the     DET     DT      _       2       det     _       _
2       above   above   ADJ     JJ      _       3       nsubj   _       _
3       represents      represent       VERB    VBZ     _       0       root    _       _
4       a       a       DET     DT      _       5       det     _       _
5       triumph triumph NOUN    NN      _       3       obj     _       _
6       of      of      ADP     IN      _       8       case    _       _
7       either  either  DET     DT      _       8       cc:preconj      _       _
8       apathy  apathy  NOUN    NN      _       5       nmod    _       _
9       or      or      CCONJ   CC      _       10      cc      _       _
10      civility        civility        NOUN    NN      _       8       conj    _       _
11      .       .       PUNCT   .       _       3       punct   _       _

So the problem here is that the dependency is correct, but the PTB tag does not follow the UD EWT tagging standard. I don't think this is fixable unless we either allow this in the validator or exercise some editorial powers for the POS tags in the converter.

Example UD EWT phrase:

8       either  either  CCONJ   CC      _       9       cc:preconj      9:cc:preconj    _
9       NET     net     NOUN    NN      Number=Sing     0       root    0:root  SpaceAfter=No
10      -       -       PUNCT   HYPH    _       9       punct   9:punct SpaceAfter=No
11      2       2       NUM     CD      NumForm=Digit|NumType=Card      9       nummod  9:nummod        _
12      or      or      CCONJ   CC      _       13      cc      13:cc   _
13      NET     net     NOUN    NN      Number=Sing     9       conj    9:conj:or       SpaceAfter=No
14      -       -       PUNCT   HYPH    _       13      punct   13:punct        SpaceAfter=No
15      284     284     NUM     CD      NumForm=Digit|NumType=Card      13      nummod  13:nummod       SpaceAfter=No
16      .       .       PUNCT   .       _       9       punct   9:punct _

The dependency is still cc:preconj, but the XPOS tag is CC rather than DT, and the UPOS is CCONJ rather than DET.
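One way to exercise the "editorial powers" option would be a post-processing pass over the converter output. The sketch below is only an illustration and is not part of CoreNLP: `retag_preconj` is a hypothetical helper, and it assumes real CoNLL-U input with ten tab-separated columns (the listings in this thread show the columns space-aligned). It changes only UPOS and leaves the PTB XPOS tag untouched.

```python
def retag_preconj(conllu_text):
    """Retag DET as CCONJ on tokens attached as cc:preconj.

    Hypothetical helper: an editorial override of the PTB-derived tag,
    following the EWT example where "either" is CCONJ.
    """
    out = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Token lines have exactly 10 columns; comments and blank lines pass through.
        if len(cols) == 10 and cols[7] == "cc:preconj" and cols[3] == "DET":
            cols[3] = "CCONJ"
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)
```

Whether to also rewrite XPOS from DT to CC is a separate decision that this sketch deliberately does not make.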

nschneid commented 4 months ago

Right-to-left Appositives

[Line 1681 Sent 68 Node 10]: [L3 Syntax right-to-left-appos] Relation 'appos' must go left-to-right.

Googling around I see "Della Femina McNamee Chicago"—I guess it's a long name of a firm that happens to have a comma in it. Internally it should not have appos. "Della Femina" is almost certainly from a person's name so it should be flat(Della, Femina). I guess the rest should attach to that as flat or asyndetic coordination (conj).

The "(HK)" one doesn't look like an appositive either. An appositive is specifically where the elaborating information is another way of referring to the same entity. "HK" is presumably specifying the location of the entity, so it is something else, arguably compound (the default for nouns-premodifying-nouns) or parataxis (the default for parentheticals). "Kut Province" is also specifying the location of "Suwayrah" (a city); this is like the "city, state" construction (currently appos in EWT but that needs to be changed; a likely choice is nmod:desc, a new subtype we are in the process of adopting).

Either/or

That is clearly a tagging error. From the tag guidelines:

[image: excerpt from the tag guidelines]

AngledLuffa commented 4 months ago

The "(HK)" one doesn't look like an appositive either. An appositive is specifically where the elaborating information is another way of referring to the same entity. "HK" is presumably specifying the location of the entity, so it is something else

I don't think these kinds of distinctions are within the scope of a deterministic annotator, unfortunately.

[Either/or] is clearly a tagging error. From the tag guidelines:

Indeed. The task at hand was to reduce the number of validator errors produced when converting PTB (or just trees in general) to conll, and that's not possible without either editing the tags, changing the validator, or making the converted trees worse.

nschneid commented 4 months ago

I think we have to live with the fact that PTB contains errors. My inclination would be to keep a whitelist of sentence IDs where we know the validator errors are due to a problem with the data, not the converter.
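A whitelist like this could be applied to the validator's output rather than to the treebank. The following is a minimal sketch, assuming messages keep the `[Line N Sent ID Node M]:` prefix seen in the errors quoted in this thread; `KNOWN_DATA_ERRORS` and `unexplained_errors` are hypothetical names, and the single whitelisted ID is only an example.

```python
import re

# Hypothetical whitelist: sentence IDs whose errors are attributed to PTB itself.
KNOWN_DATA_ERRORS = {"104"}

# Captures the sentence ID from the location prefix of a validator message.
LOCATION = re.compile(r"^\[Line \d+ Sent ([^\s\]]+)")

def unexplained_errors(validator_lines, whitelist=KNOWN_DATA_ERRORS):
    """Return only the messages whose sentence ID is not whitelisted."""
    kept = []
    for line in validator_lines:
        match = LOCATION.match(line)
        if match and match.group(1) in whitelist:
            continue  # known data problem, not a converter bug
        kept.append(line)
    return kept
```

Lines that do not carry a location prefix (summaries, warnings without a sentence) are kept as they are.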

For appositions, I think the rule would have to be that if "X , Y" is headed by Y, it is not an appositive. Maybe parataxis is the safest bet. (Or maybe the head rules can be improved but I don't know how.)
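The apposition rule could be approximated after conversion by checking the direction of each appos arc. This is a sketch under assumptions: `relabel_backward_appos` is a hypothetical helper, the input is tab-separated CoNLL-U, and parataxis is used only because it was suggested above as the safest bet.

```python
def relabel_backward_appos(conllu_text, new_label="parataxis"):
    """Relabel appos arcs whose head follows the dependent (right-to-left)."""
    out = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, multiword-token ranges and empty nodes.
        if len(cols) == 10 and cols[0].isdigit() and cols[6].isdigit():
            # A valid appos has its head to the left: head ID < dependent ID.
            if cols[7] == "appos" and int(cols[6]) > int(cols[0]):
                cols[7] = new_label
                line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)
```

This only silences the validator error; it does not decide whether flat, conj, compound or another relation would be the better analysis for a given name.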

AngledLuffa commented 4 months ago

I think we have to live with the fact that PTB contains errors. My inclination would be to keep a whitelist of sentence IDs where we know the validator errors are due to a problem with the data, not the converter.

An exception for either_DT with a cc:preconj dependency and an or later in the sentence might be reasonable. I guess the question would be whether it has more false negatives than we currently get in terms of false positives.

AngledLuffa commented 4 months ago

Based on these discussions, I would say the best we're going to accomplish here is to get rid of the extensive cop & aux disagreements with the verbs. With that in mind, there's a technical issue in our converter where it's getting the right (?) dependency but one of the matching patterns for changing the verb to AUX is firing when I don't think it should. There's a verb over verb pattern which frequently gets turned into advcl, but also gets captured by the aux patterns we have. Here's an example tree portion:

      (SBAR-ADV (RB even) (IN if)
        (S
          (NP-SBJ
            (NP (PRP$ your) (FW pilote) )
            (PP (IN in)
              (NP (JJ silly) (NN plaid) (NN beret) )))
          (VP (VBD kept)            <-----
            (VP (VBG pointing)
              (PRT (RP out) )
              (SBAR
                (WHADJP-2 (WRB how) (`` ``)
                  (ADJP (FW belle) )
                  ('' '') )
                (S
                  (NP-SBJ (PRP it) )
                  (DT all)
                  (VP (VBD was)
                    (ADJP-PRD (-NONE- *T*-2) )))))))))

The original version of the converter turns that into this:

23      even    even    ADV     RB      _       31      advmod  _       _
24      if      if      SCONJ   IN      _       31      mark    _       _
25      your    you     PRON    PRP$    _       26      nmod:poss       _       _
26      pilote  pilote  X       FW      _       31      nsubj   _       _
27      in      in      ADP     IN      _       30      case    _       _
28      silly   silly   ADJ     JJ      _       30      amod    _       _
29      plaid   plaid   NOUN    NN      _       30      compound        _       _
30      beret   beret   NOUN    NN      _       26      nmod    _       _
31      kept    keep    VERB    VBD     _       3       advcl   _       _    <-----
32      pointing        point   VERB    VBG     _       31      xcomp   _       _
33      out     out     ADP     RP      _       32      compound:prt    _       _
34      how     how     ADV     WRB     _       36      advmod  _       _
35      ``      ``      PUNCT   ``      _       36      punct   _       _
36      belle   belle   X       FW      _       40      dep     _       _
37      ''      ''      PUNCT   ''      _       36      punct   _       _
38      it      it      PRON    PRP     _       40      nsubj   _       _
39      all     all     DET     DT      _       40      dep     _       _
40      was     be      VERB    VBD     _       32      ccomp   _       _

Because that section also matches the aux pattern, though, the simplest way of upgrading the converter to add AUX tags for verb over verb auxiliaries captures this as well. I believe advcl is the correct dependency and using a UPOS tag here is incorrect. Is that true? Certainly the validator doesn't like it if I do that...

dan-zeman commented 4 months ago

I believe advcl is the correct dependency and using a UPOS tag here is incorrect. Is that true?

I believe advcl + VERB is correct for keep. I don't understand your note about "using a UPOS tag" being incorrect. Did you mean to say that an AUX tag would be incorrect? Yes, it would. Keep is not considered an auxiliary in English UD.

AngledLuffa commented 4 months ago

Indeed, that is exactly what I meant: I meant to say that using an AUX UPOS tag is incorrect. The limitation here is that our converter has multiple deterministic rules which trigger for that tree section, one of them being the aux rule. Fortunately it prefers the advcl rule for the dependency, but because the aux rule fired as well, my recent changes to the xpos->upos conversion incorrectly update that UPOS to AUX.

In general that should be a fixable problem, and anyway the statistics for PTB are much better with my update:

before:

Morpho errors: 2
Syntax errors: 13628
Warnings: 16

after:

Morpho errors: 178
Syntax errors: 2910
Warnings: 16

I think I should be able to clean up at least some of the new morpho errors I just created.

Thanks!

dan-zeman commented 4 months ago

the aux rule fired as well

I think aux- (and cop-) related problems could be fixed if you can restrict the rule to particular lemmas – AUX is a closed class.
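A lemma restriction of this kind can be sketched as a guard on the tag assignment. The lemma set below is illustrative only and is NOT the validator's actual list of English auxiliaries; `AUX_LEMMAS` and `upos_for_verb` are hypothetical names.

```python
# Illustrative closed class; the real list lives in the UD validator data.
AUX_LEMMAS = {
    "be", "have", "do", "will", "would", "can", "could",
    "may", "might", "must", "shall", "should",
}

def upos_for_verb(lemma, aux_rule_fired):
    """Return AUX only if the aux pattern matched AND the lemma is in the closed class."""
    if aux_rule_fired and lemma.lower() in AUX_LEMMAS:
        return "AUX"
    return "VERB"
```

With this guard, "keep" in the "kept pointing out" example stays VERB even when the verb-over-verb pattern matches.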

AngledLuffa commented 4 months ago

We do that with several of the rules, such as SINV over (VP over aux verb, not next to -ing verb)

For some reason we don't do that for VP over VP over another verb, but in terms of the dependencies, the advcl rules from earlier took precedence... Generally when there's no specific explanation in the code for why it's that way, I feel a compulsion to at least check with @manning to see if he knows why grad students years ago originally wrote it that way before I barge in and change things.

... edit in the morning: actually, adding a no self loop to the rule in question fixes all of the newly introduced "morpho" errors and somehow fixes one of the ones that existed before my recent changes, without changing the dependency trees themselves. I'll call that a success

AngledLuffa commented 4 months ago

Not sure if this is a legit change we could make to the validator: there is a sentence with "mighta" not tokenized into "might have" the way I mighta expected it to be:

# sent_id = 6756
1       If      if      SCONJ   IN      _       4       mark    _       _
2       it      it      PRON    PRP     _       4       nsubj   _       _
3       had     have    AUX     VBD     _       4       aux     _       _
4       been    be      VERB    VBN     _       8       advcl   _       _
5       ,       ,       PUNCT   ,       _       8       punct   _       _
6       he      he      PRON    PRP     _       8       nsubj   _       _
7       mighta  mighta  AUX     MD      _       8       aux     _       _
8       hit     hit     VERB    VB      _       0       root    _       _
9       it      it      PRON    PRP     _       8       obj     _       _
10      out     out     ADP     IN      _       8       compound:prt    _       _
11      .       .       PUNCT   .       _       8       punct   _       _
12      ''      ''      PUNCT   ''      _       8       punct   _       _

Can we get mighta added to the list of words the validator allows for AUX in English?

AngledLuffa commented 4 months ago

The CoreNLP converter consistently changes whether or not into a structure like this:

38      whether whether SCONJ   IN      _       43      mark    _       _
39      or      or      CCONJ   CC      _       38      cc      _       _
40      not     not     ADV     RB      _       38      fixed   _       _
41      it      it      PRON    PRP     _       43      nsubj   _       _
42      is      be      AUX     VBZ     _       43      cop     _       _
43      constitutional  constitutional  ADJ     JJ      _       37      advcl   _       _

1       Whether whether SCONJ   IN      _       7       mark    _       _
2       or      or      CCONJ   CC      _       1       cc      _       _
3       not     not     ADV     RB      _       1       fixed   _       _
4       ``      ``      PUNCT   ``      _       7       punct   _       _
5       great   great   ADJ     JJ      _       6       amod    _       _
6       cases   case    NOUN    NNS     _       7       nsubj   _       _
7       make    make    VERB    VBP     _       18      dep     _       _
8       bad-law bad-law NOUN    NN      _       7       obj     _       _
9       ''      ''      PUNCT   ''      _       7       punct   _       _

but then, if it's further apart, it does this:

8       whether whether SCONJ   IN      _       13      mark    _       _
9       accounts        account NOUN    NNS     _       13      nsubj:pass      _       _
10      receivable      receivable      ADJ     JJ      _       9       amod    _       _
11      had     have    AUX     VBD     _       13      aux     _       _
12      been    be      AUX     VBN     _       13      aux:pass        _       _
13      paid    pay     VERB    VBN     _       7       ccomp   _       _
14      or      or      CCONJ   CC      _       13      cc      _       _
15      not     not     ADV     RB      _       13      advmod  _       _

In EWT, the whole of whether or not is labeled fixed when the three words are adjacent:

# sent_id = email-enronsent01_02-0038
1       So      so      ADV     RB      _       3       advmod  3:advmod        _
2       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      3       nsubj   3:nsubj _
3       question        question        VERB    VBP     Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin   0       root    0:root  _
4       whether whether SCONJ   IN      _       8       mark    8:mark  _
5       or      or      CCONJ   CC      _       4       fixed   4:fixed _
6       not     not     PART    RB      _       4       fixed   4:fixed _
7       you     you     PRON    PRP     Case=Nom|Person=2|PronType=Prs  8       nsubj   8:nsubj|10:nsubj:xsubj  _
8       want    want    VERB    VBP     Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin   3       ccomp   3:ccomp _
9       to      to      PART    TO      _       10      mark    10:mark _
10      publish publish VERB    VB      VerbForm=Inf    8       xcomp   8:xcomp _
11      info    info    NOUN    NN      Number=Sing     10      obj     10:obj  _

Is that the standard CoreNLP's converter should use?
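If the EWT convention were adopted, the adjacent case could be normalized with a small pass like the one below. This is a sketch, not CoreNLP code: `fix_whether_or_not` is a hypothetical helper that takes one sentence as a list of 10-column token rows. It leaves UPOS alone (EWT tags "not" as PART here, while the converter output above has ADV) and does not touch the non-adjacent "whether ... or not" pattern.

```python
def fix_whether_or_not(rows):
    """Attach adjacent "or" and "not" to "whether" as fixed.

    rows: list of 10-element lists (one CoNLL-U token each), in sentence order.
    The rows are modified in place and also returned.
    """
    for i in range(len(rows) - 2):
        forms = [rows[i + k][1].lower() for k in range(3)]
        if forms == ["whether", "or", "not"]:
            head_id = rows[i][0]  # ID of "whether"
            for k in (1, 2):
                rows[i + k][6] = head_id
                rows[i + k][7] = "fixed"
    return rows
```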

AngledLuffa commented 4 months ago

Forgive my ignorance of what may be a standard dependency, but what should be the dependency between begin and notes in the following sentence?

# sent_id = 2750
1       And     and     CCONJ   CC      _       28      cc      _       _
2       while   while   SCONJ   IN      _       10      mark    _       _
3       customers       customer        NOUN    NNS     _       10      nsubj   _       _
4       such    such    ADJ     JJ      _       8       case    _       _
5       as      as      ADP     IN      _       4       fixed   _       _
6       steel   steel   NOUN    NN      _       8       compound        _       _
7       service service NOUN    NN      _       8       compound        _       _
8       centers center  NOUN    NNS     _       3       nmod    _       _
9       are     be      AUX     VBP     _       10      aux     _       _
10      continuing      continue        VERB    VBG     _       22      advcl   _       _
11      to      to      PART    TO      _       12      mark    _       _
12      reduce  reduce  VERB    VB      _       10      xcomp   _       _
13      inventories     inventory       NOUN    NNS     _       12      obj     _       _
14      through through ADP     IN      _       17      case    _       _
15      the     the     DET     DT      _       17      det     _       _
16      fourth  fourth  ADJ     JJ      _       17      amod    _       _
17      quarter quarter NOUN    NN      _       10      obl     _       _
18      ,       ,       PUNCT   ,       _       22      punct   _       _
19      they    they    PRON    PRP     _       22      nsubj   _       _
20      eventually      eventually      ADV     RB      _       22      advmod  _       _
21      will    will    AUX     MD      _       22      aux     _       _
22      begin   begin   VERB    VB      _       28      dep     _       _
23      stocking        stock   VERB    VBG     _       22      xcomp   _       _
24      up      up      ADP     RP      _       23      compound:prt    _       _
25      again   again   ADV     RB      _       23      advmod  _       _
26      ,       ,       PUNCT   ,       _       28      punct   _       _
27      he      he      PRON    PRP     _       28      nsubj   _       _
28      notes   note    VERB    VBZ     _       0       root    _       _
29      .       .       PUNCT   .       _       28      punct   _       _

amir-zeldes commented 4 months ago

using an AUX UPOS tag

If the only issue is the upos and the tree is correct, is it worth considering passing an xpos -> upos script over the data? FWIW GUM upos is generated from xpos and the tree, and it seems to work fine (I'm sure there are some issues here and there, but it's fairly battle tested at this point)

AngledLuffa commented 4 months ago

If the only issue is the upos and the tree is correct, is it worth considering passing an xpos -> upos script over the data?

That is basically what we do as well, except this was a non-battle-tested case...

https://github.com/stanfordnlp/CoreNLP/blob/dev/src/edu/stanford/nlp/trees/UniversalPOSMapper.java

dan-zeman commented 4 months ago

Not sure if this is a legit change we could make to the validator: there is a sentence with "mighta" not tokenized into "might have" the way I mighta expected it to be:

# sent_id = 6756
1       If      if      SCONJ   IN      _       4       mark    _       _
2       it      it      PRON    PRP     _       4       nsubj   _       _
3       had     have    AUX     VBD     _       4       aux     _       _
4       been    be      VERB    VBN     _       8       advcl   _       _
5       ,       ,       PUNCT   ,       _       8       punct   _       _
6       he      he      PRON    PRP     _       8       nsubj   _       _
7       mighta  mighta  AUX     MD      _       8       aux     _       _
8       hit     hit     VERB    VB      _       0       root    _       _
9       it      it      PRON    PRP     _       8       obj     _       _
10      out     out     ADP     IN      _       8       compound:prt    _       _
11      .       .       PUNCT   .       _       8       punct   _       _
12      ''      ''      PUNCT   ''      _       8       punct   _       _

Can we get mighta added to the list of words the validator allows for AUX in English?

Wouldn't it be better to give it a lemma that is already on the list? Might seems to be a good candidate. And if it is a contraction of might have, then I would consider treating it as a multiword token and splitting it to might and have.

dan-zeman commented 4 months ago

Forgive my ignorance of what may be a standard dependency, but what should be the dependency between begin and notes in the following sentence?

# sent_id = 2750
1       And     and     CCONJ   CC      _       28      cc      _       _
2       while   while   SCONJ   IN      _       10      mark    _       _
3       customers       customer        NOUN    NNS     _       10      nsubj   _       _
4       such    such    ADJ     JJ      _       8       case    _       _
5       as      as      ADP     IN      _       4       fixed   _       _
6       steel   steel   NOUN    NN      _       8       compound        _       _
7       service service NOUN    NN      _       8       compound        _       _
8       centers center  NOUN    NNS     _       3       nmod    _       _
9       are     be      AUX     VBP     _       10      aux     _       _
10      continuing      continue        VERB    VBG     _       22      advcl   _       _
11      to      to      PART    TO      _       12      mark    _       _
12      reduce  reduce  VERB    VB      _       10      xcomp   _       _
13      inventories     inventory       NOUN    NNS     _       12      obj     _       _
14      through through ADP     IN      _       17      case    _       _
15      the     the     DET     DT      _       17      det     _       _
16      fourth  fourth  ADJ     JJ      _       17      amod    _       _
17      quarter quarter NOUN    NN      _       10      obl     _       _
18      ,       ,       PUNCT   ,       _       22      punct   _       _
19      they    they    PRON    PRP     _       22      nsubj   _       _
20      eventually      eventually      ADV     RB      _       22      advmod  _       _
21      will    will    AUX     MD      _       22      aux     _       _
22      begin   begin   VERB    VB      _       28      dep     _       _
23      stocking        stock   VERB    VBG     _       22      xcomp   _       _
24      up      up      ADP     RP      _       23      compound:prt    _       _
25      again   again   ADV     RB      _       23      advmod  _       _
26      ,       ,       PUNCT   ,       _       28      punct   _       _
27      he      he      PRON    PRP     _       28      nsubj   _       _
28      notes   note    VERB    VBZ     _       0       root    _       _
29      .       .       PUNCT   .       _       28      punct   _       _

It should be ccomp as per Amendment 3.

AngledLuffa commented 4 months ago

And if it is a contraction of might have, then I would consider treating it as a multiword token and splitting it to might and have.

I believe this is the correct interpretation (part of the woulda, coulda, shoulda family). In general we aren't editorializing words by splitting them in the converter, so doing that here would be a unique case. I personally think "might" as the lemma would be wrong, since it drops the "have" part of the meaning. Ultimately it might be a case where the validator and the CoreNLP converter never agree.

dan-zeman commented 4 months ago

In general we aren't editorializing words by splitting them in the converter

But I suppose you could :-)

amir-zeldes commented 4 months ago

That is basically what we do as well, except this was a non-battle-tested case...

Feel free to diff its output with ours:

https://github.com/amir-zeldes/gum/blob/master/_build/utils/upos.ini

> pip install depedit
> python -m depedit -c upos.ini file.conllu > output.conllu
AngledLuffa commented 4 months ago

In general we aren't editorializing words by splitting them in the converter

But I suppose you could :-)

Coulda...

but we already get enough "why does your tokenizer do this weird thing" git issues. Intentionally unaligning the tokens in the constituency & dependency graphs is just asking for headaches.

nschneid commented 4 months ago
AngledLuffa commented 4 months ago

"mighta": splitting -a off is reasonable in the abstract (cf. "gonna" => "gon na"). But I don't see any such tokens in EWT, and I would be loath to mess with the Penn tokenization. Maybe just leave as is (and keep "a" in the lemma as it affects the morphosyntax of the clause)?

Agreed. That leaves adding it to the validator as pretty much the only way to resolve that error, but not sure that's on the menu

nschneid commented 4 months ago

I'm willing to update the validator unless @amir-zeldes strongly objects.

nschneid commented 4 months ago

(I found just one relevant token in GUM, where "would" is contracted: "You'd a".)

Assuming we add mighta, musta, coulda, shoulda, woulda, oughtta to the validator, I guess the features should just be VerbForm=Fin (which applies to "might") plus Style=Coll (colloquial)?
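If that proposal were adopted, the features could be filled in with a pass like the following. It is a sketch of the tentative suggestion above, not an agreed convention: `COLLOQUIAL_MODALS` and `add_colloquial_feats` are hypothetical names, and the function works on one 10-column CoNLL-U token row.

```python
COLLOQUIAL_MODALS = {"mighta", "musta", "coulda", "shoulda", "woulda", "oughtta"}

def add_colloquial_feats(cols):
    """Add VerbForm=Fin and Style=Coll to AUX tokens such as "mighta"."""
    if cols[3] == "AUX" and cols[1].lower() in COLLOQUIAL_MODALS:
        # Parse existing FEATS ("_" means none) into a dict.
        feats = {} if cols[5] == "_" else dict(f.split("=", 1) for f in cols[5].split("|"))
        feats["VerbForm"] = "Fin"
        feats["Style"] = "Coll"
        # CoNLL-U expects features sorted alphabetically, ignoring case.
        ordered = sorted(feats.items(), key=lambda kv: kv[0].lower())
        cols[5] = "|".join(f"{name}={value}" for name, value in ordered)
    return cols
```

For the "mighta" token in sentence 6756 this would turn the empty FEATS column into `Style=Coll|VerbForm=Fin`.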