arademaker opened this issue 3 years ago
You can use CoreNLP to convert PTB brackets for English to UD v1 (more or less; I think it represents a particular moment in time before 2.0 was released, but still fairly close to v1), like this:
java -cp "*;" -Dfile.encoding=UTF-8 edu.stanford.nlp.trees.ud.UniversalDependenciesConverter -encoding UTF-8 -treeFile FILENAME
If you have a good conversion to Stanford Dependencies, you can also use DepEdit to convert the data to the current UD standard, more or less accurately depending on whether you have additional annotations (e.g. entity information to resolve flat/compound more accurately). This process is described and evaluated in this paper:
https://www.aclweb.org/anthology/W18-4918/
Finally you can also use a quick and dirty UD1>UD2 DepEdit script to transform the CoreNLP output from the command above to the current guidelines, but there are certain to be errors if you don't have the additional annotations from the paper. This basically just renames the labels that were changed in V2, rewires cc+conj, etc.:
pos=/VERB/;func=/nmod/ #1>#2 #2:func=obl
func=/.*/;func=/conj/;func=/cc/ #1>#2;#1>#3;#1.*#2 #2>#3
func=/dobj/ none #1:func=obj
func=/mwe/ none #1:func=fixed
func=/name|foreign/ none #1:func=flat
func=/neg/ none #1:func=advmod
func=/nsubjpass/ none #1:func=nsubj:pass
func=/auxpass/ none #1:func=aux:pass
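For the simple one-to-one label renames in the script above (everything except the obl and cc/conj rewiring rules, which need tree context), the intended effect can be sketched directly over CoNLL-U lines in Python. This is only an illustration of what those rules do, not a substitute for DepEdit:

```python
# Sketch: apply the simple UD1->UD2 deprel renames from the DepEdit script
# above to a CoNLL-U string. The obl rule and the cc/conj rewiring need real
# tree context, so only the direct renames are shown here.
RENAMES = {
    "dobj": "obj",
    "mwe": "fixed",
    "name": "flat",
    "foreign": "flat",
    "neg": "advmod",
    "nsubjpass": "nsubj:pass",
    "auxpass": "aux:pass",
}

def rename_deprels(conllu: str) -> str:
    out = []
    for line in conllu.splitlines():
        cols = line.split("\t")
        if len(cols) == 10:  # token line; comments and blank lines pass through
            cols[7] = RENAMES.get(cols[7], cols[7])
        out.append("\t".join(cols))
    return "\n".join(out)
```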
If you want the code from the paper, let me know, but it is probably not 100% runnable out of the box (hardcoded paths, etc.).
Since CoreNLP v4.0.0, the converter actually outputs UDv2!
You can run it, as suggested by Amir, using the command:
java -cp "*;" -Dfile.encoding=UTF-8 edu.stanford.nlp.trees.ud.UniversalDependenciesConverter -encoding UTF-8 -treeFile FILENAME
Just to let people know... I got some errors when I ran the UD validation script on the output produced by CoreNLP 4.0 over the https://catalog.ldc.upenn.edu/LDC2013T19 dataset. The 15 most frequent errors are:
41505 [L3 Syntax rel-upos-cop] 'cop' should be 'AUX' or 'PRON'/'DET' but it is 'VERB'
1780 [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'ADP'
780 [L3 Syntax right-to-left-conj] Relation 'conj' must go left-to-right.
568 [L3 Syntax rel-upos-aux] 'aux' should be 'AUX' but it is 'VERB'
489 [L3 Syntax rel-upos-punct] 'punct' must be 'PUNCT' but it is 'SYM'
320 [L3 Syntax rel-upos-nummod] 'nummod' should be 'NUM' but it is 'DET'
304 [L3 Syntax rel-upos-nummod] 'nummod' should be 'NUM' but it is 'ADJ'
234 [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'X'
208 [L3 Syntax rel-upos-cc] 'cc' should not be 'DET'
175 [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'NOUN'
136 [L3 Syntax upos-rel-punct] 'PUNCT' must be 'punct' but it is 'conj'
63 [L3 Syntax rel-upos-case] 'case' should not be 'ADJ'
61 [L3 Syntax rel-upos-nummod] 'nummod' should be 'NUM' but it is 'ADV'
48 [L3 Syntax rel-upos-mark] 'mark' should not be 'DET'
46 [L3 Syntax right-to-left-appos] Relation 'appos' must go left-to-right.
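A tally like this can be reproduced from saved validator output with a few lines of Python (assuming the messages are saved one per line; the bracketed error codes such as [L3 Syntax rel-upos-cop] are what gets counted, and location prefixes like [Line 542 Sent 17 Node 20] are skipped because "Line" has no digit right after the L):

```python
# Tally UD validator messages by error code, most frequent first.
import re
from collections import Counter

def top_errors(log_text, n=15):
    # match the bracketed code plus the rest of the message, e.g.
    # "[L3 Syntax rel-upos-cop] 'cop' should be 'AUX' ..."
    codes = re.findall(r"\[L\d [^\]]+\] .*", log_text)
    return Counter(codes).most_common(n)
```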
Any update on CoreNLP's PTB->UD conversion producing invalid UD? @sebschu @manning @AngledLuffa
That looks like a project! I will find time this year to start chipping away at it, but there's some work I simply can't put off any longer, as I promised it for an upcoming industry event.
Actually, one way to speed this up would be to suggest a few command lines for doing the validation.
I think this should run validation for EWT:
$ cd UD_English-EWT
$ git clone https://github.com/UniversalDependencies/tools/
$ tools/validate.py --lang en en_ewt-ud-{dev,test,train}.conllu
Drilling down a bit into the most common error, that of a cop being tagged VERB instead of AUX, here is a concrete example. In the EWT tree
( (S
(NP-SBJ (DT The) (JJ actual) (NN vote))
(VP (VBZ is)
(ADJP-PRD
(NP (DT a) (JJ little))
(JJ confusing)))
(. .)))
Our POS tag converter code has a comment:
https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/trees/UniversalPOSMapper.java https://github.com/stanfordnlp/CoreNLP/blob/3499d27e615c35702f23948e886a7389b5695c33/data/edu/stanford/nlp/upos/ENUniversalPOS.tsurgeon#L45
% Don't do this, we are now treating these as copular constructions
and that part of the conversion being commented out results in the tag VERB instead of AUX:
1 The the DET DT _ 3 det _ _
2 actual actual ADJ JJ _ 3 amod _ _
3 vote vote NOUN NN _ 7 nsubj _ _
4 is be VERB VBZ _ 7 cop _ _
5 a a DET DT _ 6 det _ _
6 little little ADJ JJ _ 7 obl:npmod _ _
7 confusing confusing ADJ JJ _ 0 root _ _
8 . . PUNCT . _ 7 punct _ _
whereas the UD version of that sentence is
# sent_id = weblog-blogspot.com_aggressivevoicedaily_20060629164800_ENG_20060629_164800-0002
# text = The actual vote is a little confusing.
1 The the DET DT Definite=Def|PronType=Art 3 det 3:det _
2 actual actual ADJ JJ Degree=Pos 3 amod 3:amod _
3 vote vote NOUN NN Number=Sing 7 nsubj 7:nsubj _
4 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 7 cop 7:cop _
5 a a DET DT Definite=Ind|PronType=Art 6 det 6:det _
6 little little ADJ JJ Degree=Pos 7 obl:npmod 7:obl:npmod _
7 confusing confusing ADJ JJ Degree=Pos 0 root 0:root SpaceAfter=No
8 . . PUNCT . _ 7 punct 7:punct _
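For reference, the constraint being violated here can be sketched as a standalone check over converter output. This mirrors the validator's rel-upos-cop rule (a cop dependent must be AUX, or PRON/DET); it is an illustration, not a replacement for the official validator:

```python
# Minimal sketch of the validator's rel-upos-cop constraint: a token attached
# as 'cop' should be tagged AUX (or PRON/DET); flag anything else, e.g. VERB.
def cop_violations(conllu):
    bad = []
    for line in conllu.splitlines():
        cols = line.split("\t")
        if (len(cols) == 10 and cols[7] == "cop"
                and cols[3] not in ("AUX", "PRON", "DET")):
            bad.append((cols[0], cols[1], cols[3]))
    return bad
```

Run against the converter output above, this flags token 4 ("is", VERB); the gold EWT version with AUX passes cleanly.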
First, there's a somewhat unfortunate DRY violation here, in that the same rules are repeated in the tsurgeon file and in the constituency -> dependency converter rules.
So I'll need to figure out how extensive that problem is and how best to resolve it. There have been a few dependency converter fixes over the years which I assume are not reflected in any way in the POS converter. I also need to figure out how or why this particular rule about cop
is being ignored and what to do to fix it.
The other errors probably have similar origins when it comes to UPOS tags being flagged by the validator. They'll each require some individual attention regarding what kind of tree is causing the error and how to fix.
For my own reference, I've been doing this to check a single tree:
java edu.stanford.nlp.trees.ud.UniversalDependenciesConverter -encoding UTF-8 -treeFile foo.mrg
or this for an entire slice of PTB:
java edu.stanford.nlp.trees.ud.UniversalDependenciesConverter -encoding UTF-8 -treeFile path/to/en_ptb3_test.mrg > en_ptb_test.conll
tools/validate.py --lang en en_ptb_test.conll --no-tree-text --max-err lots
So here's the next phrase in the dev set which produces something other than a cop/AUX error:
(SBAR
(WHNP-1 (WDT which) )
(S
(NP-SBJ (-NONE- *T*-1) )
(VP (VBZ seems)
(PP (TO to)
(NP (PRP me) ))
(ADJP-PRD
(ADVP (NN sort) (IN of) )
(JJ draconian) ))))))))))
Our converter turns this into
16 which which PRON WDT _ 17 nsubj _ _
17 seems seem VERB VBZ _ 10 acl:relcl _ _
18 to to ADP TO _ 19 case _ _
19 me I PRON PRP _ 17 obl _ _
20 sort sort NOUN NN _ 22 advmod _ _
21 of of ADP IN _ 20 case _ _
22 draconian draconian ADJ JJ _ 17 xcomp _ _
The error given is
[Line 542 Sent 17 Node 20]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'NOUN'
However, I can find this sentence in EWT, which has a similar structure:
# sent_id = answers-20111107080027AA9zCIG_ans-0005
# text = its kind of expensive though
1-2 its _ _ _ _ _ _ _ _
1 it it PRON PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs 5 nsubj 5:nsubj _
2 s be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|Typo=Yes|VerbForm=Fin 5 cop 5:cop CorrectForm='s
3 kind kind NOUN NN ExtPos=ADV|Number=Sing 5 advmod 5:advmod _
4 of of ADP IN _ 3 fixed 3:fixed _
5 expensive expensive ADJ JJ Degree=Pos 0 root 0:root _
6 though though ADV RB _ 5 advmod 5:advmod _
so that's, quoting the French treebanks this time, kind of BS
although I do notice one difference: "kind of" is fixed, whereas our converter turned "sort of" into case(sort, of). Editing the dependencies to make that a fixed does in fact change that. So apparently that's the fix needed here... the converter needs to turn "sort of", "kind of", and whatever else matches into fixed instead of case.
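The actual fix belongs in the converter's patterns, but the intended output can be sketched as a post-hoc relabeling pass over the converter's CoNLL-U rows (the function name and row layout here are just for illustration):

```python
# Sketch of the fix described above: when "sort"/"kind" is immediately
# followed by "of" attached as case, relabel the "of" as fixed, headed by
# the sort/kind token, mirroring EWT's treatment of "kind of".
def fix_sort_kind_of(tokens):
    """tokens: mutable list of 10-column CoNLL-U rows for one sentence."""
    for prev, cur in zip(tokens, tokens[1:]):
        if (prev[1].lower() in ("sort", "kind") and cur[1].lower() == "of"
                and cur[7] == "case"):
            cur[6] = prev[0]   # head becomes the sort/kind token
            cur[7] = "fixed"
    return tokens
```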
Continuing to dig into this, the converter has another component which breaks out fixed expressions before the tregex expressions in UniversalEnglishGrammaticalRelations are run: https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/trees/CoordinationTransformer.java
Hey, as it turns out, there's already a pattern there which handles "kind of":
TregexPattern.compile("@ADVP < ((RB|NN=node1 < /^(?i)kind$/) $+ (IN|RB=node2 < /^(?i)of$/))"), //kind of
So this fix is actually rather simple, aside from all the spelunking needed: just turn that kind into kind|sort and make sure that doesn't make a hash of everything else. Looking over the changes it makes to the PTB train set, they're all perfectly reasonable, such as "this project is sort of annoying" and other examples. And hey, not only has this fixed the error in the dev set I was looking at, it also fixes 5 of the 13,633 errors in the train set.
"This time around_ADP, they're moving even faster" was converted to advmod(time, around); "the last time around" received similar treatment.
( (S
(NP-TMP
(NP (DT This) (NN time) )
(ADVP (RP around) ))
(, ,)
(NP-SBJ (PRP they) )
(VP (VBP 're)
(VP (VBG moving)
(ADVP (RB even) (RBR faster) )))
(. .) ))
Here are some similar examples in EWT:
17 sometime sometime ADV RB _ 15 advmod 15:advmod _
18 around around ADP IN _ 19 case 19:case _
19 mid-August mid-August PROPN NNP Number=Sing 17 obl 17:obl:around SpaceAfter=No
# sent_id = email-enronsent40_01-0086
# text = - Arrv. Nice around noon?
1 - - PUNCT NFP _ 2 punct 2:punct _
2 Arrv. arrive VERB VB Abbr=Yes|VerbForm=Inf 0 root 0:root _
3 Nice Nice PROPN NNP Number=Sing 2 obl:npmod 2:obl:npmod _
4 around around ADP IN _ 5 case 5:case _
5 noon noon NOUN NN Number=Sing 2 obl 2:obl:around SpaceAfter=No
6 ? ? PUNCT . _ 2 punct 2:punct _
# sent_id = email-enronsent40_01-0099
11 around around ADP IN _ 12 case 12:case _
12 noon noon NOUN NN Number=Sing 10 obl 10:obl:around SpaceAfter=No
23 actions action NOUN NNS Number=Plur 20 conj 20:conj:and|28:nsubj _
24 around around ADP IN _ 27 case 27:case _
25 the the DET DT Definite=Def|PronType=Art 27 det 27:det _
26 same same ADJ JJ Degree=Pos 27 amod 27:amod _
27 time time NOUN NN Number=Sing 23 nmod 23:nmod:around _
Also looking through GUM a bit, it looks like this should be case? But I'm not 100% convinced that's correct. Any suggestions on what to do would be welcome.
double written out as a word is being transformed by our converter into a nummod:
[Line 940 Sent 35 Node 23]: [L3 Syntax rel-upos-nummod] 'nummod' should be 'NUM' but it is 'ADV'
21 received receive VERB VBN _ 4 ccomp _ _
22 about about ADV RB _ 23 advmod _ _
23 double double ADV RB _ 26 nummod _ _
24 the the DET DT _ 26 det _ _
25 usual usual ADJ JJ _ 26 amod _ _
26 volume volume NOUN NN _ 21 obj _ _
This is because the converter sees a QP and thinks, "ah, QP, that's obviously a nummod":
(VP (VBN received)
(NP
(NP
(QP (RB about) (RB double) )
(DT the) (JJ usual) (NN volume) )
(PP (IN of)
(NP (NNS calls) )))
(PP-TMP (IN over)
(NP (DT the) (NN weekend) ))))))
If I look around for possibly similar usages of double in GUM and EWT, it would appear they are typically labeled amod:
# sent_id = GUM_conversation_blacksmithing-85
# text = We — that was kind of a double thing that, we had in — in another class, so it was kinda review for us.
7 a a DET DT Definite=Ind|PronType=Art 9 det 9:det _
8 double double ADJ JJ Degree=Pos 9 amod 9:amod _
9 thing thing NOUN NN Number=Sing 0 root 0:root|13:obj _
# sent_id = answers-20111108083754AAEw5Xc_ans-0016
# text = Travelling on your own you would have to pay double as cabins are sold on the basis of double occupancy.
18 of of ADP IN _ 20 case 20:case _
19 double double ADJ JJ Degree=Pos 20 amod 20:amod _
20 occupancy occupancy NOUN NN Number=Sing 17 nmod 17:nmod:of SpaceAfter=No
However, I'm not sure this is 100% indicative, as those usages of double are a bit different. Closer is twice, such as:
# sent_id = newsgroup-groups.google.com_alt.animals_0e65f540816d780c_ENG_20041116_124800-0040
25 twice twice ADV RB NumForm=Word|NumType=Mult 27 advmod 27:advmod _
26 that that ADV RB _ 27 advmod 27:advmod _
27 much much ADV RB _ 22 advmod 22:advmod _
# sent_id = answers-20111108105629AAiZUDY_ans-0049
3 twice twice ADV RB NumForm=Word|NumType=Mult 5 advmod 5:advmod _
4 my my PRON PRP$ Case=Gen|Number=Sing|Person=1|Poss=Yes|PronType=Prs 5 nmod:poss 5:nmod:poss _
5 size size NOUN NN Number=Sing 0 root 0:root _
I like those examples more, and they seem to suggest advmod. It is worth pointing out that those are not in QPs in the original EWT trees.
Digging deeper and looking at half in the original EWT trees: "half opened" is not in a QP, whereas "half of the furniture" is. "half of what A&E charges" and "half the price" are not. "less than half of the price" IS. "about half the time quoted" is. half in these cases is tagged DT/PDT, as opposed to the ADV/RB in "double the usual volume". So that makes me wonder if that double was supposed to be a DT, or at least would be in the EWT paradigm? But then there's this usage of half, which also looks like a weird tagging to me:
# sent_id = weblog-blogspot.com_alaindewitt_20060924104100_ENG_20060924_104100-0028
# text = These 22 countries, with all their oil and natural resources, have a combined GDP smaller than that of Netherlands plus Belgium and equal to half of the GDP of California alone.
26 to to ADP IN _ 27 case 27:case _
27 half half NOUN NN Number=Sing|NumForm=Word|NumType=Frac 25 obl 25:obl:to _
28 of of ADP IN _ 30 case 30:case _
29 the the DET DT Definite=Def|PronType=Art 30 det 30:det _
30 GDP GDP PROPN NNP Number=Sing 27 nmod 27:nmod:of _
31 of of ADP IN _ 32 case 32:case _
32 California California PROPN NNP Number=Sing 30 nmod 30:nmod:of _
33 alone alone ADV RB _ 32 advmod 32:advmod SpaceAfter=No
Effectively, once again, I have no idea what the ultimate resolution of this structure should be.
Hopefully this is somewhat illustrative of why there is very little movement over time on this issue: there are probably zero people in the world at the center of the Venn diagram of "understands the converter", "feels comfortable making authoritative decisions about dependencies", and "has the time to make these changes".
I am happy to weigh in to clarify the UD annotation policies. :) It is not surprising that this will be a nontrivial change as in the last couple of years there have been some notable general guidelines changes, some major revisions of English-specific policies (like relative clauses, pronouns, and passives), and hundreds of smaller corrections and policy changes. Some will be reflected in the main UD validator, and others are checked in English-specific validation scripts.
You are quite right that fixed expressions trigger exceptions to the validator rules. Almost all of these fixed expressions are documented here.
I've responded to your question about "this time around" in UniversalDependencies/UD_English-GUM#81.
My gut feeling for "double the price" is advmod; nummod should be limited to actual numbers. Is it possible to change the QP rule to check for a number (tagged NUM)? (An exception: currently ordinal dates, e.g. "February 28th", have NOUN/nummod to attach the date to the month, but this needs to be changed.)
See my response on "around" in UniversalDependencies/UD_English-GUM#81
I think in "received double the price", "double" is obj, and "the price" is a modifier of some kind; perhaps nmod:npmod is the best option. My reasoning is that you can drop "the price" and reconstruct it contextually with no change in meaning, but if you drop "double" you get a totally different reading:
Interrogative test:
zero people in the world in the center of the Venn diagram
That's probably true, but there are perhaps more grad students with ML skills who might be persuaded to work on postediting the converter output based on trying to match the final UD product in a corpus like EWT... I would actually think that an ML step might be needed anyway for really good results, since UD trees express some things that PTB trees just don't distinguish.
I would actually think that an ML step might be needed anyway for really good results, since UD trees express some things that PTB trees just don't distinguish.
I think part of the appeal of this converter is that it is fast, whereas using an ML step to convert the trees would be orders of magnitude slower. Certainly I would expect it to be more accurate, though.
IN vs RB vs RP in PTB is also giving me headaches for various short phrases: for example, close down_RB, drive down_IN, walk up_IN, laid out_RP, peer out_IN...
This leads to an error
[Line 1340 Sent 52 Node 28]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'ADP'
in the phrase
(VP
(ADVP (RB just) )
(VBZ drives)
(NP (DT the) (NNS prices) )
(ADVP-DIR (IN down) )
(ADVP (RBR further) )))))))))))))))
23 which which PRON WDT _ 25 nsubj _ _
24 just just ADV RB _ 25 advmod _ _
25 drives drive VERB VBZ _ 17 ccomp _ _
26 the the DET DT _ 27 det _ _
27 prices price NOUN NNS _ 25 obj _ _
28 down down ADP IN _ 25 advmod _ _
29 further far ADV RBR _ 25 advmod _ _
Is (ADVP-DIR (IN down) ) an error in the Penn tree? I would have expected RB, since it's an adverb phrase.
Is (ADVP-DIR (IN down) ) an error in the Penn tree?
I think so, but I don't think the converter is the right place to editorialize PTB tags. Perhaps there's some room to apply some heuristics such as a singleton ADVP is treated as a particle in the "go down", "take down", "drive down" senses... I do wonder how easy it will be to distinguish servers and coal miners going down, though, or the sentence "If you're not busy, why not drive down this weekend?"
Yeah this is why I don't like the idiomaticity criterion. Probably best to trust the Penn tree and live with the occasional stray validator error caused by a Penn error.
In terms of fixed expressions, how about en masse? That occurs a couple of times in PTB:
23 with with ADP IN _ 26 case _ _
24 high high ADJ JJ _ 26 amod _ _
25 debt debt NOUN NN _ 26 compound _ _
26 ratios ratio NOUN NNS _ 22 nmod _ _
27 will will AUX MD _ 29 aux _ _
28 be be AUX VB _ 29 aux:pass _ _
29 dumped dump VERB VBN _ 6 ccomp _ _
30 en en ADP IN _ 31 case _ _
31 masse masse NOUN NN _ 29 advmod _ _
32 to to PART TO _ 33 mark _ _
33 discuss discuss VERB VB _ 20 advcl _ _
34 , , PUNCT , _ 33 punct _ _
35 en en X FW _ 36 compound _ _
36 masse masse X FW _ 33 obj _ _
37 , , PUNCT , _ 33 punct _ _
38 certain certain ADJ JJ _ 40 amod _ _
39 controversial controversial ADJ JJ _ 40 amod _ _
40 proposals proposal NOUN NNS _ 33 obj _ _
17 individuals individual NOUN NNS _ 18 nsubj _ _
18 ran run VERB VBD _ 0 root _ _
19 from from ADP IN _ 21 case _ _
20 the the DET DT _ 21 det _ _
21 market market NOUN NN _ 18 obl _ _
22 en en X FW _ 23 compound _ _
23 masse masse X FW _ 18 advmod _ _
Note the inconsistent tagging. I'd like to throw the PTB into space... but I do like fixing trivial errors in large projects
"en masse" is a good one. Not fixed (that's limited to grammatical expressions), but it falls under our newly articulated policy on foreign expressions. My inclination would be to say the whole thing is a borrowed adverb expression, so flat(en/ADV masse/ADV).
grammatical expressions
Whatever heuristics I have developed to understand these things, they are failing me in this interpretation of "en masse" as not being a grammatical expression. Would you clarify that a little bit?
Also, to be clear, en is the head here, right? And advmod attachment in each of the three cases I posted above?
fixed is for expressions that act like function words. "en masse" basically means 'on a large scale', so it contributes content beyond connecting together pieces of content. Yes, "en" would be the technical head of "masse", because flat always goes left to right.
I think part of the appeal of this converter is that it is fast, whereas using an ML step to convert the trees would be orders of magnitude slower. Certainly I would expect it to be more accurate, though.
Feels like something you could maybe do with a non-neural model, maybe even just a single decision tree, then it wouldn't be slow... but who knows?
In terms of appositives, here is an example where the converter does something the validator doesn't like:
((S
(NP-SBJ
(NP (NNP Edward) (NNP Eskandarian) )
(, ,)
(NP
(NP (JJ former) (NN chairman) )
(PP (IN of)
(NP (NNP Della) (NNP Femina)
(, ,)
(NNP McNamee) (NNP WCRS\/Boston) )))
(, ,) )
1 Edward Edward PROPN NNP _ 2 compound _ _
2 Eskandarian Eskandarian PROPN NNP _ 13 nsubj _ _
3 , , PUNCT , _ 2 punct _ _
4 former former ADJ JJ _ 5 amod _ _
5 chairman chairman NOUN NN _ 2 appos _ _
6 of of ADP IN _ 11 case _ _
7 Della Della PROPN NNP _ 11 compound _ _
8 Femina Femina PROPN NNP _ 11 compound _ _
9 , , PUNCT , _ 11 punct _ _
10 McNamee McNamee PROPN NNP _ 11 appos _ _
11 WCRS/Boston WCRS/Boston PROPN NNP _ 5 nmod _ _
The error given is
[Line 1681 Sent 68 Node 10]: [L3 Syntax right-to-left-appos] Relation 'appos' must go left-to-right.
Judging from examples such as this one, I take it the head is meant to be Della or Femina, not WCRS/Boston? I'm not sure that's the correct analysis either way, but I suppose the heads should be correct regardless:
1 In in ADP IN _ 2 case 2:case _
2 Suwayrah Suwayrah PROPN NNP Number=Sing 11 obl 11:obl:in SpaceAfter=No
3 , , PUNCT , _ 2 punct 2:punct _
4 Kut Kut PROPN NNP Number=Sing 5 compound 5:compound _
5 Province Province PROPN NNP Number=Sing 2 appos 2:appos SpaceAfter=No
Similar errors happen for
( (S
(NP-SBJ-1 (DT That) (NN account) )
(VP (VBD had)
(VP (VBN been)
(VP (VBN handled)
(NP (-NONE- *-1) )
(PP (IN by)
(NP-LGS (NNP Della) (NNP Femina)
(, ,)
(NNP McNamee) (NNP WCRS) )))))
(. .) ))
Then there's the same error in this phrase:
17 Drexel Drexel PROPN NNP _ 23 compound _ _
18 Burnham Burnham PROPN NNP _ 23 compound _ _
19 Lambert Lambert PROPN NNP _ 23 compound _ _
20 ( ( PUNCT -LRB- _ 21 punct _ _
21 HK HK PROPN NNP _ 23 appos _ _
22 ) ) PUNCT -RRB- _ 21 punct _ _
23 Ltd. Ltd. PROPN NNP _ 15 nmod _ _
(PP (IN for)
(NP
(NP (NNP Drexel) (NNP Burnham) (NNP Lambert)
(PRN
(-LRB- -LRB-)
(NP-LOC (NNP HK) )
(-RRB- -RRB-) )
(NNP Ltd.) )
(PP-LOC (IN in)
(NP (NNP Hong) (NNP Kong) ))))))
Has there been some shift in the way noun phrases of names are headed? Our UniversalSemanticHeadFinder.java very clearly wants the rightmost NN / NNP to be the head, such as here
I don't see how it's possible to have the appositive go in a right-to-left direction if Ltd is the head of Drexel Burnham Lambert Ltd and the appositive is in the middle of the phrase.
either/or just caused a minor bout of swearing which I think may have offended our babysitter for the night
I got this error:
[Line 2735 Sent 104 Node 7]: [L3 Syntax rel-upos-cc] 'cc' should not be 'DET'
from this sentence:
( (S
(NP-SBJ (DT The) (JJ above) )
(VP (VBZ represents)
(NP
(NP (DT a) (NN triumph) )
(PP (IN of)
(NP (DT either) (NN apathy) (CC or) (NN civility) ))))
(. .) ))
# sent_id = 104
1 The the DET DT _ 2 det _ _
2 above above ADJ JJ _ 3 nsubj _ _
3 represents represent VERB VBZ _ 0 root _ _
4 a a DET DT _ 5 det _ _
5 triumph triumph NOUN NN _ 3 obj _ _
6 of of ADP IN _ 8 case _ _
7 either either DET DT _ 8 cc:preconj _ _
8 apathy apathy NOUN NN _ 5 nmod _ _
9 or or CCONJ CC _ 10 cc _ _
10 civility civility NOUN NN _ 8 conj _ _
11 . . PUNCT . _ 3 punct _ _
So the problem here is that the dependency is correct, but the PTB tag does not follow the UD EWT tagging standard. I don't think this is fixable unless we either allow this in the validator or exercise some editorial powers over the POS tags in the converter.
Example UD EWT phrase:
8 either either CCONJ CC _ 9 cc:preconj 9:cc:preconj _
9 NET net NOUN NN Number=Sing 0 root 0:root SpaceAfter=No
10 - - PUNCT HYPH _ 9 punct 9:punct SpaceAfter=No
11 2 2 NUM CD NumForm=Digit|NumType=Card 9 nummod 9:nummod _
12 or or CCONJ CC _ 13 cc 13:cc _
13 NET net NOUN NN Number=Sing 9 conj 9:conj:or SpaceAfter=No
14 - - PUNCT HYPH _ 13 punct 13:punct SpaceAfter=No
15 284 284 NUM CD NumForm=Digit|NumType=Card 13 nummod 13:nummod SpaceAfter=No
16 . . PUNCT . _ 9 punct 9:punct _
The dependency is still cc:preconj, but the tag is now CC and not DT.
[Line 1681 Sent 68 Node 10]: [L3 Syntax right-to-left-appos] Relation 'appos' must go left-to-right.
Googling around, I see "Della Femina McNamee Chicago"; I guess it's a long name of a firm that happens to have a comma in it. Internally it should not have appos. "Della Femina" is almost certainly from a person's name, so it should be flat(Della, Femina). I guess the rest should attach to that as flat or asyndetic coordination (conj).
The "(HK)" one doesn't look like an appositive either. An appositive is specifically where the elaborating information is another way of referring to the same entity. "HK" is presumably specifying the location of the entity, so it is something else, arguably compound (the default for nouns premodifying nouns) or parataxis (the default for parentheticals). "Kut Province" is also specifying the location of "Suwayrah" (a city); this is like the "city, state" construction (currently appos in EWT, but that needs to be changed; a likely choice is nmod:desc, a new subtype we are in the process of adopting).
That is clearly a tagging error. From the tag guidelines:
The "(HK)" one doesn't look like an appositive either. An appositive is specifically where the elaborating information is another way of referring to the same entity. "HK" is presumably specifying the location of the entity, so it is something else
I don't think these kinds of distinctions are in the scope of a deterministic annotator, unfortunately.
[Either/or] is clearly a tagging error. From the tag guidelines:
Indeed. The task at hand was to reduce the number of validator errors produced when converting PTB (or just trees in general) to conll, and that's not possible without either editing the tags, changing the validator, or making the converted trees worse.
I think we have to live with the fact that PTB contains errors. My inclination would be to keep a whitelist of sentence IDs where we know the validator errors are due to a problem with the data, not the converter.
For appositions, I think the rule would have to be that if "X , Y" is headed by Y, it is not an appositive. Maybe parataxis is the safest bet. (Or maybe the head rules can be improved, but I don't know how.)
I think we have to live with the fact that PTB contains errors. My inclination would be to keep a whitelist of sentence IDs where we know the validator errors are due to a problem with the data, not the converter.
An exception for either_DT with a cc:preconj dependency and an or later in the sentence might be reasonable. I guess the question would be whether it produces more false negatives than the false positives we currently get.
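That exception could be sketched roughly like this over one sentence's CoNLL-U rows (a hypothetical helper for illustration, not actual validator code):

```python
# Sketch of the proposed validator exception: tolerate "either" tagged DT
# with a cc:preconj dependency when a coordinating "or" appears later in
# the same sentence.
def allow_preconj_either(tokens):
    """tokens: list of 10-column CoNLL-U rows for one sentence."""
    for i, t in enumerate(tokens):
        if t[1].lower() == "either" and t[4] == "DT" and t[7] == "cc:preconj":
            # look for a later coordinating "or"
            if any(u[1].lower() == "or" and u[7] == "cc" for u in tokens[i + 1:]):
                return True
    return False
```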
Based on these discussions, I would say the best we're going to accomplish here is to get rid of the extensive cop & aux disagreements with the verbs. With that in mind, there's a technical issue in our converter where it's getting the right (?) dependency, but one of the matching patterns for changing the verb to AUX is firing when I don't think it should. There's a verb-over-verb pattern which frequently gets turned into advcl, but it also gets captured by the aux patterns we have. Here's an example tree portion:
(SBAR-ADV (RB even) (IN if)
(S
(NP-SBJ
(NP (PRP$ your) (FW pilote) )
(PP (IN in)
(NP (JJ silly) (NN plaid) (NN beret) )))
(VP (VBD kept) <-----
(VP (VBG pointing)
(PRT (RP out) )
(SBAR
(WHADJP-2 (WRB how) (`` ``)
(ADJP (FW belle) )
('' '') )
(S
(NP-SBJ (PRP it) )
(DT all)
(VP (VBD was)
(ADJP-PRD (-NONE- *T*-2) )))))))))
The original version of the converter turns that into this:
23 even even ADV RB _ 31 advmod _ _
24 if if SCONJ IN _ 31 mark _ _
25 your you PRON PRP$ _ 26 nmod:poss _ _
26 pilote pilote X FW _ 31 nsubj _ _
27 in in ADP IN _ 30 case _ _
28 silly silly ADJ JJ _ 30 amod _ _
29 plaid plaid NOUN NN _ 30 compound _ _
30 beret beret NOUN NN _ 26 nmod _ _
31 kept keep VERB VBD _ 3 advcl _ _ <-----
32 pointing point VERB VBG _ 31 xcomp _ _
33 out out ADP RP _ 32 compound:prt _ _
34 how how ADV WRB _ 36 advmod _ _
35 `` `` PUNCT `` _ 36 punct _ _
36 belle belle X FW _ 40 dep _ _
37 '' '' PUNCT '' _ 36 punct _ _
38 it it PRON PRP _ 40 nsubj _ _
39 all all DET DT _ 40 dep _ _
40 was be VERB VBD _ 32 ccomp _ _
Because that section also matches the aux pattern, though, the simplest way of upgrading the converter to add AUX tags for verb-over-verb auxiliaries captures this as well. I believe advcl is the correct dependency and using a UPOS tag here is incorrect. Is that true? Certainly the validator doesn't like it if I do that...
I believe advcl is the correct dependency and using a UPOS tag here is incorrect. Is that true?
I believe advcl + VERB is correct for keep. I don't understand your note about "using a UPOS tag" being incorrect. Did you mean to say that an AUX tag would be incorrect? Yes, it would. Keep is not considered an auxiliary in English UD.
Indeed, that is exactly what I meant: I meant to say using an AUX UPOS tag. The limitation here is that our converter has multiple deterministic rules which trigger for that tree section, one of them being the aux rule. Fortunately it prefers the advcl rule for the dependency, but because the aux rule fired as well, my recent changes to the xpos->upos conversion incorrectly update that UPOS to AUX.
In general that should be a fixable problem, and anyway the statistics for PTB are much better with my update:
before:
Morpho errors: 2
Syntax errors: 13628
Warnings: 16
after:
Morpho errors: 178
Syntax errors: 2910
Warnings: 16
I think I should be able to clean up at least some of the new morpho errors I just created.
Thanks!
the aux rule fired as well
I think aux- (and cop-) related problems could be fixed if you can restrict the rule to particular lemmas – AUX is a closed class.
We do that with several of the rules, such as SINV over (VP over aux verb, not next to -ing verb). For some reason we don't do that for VP over VP over another verb, but in terms of the dependencies, the advcl rules from earlier took precedence... Generally, when there's no specific explanation in the code for why it's that way, I feel a compulsion to at least check with @manning to see if he knows why grad students years ago originally wrote it that way before I barge in and change things.
Edit in the morning: actually, adding a no-self-loop condition to the rule in question fixes all of the newly introduced "morpho" errors, and somehow fixes one of the errors that existed before my recent changes, without changing the dependency trees themselves. I'll call that a success.
Not sure if this is a legit change we could make to the validator: there is a sentence with "mighta" not tokenized into "might have" the way I mighta expected it to be:
# sent_id = 6756
1 If if SCONJ IN _ 4 mark _ _
2 it it PRON PRP _ 4 nsubj _ _
3 had have AUX VBD _ 4 aux _ _
4 been be VERB VBN _ 8 advcl _ _
5 , , PUNCT , _ 8 punct _ _
6 he he PRON PRP _ 8 nsubj _ _
7 mighta mighta AUX MD _ 8 aux _ _
8 hit hit VERB VB _ 0 root _ _
9 it it PRON PRP _ 8 obj _ _
10 out out ADP IN _ 8 compound:prt _ _
11 . . PUNCT . _ 8 punct _ _
12 '' '' PUNCT '' _ 8 punct _ _
Can we get mighta added to the list of words the validator allows as AUX in English?
The CoreNLP converter consistently changes whether or not into a structure like this:
38 whether whether SCONJ IN _ 43 mark _ _
39 or or CCONJ CC _ 38 cc _ _
40 not not ADV RB _ 38 fixed _ _
41 it it PRON PRP _ 43 nsubj _ _
42 is be AUX VBZ _ 43 cop _ _
43 constitutional constitutional ADJ JJ _ 37 advcl _ _
1 Whether whether SCONJ IN _ 7 mark _ _
2 or or CCONJ CC _ 1 cc _ _
3 not not ADV RB _ 1 fixed _ _
4 `` `` PUNCT `` _ 7 punct _ _
5 great great ADJ JJ _ 6 amod _ _
6 cases case NOUN NNS _ 7 nsubj _ _
7 make make VERB VBP _ 18 dep _ _
8 bad-law bad-law NOUN NN _ 7 obj _ _
9 '' '' PUNCT '' _ 7 punct _ _
but then, if it's further apart, it does this:
8 whether whether SCONJ IN _ 13 mark _ _
9 accounts account NOUN NNS _ 13 nsubj:pass _ _
10 receivable receivable ADJ JJ _ 9 amod _ _
11 had have AUX VBD _ 13 aux _ _
12 been be AUX VBN _ 13 aux:pass _ _
13 paid pay VERB VBN _ 7 ccomp _ _
14 or or CCONJ CC _ 13 cc _ _
15 not not ADV RB _ 13 advmod _ _
In EWT, the whole phrase "whether or not" is labeled fixed when the three words are adjacent:
# sent_id = email-enronsent01_02-0038
1 So so ADV RB _ 3 advmod 3:advmod _
2 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 3 nsubj 3:nsubj _
3 question question VERB VBP Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin 0 root 0:root _
4 whether whether SCONJ IN _ 8 mark 8:mark _
5 or or CCONJ CC _ 4 fixed 4:fixed _
6 not not PART RB _ 4 fixed 4:fixed _
7 you you PRON PRP Case=Nom|Person=2|PronType=Prs 8 nsubj 8:nsubj|10:nsubj:xsubj _
8 want want VERB VBP Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin 3 ccomp 3:ccomp _
9 to to PART TO _ 10 mark 10:mark _
10 publish publish VERB VB VerbForm=Inf 8 xcomp 8:xcomp _
11 info info NOUN NN Number=Sing 10 obj 10:obj _
Is that the standard the CoreNLP converter should use?
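For what it's worth, if the converter's output needed to be brought in line with the EWT convention after the fact, a post-processing rule in the style of the UD1>UD2 DepEdit script quoted earlier might look roughly like this. This is an untested sketch: it only covers the adjacent three-word case, matches on lemmas, and assumes the cc/advmod analysis produced by the converter is rewired so that "or" and "not" both attach to "whether" as fixed:

```
lemma=/whether/;lemma=/or/;lemma=/not/	#1.#2;#2.#3	#1>#2;#1>#3;#2:func=fixed;#3:func=fixed
```

The non-adjacent case ("whether ... or not") would need to be left alone, which is exactly why the precedence conditions (`#1.#2;#2.#3`) are part of the match.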
Forgive my ignorance of what may be a standard dependency, but what should be the dependency between begin and notes in the following sentence?
# sent_id = 2750
1 And and CCONJ CC _ 28 cc _ _
2 while while SCONJ IN _ 10 mark _ _
3 customers customer NOUN NNS _ 10 nsubj _ _
4 such such ADJ JJ _ 8 case _ _
5 as as ADP IN _ 4 fixed _ _
6 steel steel NOUN NN _ 8 compound _ _
7 service service NOUN NN _ 8 compound _ _
8 centers center NOUN NNS _ 3 nmod _ _
9 are be AUX VBP _ 10 aux _ _
10 continuing continue VERB VBG _ 22 advcl _ _
11 to to PART TO _ 12 mark _ _
12 reduce reduce VERB VB _ 10 xcomp _ _
13 inventories inventory NOUN NNS _ 12 obj _ _
14 through through ADP IN _ 17 case _ _
15 the the DET DT _ 17 det _ _
16 fourth fourth ADJ JJ _ 17 amod _ _
17 quarter quarter NOUN NN _ 10 obl _ _
18 , , PUNCT , _ 22 punct _ _
19 they they PRON PRP _ 22 nsubj _ _
20 eventually eventually ADV RB _ 22 advmod _ _
21 will will AUX MD _ 22 aux _ _
22 begin begin VERB VB _ 28 dep _ _
23 stocking stock VERB VBG _ 22 xcomp _ _
24 up up ADP RP _ 23 compound:prt _ _
25 again again ADV RB _ 23 advmod _ _
26 , , PUNCT , _ 28 punct _ _
27 he he PRON PRP _ 28 nsubj _ _
28 notes note VERB VBZ _ 0 root _ _
29 . . PUNCT . _ 28 punct _ _
using an AUX UPOS tag
If the only issue is the upos and the tree is correct, is it worth considering passing an xpos -> upos script over the data? FWIW GUM upos is generated from xpos and the tree, and it seems to work fine (I'm sure there are some issues here and there, but it's fairly battle tested at this point)
If the only issue is the upos and the tree is correct, is it worth considering passing an xpos -> upos script over the data?
That is basically what we do as well, except this was a non-battle-tested case...
https://github.com/stanfordnlp/CoreNLP/blob/dev/src/edu/stanford/nlp/trees/UniversalPOSMapper.java
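For illustration, the core of such an xpos -> upos pass is a table lookup over the XPOS column of each CoNLL-U token line. This is a minimal sketch with a deliberately tiny, illustrative table; real mappers like GUM's upos.ini or CoreNLP's UniversalPOSMapper also consult the lemma and the tree for the ambiguous cases (e.g. VB on a copula should become AUX, IN splits into ADP vs. SCONJ):

```python
# Illustrative XPOS -> UPOS table; incomplete by design.
XPOS_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB", "MD": "AUX",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "DT": "DET", "CC": "CCONJ", "PRP": "PRON",
}

def retag(conllu_text: str) -> str:
    """Overwrite UPOS (column 4) from XPOS (column 5) where the table knows the tag."""
    out = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Skip comments and multiword-token ranges like "7-8".
        if len(cols) == 10 and cols[0].isdigit():
            cols[3] = XPOS_TO_UPOS.get(cols[4], cols[3])
            out.append("\t".join(cols))
        else:
            out.append(line)
    return "\n".join(out)
```

A pass like this fixes pure tagging mismatches but cannot, on its own, resolve the context-dependent tags; that is where the tree comes in.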
Not sure if this is a legit change we could make to the validator: there is a sentence with "mighta" not tokenized into "might have" the way I mighta expected it to be:
# sent_id = 6756
1 If if SCONJ IN _ 4 mark _ _
2 it it PRON PRP _ 4 nsubj _ _
3 had have AUX VBD _ 4 aux _ _
4 been be VERB VBN _ 8 advcl _ _
5 , , PUNCT , _ 8 punct _ _
6 he he PRON PRP _ 8 nsubj _ _
7 mighta mighta AUX MD _ 8 aux _ _
8 hit hit VERB VB _ 0 root _ _
9 it it PRON PRP _ 8 obj _ _
10 out out ADP IN _ 8 compound:prt _ _
11 . . PUNCT . _ 8 punct _ _
12 '' '' PUNCT '' _ 8 punct _ _
Can we get mighta added to the list of words the validator allows for AUX in English?
Wouldn't it be better to give it a lemma that is already on the list? Might seems to be a good candidate. And if it is a contraction of might have, then I would consider treating it as a multiword token and splitting it to might and have.
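For illustration, the multiword-token treatment would look roughly like this in CoNLL-U, using the sentence above with the tokens after "mighta" renumbered (the XPOS tags on the syntactic words are my guess, not from any existing annotation):

```
7-8	mighta	_	_	_	_	_	_	_	_
7	might	might	AUX	MD	_	9	aux	_	_
8	a	have	AUX	VB	_	9	aux	_	_
9	hit	hit	VERB	VBN	_	0	root	_	_
```

The surface token keeps the original spelling, while the syntactic words carry the lemmas might and have, so the validator's lemma check would pass without any new entries.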
Forgive my ignorance of what may be a standard dependency, but what should be the dependency between begin and notes in the following sentence?
# sent_id = 2750
1 And and CCONJ CC _ 28 cc _ _
2 while while SCONJ IN _ 10 mark _ _
3 customers customer NOUN NNS _ 10 nsubj _ _
4 such such ADJ JJ _ 8 case _ _
5 as as ADP IN _ 4 fixed _ _
6 steel steel NOUN NN _ 8 compound _ _
7 service service NOUN NN _ 8 compound _ _
8 centers center NOUN NNS _ 3 nmod _ _
9 are be AUX VBP _ 10 aux _ _
10 continuing continue VERB VBG _ 22 advcl _ _
11 to to PART TO _ 12 mark _ _
12 reduce reduce VERB VB _ 10 xcomp _ _
13 inventories inventory NOUN NNS _ 12 obj _ _
14 through through ADP IN _ 17 case _ _
15 the the DET DT _ 17 det _ _
16 fourth fourth ADJ JJ _ 17 amod _ _
17 quarter quarter NOUN NN _ 10 obl _ _
18 , , PUNCT , _ 22 punct _ _
19 they they PRON PRP _ 22 nsubj _ _
20 eventually eventually ADV RB _ 22 advmod _ _
21 will will AUX MD _ 22 aux _ _
22 begin begin VERB VB _ 28 dep _ _
23 stocking stock VERB VBG _ 22 xcomp _ _
24 up up ADP RP _ 23 compound:prt _ _
25 again again ADV RB _ 23 advmod _ _
26 , , PUNCT , _ 28 punct _ _
27 he he PRON PRP _ 28 nsubj _ _
28 notes note VERB VBZ _ 0 root _ _
29 . . PUNCT . _ 28 punct _ _
It should be ccomp as per Amendment 3.
And if it is a contraction of might have, then I would consider treating it as a multiword token and splitting it to might and have.
I believe this is the correct interpretation (part of the woulda, coulda, shoulda family). In general we aren't editorializing words by splitting them in the converter, so doing that here would be a unique case. I personally think "might" as the lemma would be wrong, since it drops the "have" part of the meaning. Ultimately it might be a case where the validator and the CoreNLP converter never agree.
In general we aren't editorializing words by splitting them in the converter
But I suppose you could :-)
That is basically what we do as well, except this was a non-battle-tested case...
Feel free to diff its output with ours:
https://github.com/amir-zeldes/gum/blob/master/_build/utils/upos.ini
> pip install depedit
> python -m depedit -c upos.ini file.conllu > output.conllu
In general we aren't editorializing words by splitting them in the converter
But I suppose you could :-)
Coulda...
but we already get enough "why does your tokenizer do this weird thing" git issues. Intentionally unaligning the tokens in the constituency & dependency graphs is just asking to give me headaches
The fixed approach is just for when the 3 words are together. "Whether you like it or not" can be a coordination between "like" and "not" (where "not" is short for "don't like it").
"mighta": splitting -a off is reasonable in the abstract (cf. "gonna" => "gon na"). But I don't see any such tokens in EWT, and I would be loath to mess with the Penn tokenization. Maybe just leave it as is (and keep "a" in the lemma, as it affects the morphosyntax of the clause)?
Agreed. That leaves adding it to the validator as pretty much the only way to resolve that error, but not sure that's on the menu
I'm willing to update the validator unless @amir-zeldes strongly objects.
(I found just one relevant token in GUM, where "would" is contracted: "You'd a".)
Assuming we add mighta, musta, coulda, shoulda, woulda, oughtta to the validator, I guess the features should just be VerbForm=Fin (which applies to "might") plus Style=Coll (colloquial)?
Does anyone know what is the best approach to convert a treebank in PTB format to UD 2.0? I found the page https://nlp.stanford.edu/software/stanford-dependencies.html, but it is not clear whether the code supports UD 2.0. Suggestions are welcome.