arademaker closed this issue 4 years ago
Two issues were already reported in the propbank and UD_English repositories:
Some mismatches between the PoS tag in the propbank data and the xpostag in the UD data deserve attention. One example is:
# sent_id = reviews-398243-0007
# text = The price was actually lower than what I had anticipated and used to compared to other places, plus he showed me the work he did when I came into pick up the car.
1 The the DET DT Definite=Def|PronType=Art 2 det 2:det Tree=(TOP(S(S(NP*|Framefile=-|Roleset=-|Args=*/(ARG1*/(ARG1*/*/*/*/*/*/*/*
2 price price NOUN NN Number=Sing 5 nsubj 5:nsubj Tree=*)|Framefile=price|Roleset=price.01|Args=(V*)/*)/*)/*/*/*/*/*/*/*
3 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 5 cop 5:cop Tree=(VP*|Framefile=be|Roleset=be.01|Args=*/(V*)/*/*/*/*/*/*/*/*
4 actually actually ADV RB _ 5 advmod 5:advmod Tree=(ADVP*)|Framefile=-|Roleset=-|Args=*/(ARGM-ADV*)/(ARGM-ADV*)/*/*/*/*/*/*/*
5 lower lower ADJ JJR Degree=Cmp 0 root 0:root Tree=(ADJP(ADJP*)|Framefile=low|Roleset=low.04|Args=*/(ARG2*/(V*)/*/*/*/*/*/*/*
6 than than SCONJ IN _ 7 case 7:case Tree=(PP*|Framefile=-|Roleset=-|Args=*/*/(ARGM-CXN*/*/*/*/*/*/*/*
7 what what PRON WP PronType=Int 5 obl 5:obl:than Tree=(SBAR(WHNP*)|Framefile=-|Roleset=-|Args=*/*/*/*/(ARG1*)/*/*/*/*/*
8 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 10 nsubj 10:nsubj|12:nsubj|14:nsubj:xsubj Tree=(S(NP*)|Framefile=-|Roleset=-|Args=*/*/*/*/(ARG0*)/*/*/*/*/*
9 had have AUX VBD Mood=Ind|Tense=Past|VerbForm=Fin 10 aux 10:aux Tree=(UCP(VP*|Framefile=have|Roleset=have.01|Args=*/*/*/(V*)/*/*/*/*/*/*
10 anticipated anticipate VERB VBN Tense=Past|VerbForm=Part 7 acl:relcl 7:acl:relcl Tree=(VP*))|Framefile=anticipate|Roleset=anticipate.01|Args=*/*/*/*/(V*)/*/*/*/*/*
11 and and CCONJ CC _ 12 cc 12:cc Tree=*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
12 used use VERB VBN Tense=Past|VerbForm=Part 10 conj 7:acl:relcl|10:conj:and Tree=(FRAG(ADJP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*|PBPOS=JJ
13 to to ADP IN _ 14 aux 14:aux Tree=(PP*)))|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
14 compared compare VERB VBN Tense=Past|VerbForm=Part 12 xcomp 12:xcomp Tree=(PP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
15 to to ADP IN _ 17 case 17:case Tree=(PP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
16 other other ADJ JJ Degree=Pos 17 amod 17:amod Tree=(NP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
17 places place NOUN NNS Number=Plur 14 obl 14:obl:to SpaceAfter=No|Tree=*))))))))))|Framefile=-|Roleset=-|Args=*/*)/*)/*/*/*/*/*/*/*
18 , , PUNCT , _ 21 punct 21:punct Tree=*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
19 plus plus CCONJ CC _ 21 cc 21:cc Tree=*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
20 he he PRON PRP Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs 21 nsubj 21:nsubj Tree=(S(NP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/(ARG0*)/*/*/*/*
21 showed show VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 5 conj 5:conj:plus Tree=(VP*|Framefile=show|Roleset=show.01|Args=*/*/*/*/*/(V*)/*/*/*/*
22 me I PRON PRP Case=Acc|Number=Sing|Person=1|PronType=Prs 21 iobj 21:iobj Tree=(NP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/(ARG2*)/*/*/*/*
23 the the DET DT Definite=Def|PronType=Art 24 det 24:det Tree=(NP(NP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/(ARG1*/*/*/*/*
24 work work NOUN NN Number=Sing 21 obj 21:obj Tree=*)|Framefile=work|Roleset=work.01|Args=*/*/*/*/*/*/(V*)/(ARGM-PRR*)/*/*
25 he he PRON PRP Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs 26 nsubj 26:nsubj Tree=(SBAR(S(NP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/(ARG0*)/*/*/*
26 did do VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 24 acl:relcl 24:acl:relcl Tree=(VP*))))|Framefile=do|Roleset=do.LV|Args=*/*/*/*/*/*)/(ARGM-LVB*)/(V*)/*/*
27 when when ADV WRB PronType=Int 29 mark 29:mark Tree=(SBAR(WHADVP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/(ARGM-TMP*/*/*/(ARGM-TMP*)/*
28 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 29 nsubj 29:nsubj Tree=(S(NP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/(ARG1*)/(ARG0*)
29 came come VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 21 advcl 21:advcl:when Tree=(VP*|Framefile=come|Roleset=come.01|Args=*/*/*/*/*/*/*/*/(V*)/*
30 in in ADV RB _ 29 advmod 29:advmod SpaceAfter=No|Tree=(ADVP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/(ARGM-DIR*)/*
31 to to PART TO _ 32 mark 32:mark Tree=(S(VP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/(ARGM-PRP*/*
32 pick pick VERB VB VerbForm=Inf 29 advcl 29:advcl:to Tree=(VP*|Framefile=pick|Roleset=pick_up.04|Args=*/*/*/*/*/*/*/*/*/(V*
33 up up ADP RP _ 32 compound:prt 32:compound:prt Tree=(PRT*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*)
34 the the DET DT Definite=Def|PronType=Art 35 det 35:det Tree=(NP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/(ARG1*
35 car car NOUN NN Number=Sing 32 obj 32:obj SpaceAfter=No|Tree=*)))))))))|Framefile=-|Roleset=-|Args=*/*/*/*/*/*)/*/*/*)/*)
The token "used" (token 12) was tagged JJ in EWT and was therefore not annotated as a predicate. In UD this token is analyzed as VERB. The UD tree structure should therefore be very different from the PTB tree.
This is the summary of cases where the original PoS tag is completely different from the current xpostag in the UD data:
% gawk '$0 ~ /sent_id/ {sent=$0} $10 ~ /PBPOS/ {print $4,$5,gensub(/.*(PBPOS=[^/]+).*/,"\\1","g",$10)}' ud+prop.conllu | sort | uniq -c | sort -nr
23 NOUN NNS PBPOS=NN
14 ADV RP PBPOS=RB
8 ADJ JJ PBPOS=NN
6 PRON PRP PBPOS=PRP$
6 ADV JJ PBPOS=RB
5 SYM SYM PBPOS=NN
5 NOUN NN PBPOS=JJ
5 ADV RB PBPOS=IN
4 DET DT PBPOS=NN
4 ADP RP PBPOS=IN
4 ADJ RB PBPOS=JJ
3 X GW PBPOS=NN
3 PROPN JJ PBPOS=NNP
3 NOUN NN PBPOS=IN
3 NOUN NN PBPOS=GW
3 NOUN NN PBPOS=CD
3 ADP RB PBPOS=RP
3 ADJ NN PBPOS=JJ
3 ADJ DT PBPOS=JJ
2 X NNP PBPOS=GW
2 X GW PBPOS=JJ
2 SYM SYM PBPOS=DT
2 SCONJ IN PBPOS=CC
2 PROPN NNP PBPOS=NN
2 PRON PRP$ PBPOS=PRP
2 PART RB PBPOS=VB
2 NOUN NN PBPOS=VB
2 NOUN NN PBPOS=NNP
2 CCONJ CC PBPOS=DT
2 ADV RB PBPOS=CC
2 ADP RB PBPOS=IN
2 ADJ NNP PBPOS=JJ
2 ADJ JJ PBPOS=GW
1 X GW PBPOS=VBN
1 X ADD PBPOS=RP
1 VERB VBP PBPOS=IN
1 VERB VBN PBPOS=JJ
1 VERB VBN PBPOS=GW
1 VERB VBG PBPOS=NN
1 VERB VB PBPOS=TO
1 VERB VB PBPOS=RB
1 VERB VB PBPOS=NNS
1 VERB NNS PBPOS=VBZ
1 SYM UH PBPOS=.
1 SYM UH PBPOS=-RRB-
1 SYM SYM PBPOS=NNP
1 SYM NFP PBPOS=-RRB-
1 SYM NFP PBPOS=-LRB-
1 SYM IN PBPOS=SYM
1 SCONJ IN PBPOS=RB
1 PUNCT PDT PBPOS=''
1 PUNCT . PBPOS=NFP
1 PUNCT -RRB- PBPOS=NFP
1 PUNCT , PBPOS=HYPH
1 PUNCT , PBPOS=.
1 PROPN NNPS PBPOS=NNS
1 PROPN NN PBPOS=NNP
1 PRON EX PBPOS=RB
1 PRON EX PBPOS=PRP
1 PRON DT PBPOS=IN
1 PART TO PBPOS=PRP
1 NUM NN PBPOS=CD
1 NOUN VBG PBPOS=NN
1 NOUN UH PBPOS=NNS
1 NOUN NNS PBPOS=VBZ
1 NOUN NNP PBPOS=NN
1 NOUN NN PBPOS=VBN
1 NOUN NN PBPOS=VBG
1 NOUN NN PBPOS=RB
1 NOUN JJ PBPOS=NN
1 INTJ NN PBPOS=UH
1 INTJ JJ PBPOS=UH
1 DET DT PBPOS=PRP
1 CCONJ CC PBPOS=VB
1 CCONJ CC PBPOS=NNP
1 CCONJ CC PBPOS=NN
1 CCONJ CC PBPOS=IN
1 ADV RBR PBPOS=RB
1 ADV RB PBPOS=VBG
1 ADV RB PBPOS=JJ
1 ADV NN PBPOS=RBS
1 ADV IN PBPOS=RB
1 ADV CC PBPOS=RB
1 ADP TO PBPOS=IN
1 ADP IN PBPOS=RB
1 ADP IN PBPOS=JJ
1 ADP IN PBPOS=DT
1 ADP CC PBPOS=IN
1 ADJ JJ PBPOS=VB
1 ADJ JJ PBPOS=RB
1 ADJ JJ PBPOS=IN
The case ADJ JJ PBPOS=VB means that the token was tagged VB in EWT/Propbank but is analyzed as ADJ in UD. The case NOUN NN PBPOS=VB means that it was tagged VB in EWT/Propbank but UD considers it a NOUN.
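For readers without gawk, the same tally can be sketched in Python. This is a minimal sketch, not the script used for the numbers in this issue: the function names `misc_to_dict` and `tally_pos_mismatches` are hypothetical, and it assumes tab-separated columns with the PBPOS key stored in the tenth (MISC) column, as in the example sentence above.

```python
from collections import Counter


def misc_to_dict(misc):
    """Split a MISC cell such as 'Tree=*|Framefile=-|PBPOS=JJ' into a dict."""
    pairs = (item.split("=", 1) for item in misc.split("|") if "=" in item)
    return {key: value for key, value in pairs}


def tally_pos_mismatches(lines):
    """Count (UPOS, XPOS, PBPOS) triples and collect the affected sent_ids."""
    counts = Counter()
    sentences = set()
    sent_id = None
    prefix = "# sent_id = "
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(prefix):
            sent_id = line[len(prefix):]
            continue
        if not line or line.startswith("#"):
            continue  # blank sentence separators and other comments
        cols = line.split("\t")
        if len(cols) < 10:
            continue  # not a token line with a MISC column
        pbpos = misc_to_dict(cols[9]).get("PBPOS")
        if pbpos is not None:
            counts[(cols[3], cols[4], pbpos)] += 1
            sentences.add(sent_id)
    return counts, sentences
```

Calling `tally_pos_mismatches(open("ud+prop.conllu", encoding="utf-8"))` would return the per-triple counts and the set of affected sentences, so `len(sentences)` plays the role of the `wc -l` count.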
We have 177 sentences with these differences between the xpostag in UD data and the POS tag in the Propbank/EWT data:
% gawk '$0 ~ /sent_id/ {sent=$0} $10 ~ /PBPOS/ {print sent}' ud+prop.conllu | sort | uniq -c | sort -nr | wc -l
177
My suggestions are:
Concerning PTB metadata, my suggestion is that we use the new UD data in practice, ignoring the old data, but do not remove the related info from the data itself. While in all future work it is good to use UD data only, there may be occasions when people want to compare evaluations of models based on new and old data, and having these links to the past in the same file may be useful.
In f631cfa I introduced the first version of the merge. The data is not yet ready to be merged into master.
@arademaker thanks for performing this merge - this will be very useful for anyone that wants to train SRL systems over UD!
A quick question on the format: The Args part (see below) could become very long in a sentence with many verbs / frame evoking elements (it could become something like Args=_/_/_/_/_/_/_/_/_ for each word in the sentence), perhaps impacting readability. The Finnish Proposition Bank has an alternative encoding (see here) that may be more compact/readable?
# newdoc id = weblog-blogspot.com_gettingpolitical_20030906235000_ENG_20030906_235000
# sent_id = weblog-blogspot.com_gettingpolitical_20030906235000_ENG_20030906_235000-0001
# text = The sheikh in wheel-chair has been attacked with a F-16-launched bomb.
1 The the DET DT Definite=Def|PronType=Art 2 det 2:det Framefile=-|Roleset=-|Args=_/_/_
2 sheikh sheikh NOUN NN Number=Sing 9 nsubj:pass 9:nsubj:pass Framefile=-|Roleset=-|Args=_/_/ARG1
3 in in ADP IN _ 6 case 6:case Framefile=-|Roleset=-|Args=_/_/_
4 wheel wheel NOUN NN Number=Sing 6 compound 6:compound SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
5 - - PUNCT HYPH _ 6 punct 6:punct SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
6 chair chair NOUN NN Number=Sing 2 nmod 2:nmod:in Framefile=-|Roleset=-|Args=_/_/_
7 has have AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 9 aux 9:aux Framefile=have|Roleset=have.01|Args=V/_/_
8 been be AUX VBN Tense=Past|VerbForm=Part 9 aux:pass 9:aux:pass Framefile=be|Roleset=be.03|Args=_/V/_
9 attacked attack VERB VBN Tense=Past|VerbForm=Part 0 root 0:root Framefile=attack|Roleset=attack.01|Args=_/_/V
10 with with ADP IN _ 17 case 17:case Framefile=-|Roleset=-|Args=_/_/_
11 a a DET DT Definite=Ind|PronType=Art 17 det 17:det Framefile=-|Roleset=-|Args=_/_/_
12 F f NOUN NN Number=Sing 16 compound 16:compound SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
13 - - PUNCT HYPH _ 12 punct 12:punct SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
14 16 16 NUM CD NumType=Card 12 compound 12:compound SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
15 - - PUNCT HYPH _ 16 punct 16:punct SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
16 launched launch VERB VBN Tense=Past|VerbForm=Part 17 acl 17:acl Framefile=-|Roleset=-|Args=_/_/_
17 bomb bomb NOUN NN Number=Sing 9 obl 9:obl:with SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/ARGM-MNR
18 . . PUNCT . _ 9 punct 9:punct Framefile=-|Roleset=-|Args=_/_/_
Thank you @alanakbik, I agree that we have to think a little bit more about the final format. I actually ended up using the same format used for the other languages and improved the README file, explaining that the .conllu files in this repo are not actually valid .conllu files according to the UD specifications.
This is a bad situation because the extension may lead people to believe that standard CoNLL-U readers can parse the files, which is not the case for now. I also don't like having to deal with a variable number of columns per sentence. The format you suggest above seems to be very concise and it can be encoded in the MISC column.
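The column problem mentioned above is easy to detect mechanically, since CoNLL-U word lines have exactly 10 tab-separated fields. A minimal sketch (the function name `find_malformed_token_lines` is hypothetical, and this checks only the column count, not full CoNLL-U validity):

```python
def find_malformed_token_lines(lines, expected_columns=10):
    """Return (line_number, column_count) for every token line that does not
    have exactly `expected_columns` tab-separated fields."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank sentence separators and comment lines
        count = len(line.split("\t"))
        if count != expected_columns:
            problems.append((number, count))
    return problems
```

An empty result only means that every word line has 10 columns; a file that stores one extra column per predicate would show up here with varying counts.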
Other options are:
@huaiyu-zhu , @yunyaoli ?
Or choose a different extension, e.g. .conllusrl?
I would probably vote against encoding this information in the sentence metadata. CoNLL-U plus or changing the extension are good solutions, but best might be to have this in valid CoNLL-U format since this is what most people/tools use.
So I like your way of encoding SRL in the MISC column; perhaps the readability could just be improved by encoding the arguments with an id-pointer system like in the Finnish Propbank or the enhanced dependency graph (which uses head-deprel pairs)?
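To make the id-pointer idea concrete, here is a minimal sketch that rewrites the slash-separated Args slots as predicate-id:role pairs, in the style of the head-deprel pairs of the enhanced dependencies. The function name `args_to_pointers` and the output shape are hypothetical (this is not the Finnish Propbank format), and the sketch assumes that the i-th slot of Args belongs to the i-th token of the sentence that carries a roleset.

```python
def args_to_pointers(tokens):
    """Convert slot-based Args into 'predicate_id:ROLE' pointers.

    `tokens` is a list of (token_id, roleset, args) tuples for one sentence,
    e.g. (2, "-", "_/_/ARG1"). Returns {token_id: ["9:ARG1", ...]} for the
    tokens that fill at least one role.
    """
    # Assumption: predicates are the tokens whose Roleset is not "-",
    # and Args slots follow their order in the sentence.
    predicate_ids = [tid for tid, roleset, _ in tokens if roleset != "-"]
    pointers = {}
    for tid, _, args in tokens:
        labels = args.split("/")
        if len(labels) != len(predicate_ids):
            raise ValueError(f"token {tid}: {len(labels)} slots, "
                             f"{len(predicate_ids)} predicates")
        pairs = [f"{pid}:{label}"
                 for pid, label in zip(predicate_ids, labels)
                 if label not in ("_", "V")]  # skip empty slots and the verb itself
        if pairs:
            pointers[tid] = pairs
    return pointers
```

With this encoding a token that fills no role needs no Args entry at all, so the length of the annotation no longer grows with the number of predicates in the sentence.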
The idea is to merge the data from propbank in https://github.com/propbank/propbank-release (the subset covering the EWT treebank) with http://github.com/universaldependencies/UD_English-EWT (the same sentences from the EWT, with UD annotations and revisions).
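The per-sentence core of such a merge can be sketched as follows. This is an illustration under stated assumptions, not the merge script from this repository: the function name `merge_sentence` is hypothetical, both sources are assumed to be already aligned token by token, and the PropBank tag is recorded as PBPOS in MISC only when it differs from the UD xpostag, as in the examples in this thread.

```python
def merge_sentence(ud_rows, pb_tags):
    """Attach PropBank POS tags to UD token rows.

    `ud_rows` is a list of 10-column CoNLL-U rows (lists of strings) and
    `pb_tags` the PropBank/EWT POS tags for the same tokens, in order.
    Returns new rows; the input rows are left untouched.
    """
    if len(ud_rows) != len(pb_tags):
        # Tokenization differs between the sources: needs manual alignment.
        raise ValueError("token counts differ")
    merged = []
    for row, pb_tag in zip(ud_rows, pb_tags):
        row = list(row)
        if row[4] != pb_tag:  # column 5 (index 4) is the xpostag
            extra = f"PBPOS={pb_tag}"
            row[9] = extra if row[9] == "_" else f"{row[9]}|{extra}"
        merged.append(row)
    return merged
```

Sentences for which the function raises are the ones where the two tokenizations have diverged and a simple positional join is not enough.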