adding English data from the merge of UD and Propbank

arademaker commented 4 years ago

The idea is to merge the PropBank data from https://github.com/propbank/propbank-release (the subset covering the EWT treebank) with http://github.com/universaldependencies/UD_English-EWT (the same EWT sentences, with UD annotations and revisions).

arademaker commented 4 years ago

Two issues were already reported in the propbank and UD_English repositories:

arademaker commented 4 years ago

Some mismatches between the PoS tag in the PropBank data and the xpostag in the UD data deserve attention. One example is:

# sent_id = reviews-398243-0007
# text = The price was actually lower than what I had anticipated and used to compared to other places, plus he showed me the work he did when I came into pick up the car.
1   The the DET DT  Definite=Def|PronType=Art   2   det 2:det   Tree=(TOP(S(S(NP*|Framefile=-|Roleset=-|Args=*/(ARG1*/(ARG1*/*/*/*/*/*/*/*
2   price   price   NOUN    NN  Number=Sing 5   nsubj   5:nsubj Tree=*)|Framefile=price|Roleset=price.01|Args=(V*)/*)/*)/*/*/*/*/*/*/*
3   was be  AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   5   cop 5:cop   Tree=(VP*|Framefile=be|Roleset=be.01|Args=*/(V*)/*/*/*/*/*/*/*/*
4   actually    actually    ADV RB  _   5   advmod  5:advmod    Tree=(ADVP*)|Framefile=-|Roleset=-|Args=*/(ARGM-ADV*)/(ARGM-ADV*)/*/*/*/*/*/*/*
5   lower   lower   ADJ JJR Degree=Cmp  0   root    0:root  Tree=(ADJP(ADJP*)|Framefile=low|Roleset=low.04|Args=*/(ARG2*/(V*)/*/*/*/*/*/*/*
6   than    than    SCONJ   IN  _   7   case    7:case  Tree=(PP*|Framefile=-|Roleset=-|Args=*/*/(ARGM-CXN*/*/*/*/*/*/*/*
7   what    what    PRON    WP  PronType=Int    5   obl 5:obl:than  Tree=(SBAR(WHNP*)|Framefile=-|Roleset=-|Args=*/*/*/*/(ARG1*)/*/*/*/*/*
8   I   I   PRON    PRP Case=Nom|Number=Sing|Person=1|PronType=Prs  10  nsubj   10:nsubj|12:nsubj|14:nsubj:xsubj    Tree=(S(NP*)|Framefile=-|Roleset=-|Args=*/*/*/*/(ARG0*)/*/*/*/*/*
9   had have    AUX VBD Mood=Ind|Tense=Past|VerbForm=Fin    10  aux 10:aux  Tree=(UCP(VP*|Framefile=have|Roleset=have.01|Args=*/*/*/(V*)/*/*/*/*/*/*
10  anticipated anticipate  VERB    VBN Tense=Past|VerbForm=Part    7   acl:relcl   7:acl:relcl Tree=(VP*))|Framefile=anticipate|Roleset=anticipate.01|Args=*/*/*/*/(V*)/*/*/*/*/*
11  and and CCONJ   CC  _   12  cc  12:cc   Tree=*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
12  used    use VERB    VBN Tense=Past|VerbForm=Part    10  conj    7:acl:relcl|10:conj:and Tree=(FRAG(ADJP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*|PBPOS=JJ
13  to  to  ADP IN  _   14  aux 14:aux  Tree=(PP*)))|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
14  compared    compare VERB    VBN Tense=Past|VerbForm=Part    12  xcomp   12:xcomp    Tree=(PP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
15  to  to  ADP IN  _   17  case    17:case Tree=(PP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
16  other   other   ADJ JJ  Degree=Pos  17  amod    17:amod Tree=(NP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
17  places  place   NOUN    NNS Number=Plur 14  obl 14:obl:to   SpaceAfter=No|Tree=*))))))))))|Framefile=-|Roleset=-|Args=*/*)/*)/*/*/*/*/*/*/*
18  ,   ,   PUNCT   ,   _   21  punct   21:punct    Tree=*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
19  plus    plus    CCONJ   CC  _   21  cc  21:cc   Tree=*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
20  he  he  PRON    PRP Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs  21  nsubj   21:nsubj    Tree=(S(NP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/(ARG0*)/*/*/*/*
21  showed  show    VERB    VBD Mood=Ind|Tense=Past|VerbForm=Fin    5   conj    5:conj:plus Tree=(VP*|Framefile=show|Roleset=show.01|Args=*/*/*/*/*/(V*)/*/*/*/*
22  me  I   PRON    PRP Case=Acc|Number=Sing|Person=1|PronType=Prs  21  iobj    21:iobj Tree=(NP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/(ARG2*)/*/*/*/*
23  the the DET DT  Definite=Def|PronType=Art   24  det 24:det  Tree=(NP(NP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/(ARG1*/*/*/*/*
24  work    work    NOUN    NN  Number=Sing 21  obj 21:obj  Tree=*)|Framefile=work|Roleset=work.01|Args=*/*/*/*/*/*/(V*)/(ARGM-PRR*)/*/*
25  he  he  PRON    PRP Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs  26  nsubj   26:nsubj    Tree=(SBAR(S(NP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/(ARG0*)/*/*/*
26  did do  VERB    VBD Mood=Ind|Tense=Past|VerbForm=Fin    24  acl:relcl   24:acl:relcl    Tree=(VP*))))|Framefile=do|Roleset=do.LV|Args=*/*/*/*/*/*)/(ARGM-LVB*)/(V*)/*/*
27  when    when    ADV WRB PronType=Int    29  mark    29:mark Tree=(SBAR(WHADVP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/(ARGM-TMP*/*/*/(ARGM-TMP*)/*
28  I   I   PRON    PRP Case=Nom|Number=Sing|Person=1|PronType=Prs  29  nsubj   29:nsubj    Tree=(S(NP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/(ARG1*)/(ARG0*)
29  came    come    VERB    VBD Mood=Ind|Tense=Past|VerbForm=Fin    21  advcl   21:advcl:when   Tree=(VP*|Framefile=come|Roleset=come.01|Args=*/*/*/*/*/*/*/*/(V*)/*
30  in  in  ADV RB  _   29  advmod  29:advmod   SpaceAfter=No|Tree=(ADVP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/(ARGM-DIR*)/*
31  to  to  PART    TO  _   32  mark    32:mark Tree=(S(VP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/(ARGM-PRP*/*
32  pick    pick    VERB    VB  VerbForm=Inf    29  advcl   29:advcl:to Tree=(VP*|Framefile=pick|Roleset=pick_up.04|Args=*/*/*/*/*/*/*/*/*/(V*
33  up  up  ADP RP  _   32  compound:prt    32:compound:prt Tree=(PRT*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*)
34  the the DET DT  Definite=Def|PronType=Art   35  det 35:det  Tree=(NP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/(ARG1*
35  car car NOUN    NN  Number=Sing 32  obj 32:obj  SpaceAfter=No|Tree=*)))))))))|Framefile=-|Roleset=-|Args=*/*/*/*/*/*)/*/*/*)/*)

The token "used" (ID 12) was tagged JJ in EWT and, therefore, was not annotated as a predicate. In UD this token is analyzed as VERB. The UD tree structure is therefore likely to be very different from the PTB tree.
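A minimal sketch of how such a mismatch can be read off a single token, assuming the MISC value is a `|`-separated list of `key=value` items as in the example above. The helper name `parse_misc` is hypothetical and not part of any existing tool.

```python
def parse_misc(misc: str) -> dict:
    """Split a CoNLL-U MISC value ('k=v|k=v') into a dict (hypothetical helper)."""
    if misc == "_":
        return {}
    pairs = (item.partition("=") for item in misc.split("|"))
    return {key: value for key, _, value in pairs}


# MISC value and xpostag of token 12 ("used") from the example above.
misc = "Tree=(FRAG(ADJP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*|PBPOS=JJ"
xpos = "VBN"

fields = parse_misc(misc)
if "PBPOS" in fields:
    # PBPOS is only present when the PropBank/EWT tag differs from the UD xpostag.
    print(f"UD xpostag {xpos} vs PropBank tag {fields['PBPOS']}")
```

Since the token has `Roleset=-`, the same dictionary also shows that no predicate annotation was carried over for it.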

arademaker commented 4 years ago

This is the summary of cases where the original PoS tag is completely different from the current xpostag in the UD data:

% gawk '$0 ~ /sent_id/ {sent=$0} $10 ~ /PBPOS/ {print $4,$5,gensub(/.*(PBPOS=[^/]+).*/,"\\1","g",$10)}' ud+prop.conllu  | sort | uniq -c | sort -nr
  23 NOUN NNS PBPOS=NN
  14 ADV RP PBPOS=RB
   8 ADJ JJ PBPOS=NN
   6 PRON PRP PBPOS=PRP$
   6 ADV JJ PBPOS=RB
   5 SYM SYM PBPOS=NN
   5 NOUN NN PBPOS=JJ
   5 ADV RB PBPOS=IN
   4 DET DT PBPOS=NN
   4 ADP RP PBPOS=IN
   4 ADJ RB PBPOS=JJ
   3 X GW PBPOS=NN
   3 PROPN JJ PBPOS=NNP
   3 NOUN NN PBPOS=IN
   3 NOUN NN PBPOS=GW
   3 NOUN NN PBPOS=CD
   3 ADP RB PBPOS=RP
   3 ADJ NN PBPOS=JJ
   3 ADJ DT PBPOS=JJ
   2 X NNP PBPOS=GW
   2 X GW PBPOS=JJ
   2 SYM SYM PBPOS=DT
   2 SCONJ IN PBPOS=CC
   2 PROPN NNP PBPOS=NN
   2 PRON PRP$ PBPOS=PRP
   2 PART RB PBPOS=VB
   2 NOUN NN PBPOS=VB
   2 NOUN NN PBPOS=NNP
   2 CCONJ CC PBPOS=DT
   2 ADV RB PBPOS=CC
   2 ADP RB PBPOS=IN
   2 ADJ NNP PBPOS=JJ
   2 ADJ JJ PBPOS=GW
   1 X GW PBPOS=VBN
   1 X ADD PBPOS=RP
   1 VERB VBP PBPOS=IN
   1 VERB VBN PBPOS=JJ
   1 VERB VBN PBPOS=GW
   1 VERB VBG PBPOS=NN
   1 VERB VB PBPOS=TO
   1 VERB VB PBPOS=RB
   1 VERB VB PBPOS=NNS
   1 VERB NNS PBPOS=VBZ
   1 SYM UH PBPOS=.
   1 SYM UH PBPOS=-RRB-
   1 SYM SYM PBPOS=NNP
   1 SYM NFP PBPOS=-RRB-
   1 SYM NFP PBPOS=-LRB-
   1 SYM IN PBPOS=SYM
   1 SCONJ IN PBPOS=RB
   1 PUNCT PDT PBPOS=''
   1 PUNCT . PBPOS=NFP
   1 PUNCT -RRB- PBPOS=NFP
   1 PUNCT , PBPOS=HYPH
   1 PUNCT , PBPOS=.
   1 PROPN NNPS PBPOS=NNS
   1 PROPN NN PBPOS=NNP
   1 PRON EX PBPOS=RB
   1 PRON EX PBPOS=PRP
   1 PRON DT PBPOS=IN
   1 PART TO PBPOS=PRP
   1 NUM NN PBPOS=CD
   1 NOUN VBG PBPOS=NN
   1 NOUN UH PBPOS=NNS
   1 NOUN NNS PBPOS=VBZ
   1 NOUN NNP PBPOS=NN
   1 NOUN NN PBPOS=VBN
   1 NOUN NN PBPOS=VBG
   1 NOUN NN PBPOS=RB
   1 NOUN JJ PBPOS=NN
   1 INTJ NN PBPOS=UH
   1 INTJ JJ PBPOS=UH
   1 DET DT PBPOS=PRP
   1 CCONJ CC PBPOS=VB
   1 CCONJ CC PBPOS=NNP
   1 CCONJ CC PBPOS=NN
   1 CCONJ CC PBPOS=IN
   1 ADV RBR PBPOS=RB
   1 ADV RB PBPOS=VBG
   1 ADV RB PBPOS=JJ
   1 ADV NN PBPOS=RBS
   1 ADV IN PBPOS=RB
   1 ADV CC PBPOS=RB
   1 ADP TO PBPOS=IN
   1 ADP IN PBPOS=RB
   1 ADP IN PBPOS=JJ
   1 ADP IN PBPOS=DT
   1 ADP CC PBPOS=IN
   1 ADJ JJ PBPOS=VB
   1 ADJ JJ PBPOS=RB
   1 ADJ JJ PBPOS=IN

Each row shows the UD upos, the UD xpostag, and the original PropBank/EWT tag. The case ADJ JJ PBPOS=VB means that the token was a VB in EWT/PropBank but is analyzed as ADJ in UD. The case NOUN NN PBPOS=VB means that it was a VB in EWT/PropBank but UD considers it a NOUN.
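For readers without gawk, the same tally can be sketched in Python. This assumes the real file is tab-separated with 10 columns per token line; the function name `pbpos_summary` and the two-token sample are illustrative only.

```python
from collections import Counter


def pbpos_summary(lines):
    """Count (upos, xpostag, PBPOS=...) triples over CoNLL-U token lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 10:
            continue  # not a standard token line
        for item in cols[9].split("|"):
            if item.startswith("PBPOS="):
                counts[(cols[3], cols[4], item)] += 1
    return counts


# Illustrative sample: one mismatching token and one matching token.
sample = [
    "# sent_id = reviews-398243-0007",
    "12\tused\tuse\tVERB\tVBN\tTense=Past|VerbForm=Part\t10\tconj\t10:conj:and\tRoleset=-|PBPOS=JJ",
    "13\tto\tto\tADP\tIN\t_\t14\taux\t14:aux\tRoleset=-",
]
for (upos, xpos, pbpos), n in pbpos_summary(sample).most_common():
    print(n, upos, xpos, pbpos)
```

To run it on the merged file, pass an open file handle instead of `sample`.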

We have 177 sentences with these differences between the xpostag in UD data and the POS tag in the Propbank/EWT data:

% gawk '$0 ~ /sent_id/ {sent=$0} $10 ~ /PBPOS/ {print sent}' ud+prop.conllu  | sort | uniq -c | sort -nr | wc -l
     177

arademaker commented 4 years ago

My suggestions are:

Let us define the output format. Note that the extra columns were all added to the MISC field in order to produce valid CoNLL-U with 10 columns, but it is easy to expand the values into extra columns.
Mark all these 177 sentences with metadata for manual verification of the SRL annotation.
We should not use the PTB trees anymore; I can remove them from the MISC field.
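The second suggestion could be sketched as follows. The comment key `# srl_review = pos_mismatch` is a made-up placeholder, not an agreed convention, and the sketch assumes tab-separated token lines with sentences separated by a blank line.

```python
def mark_for_review(text: str, flag: str = "# srl_review = pos_mismatch") -> str:
    """Add a sentence-level comment to every sentence containing a PBPOS item."""
    out = []
    for block in text.strip("\n").split("\n\n"):
        lines = block.split("\n")
        has_mismatch = any(
            not ln.startswith("#") and "PBPOS=" in ln.split("\t")[-1]
            for ln in lines
        )
        if has_mismatch and flag not in lines:
            # Place the flag right after the leading comment lines.
            idx = 0
            while idx < len(lines) and lines[idx].startswith("#"):
                idx += 1
            lines.insert(idx, flag)
        out.append("\n".join(lines))
    # CoNLL-U sentences are each terminated by a blank line.
    return "\n\n".join(out) + "\n\n"
```

Running the function twice leaves the file unchanged, so it is safe to re-apply after later edits to the data.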

huaiyu-zhu commented 4 years ago

Concerning PTB metadata, my suggestion is that we use the new UD data in practice, ignoring the old data, but do not remove the related info from the data itself. While in all future work it is good to use UD data only, there may be occasions when people want to compare evaluations of models based on new and old data, and having these links to the past in the same file may be useful.

arademaker commented 4 years ago

In f631cfa I introduced the first version of the merge. The data is not yet ready for merging into master.

alanakbik commented 4 years ago

@arademaker thanks for performing this merge - this will be very useful for anyone that wants to train SRL systems over UD!

A quick question on the format: The Args part (see below) could become very long in a sentence with many verbs / frame-evoking elements (it could become something like Args=_/_/_/_/_/_/_/_/_ for each word in the sentence), perhaps impacting readability. The Finnish Proposition Bank has an alternative encoding (see here) that may be more compact/readable?

# newdoc id = weblog-blogspot.com_gettingpolitical_20030906235000_ENG_20030906_235000
# sent_id = weblog-blogspot.com_gettingpolitical_20030906235000_ENG_20030906_235000-0001
# text = The sheikh in wheel-chair has been attacked with a F-16-launched bomb.
1   The the DET DT  Definite=Def|PronType=Art   2   det 2:det   Framefile=-|Roleset=-|Args=_/_/_
2   sheikh  sheikh  NOUN    NN  Number=Sing 9   nsubj:pass  9:nsubj:pass    Framefile=-|Roleset=-|Args=_/_/ARG1
3   in  in  ADP IN  _   6   case    6:case  Framefile=-|Roleset=-|Args=_/_/_
4   wheel   wheel   NOUN    NN  Number=Sing 6   compound    6:compound  SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
5   -   -   PUNCT   HYPH    _   6   punct   6:punct SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
6   chair   chair   NOUN    NN  Number=Sing 2   nmod    2:nmod:in   Framefile=-|Roleset=-|Args=_/_/_
7   has have    AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   9   aux 9:aux   Framefile=have|Roleset=have.01|Args=V/_/_
8   been    be  AUX VBN Tense=Past|VerbForm=Part    9   aux:pass    9:aux:pass  Framefile=be|Roleset=be.03|Args=_/V/_
9   attacked    attack  VERB    VBN Tense=Past|VerbForm=Part    0   root    0:root  Framefile=attack|Roleset=attack.01|Args=_/_/V
10  with    with    ADP IN  _   17  case    17:case Framefile=-|Roleset=-|Args=_/_/_
11  a   a   DET DT  Definite=Ind|PronType=Art   17  det 17:det  Framefile=-|Roleset=-|Args=_/_/_
12  F   f   NOUN    NN  Number=Sing 16  compound    16:compound SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
13  -   -   PUNCT   HYPH    _   12  punct   12:punct    SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
14  16  16  NUM CD  NumType=Card    12  compound    12:compound SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
15  -   -   PUNCT   HYPH    _   16  punct   16:punct    SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
16  launched    launch  VERB    VBN Tense=Past|VerbForm=Part    17  acl 17:acl  Framefile=-|Roleset=-|Args=_/_/_
17  bomb    bomb    NOUN    NN  Number=Sing 9   obl 9:obl:with  SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/ARGM-MNR
18  .   .   PUNCT   .   _   9   punct   9:punct Framefile=-|Roleset=-|Args=_/_/_

arademaker commented 4 years ago

Thank you @alanakbik, I agree we have to think a little bit more about the final format. I actually ended up using the same format used for the other languages and improved the README file, explaining that the .conllu files in this repo are not actually valid CoNLL-U according to the UD specifications.

This is a bad situation because the extension may lead people to believe that standard CoNLL-U readers can parse the files, and that is not the case for now. I also don't like having to deal with a variable number of columns per sentence. The format you suggest above seems to be very concise and it can be encoded in the MISC column.
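As a rough illustration of the problem, a few lines of Python are enough to spot token lines that a standard 10-column reader would reject. This is only a column-count check under the assumption of tab-separated files, not a full validator, and `column_counts` is a hypothetical name.

```python
def column_counts(lines):
    """Map line number -> column count for token lines without exactly 10 columns."""
    bad = {}
    for n, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank separators and comments carry no columns
        n_cols = len(line.split("\t"))
        if n_cols != 10:
            bad[n] = n_cols
    return bad
```

An empty result means every token line has the 10 columns that CoNLL-U expects; it says nothing about whether the values inside them are well-formed.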

Other options are:

encode the predicates in the sentence metadata
adopt CoNLL-U Plus

@huaiyu-zhu , @yunyaoli ?

nschneid commented 4 years ago

Or choose a different extension, e.g. .conllusrl?

alanakbik commented 4 years ago

I would probably vote against encoding this information in the sentence metadata. CoNLL-U Plus or changing the extension are good solutions, but best might be to have this in valid CoNLL-U format, since this is what most people/tools use.

So I like your way of encoding SRL in the MISC column; perhaps the readability could just be improved by encoding the arguments with an ID-pointer system like in the Finnish Propbank or the enhanced dependency graph (which uses head-deprel pairs)?
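One possible reading of this pointer idea, sketched in Python: each argument stores the ID of its predicate plus the role, in the spirit of the head:deprel pairs of the enhanced dependencies column. The `head:role` notation, the comma separator, and the function name `to_pointers` are assumptions for illustration, not the Finnish Propbank format. The input is the per-predicate `Args=_/_/ARG1` style from the sheikh example, reduced to the tokens that carry a label.

```python
def to_pointers(rows):
    """Convert per-predicate Args columns into 'head:role' pointers.

    rows: list of (token_id, args) with args such as '_/_/ARG1'.
    Returns {token_id: 'head:role,head:role'}, or '_' if the token fills no role.
    """
    split_rows = [(tid, args.split("/")) for tid, args in rows]
    # Column k belongs to the token that carries 'V' in column k.
    heads = {}
    for tid, cols in split_rows:
        for k, label in enumerate(cols):
            if label == "V":
                heads[k] = tid
    result = {}
    for tid, cols in split_rows:
        pairs = [
            f"{heads[k]}:{label}"
            for k, label in enumerate(cols)
            if label not in ("_", "V") and k in heads
        ]
        result[tid] = ",".join(pairs) or "_"
    return result


rows = [(2, "_/_/ARG1"), (7, "V/_/_"), (8, "_/V/_"), (9, "_/_/V"), (17, "_/_/ARGM-MNR")]
print(to_pointers(rows))
# {2: '9:ARG1', 7: '_', 8: '_', 9: '_', 17: '9:ARGM-MNR'}
```

With this encoding the value no longer grows with the number of predicates in the sentence, and the predicate itself is still identified by its Roleset item. A comma is used between pairs because `|` already separates MISC items.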
