UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

mismatches in the tokenization of UD EWT vs Propbank EWT #283

Closed arademaker closed 2 years ago

arademaker commented 2 years ago

Comparing UD EWT with PropBank EWT, I found 10 tokenization mismatches. In all cases in the first code block below, I believe UD EWT is doing the right thing. In answers-20111107175720AAlb2TB_ans-0015, maybe we need CorrectForm=basically in the MISC column.
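A comparison of this kind can be sketched roughly as follows. This is only an illustration, not the script actually used for this issue: the function names are hypothetical, and it assumes that both sides spell the same character sequence once soft hyphens (U+00AD) are ignored.

```python
SOFT_HYPHEN = "\u00ad"  # invisible character, as in basic<U+00AD>ally


def normalize(token):
    """Drop soft hyphens so both sides spell the same character sequence."""
    return token.replace(SOFT_HYPHEN, "")


def tokenization_mismatches(ud_tokens, pb_tokens):
    """Return (ud_chunk, pb_chunk) pairs where the two tokenizations differ.

    Both inputs are lists of surface tokens for the same sentence.
    """
    ud_norm = [normalize(t) for t in ud_tokens]
    pb_norm = [normalize(t) for t in pb_tokens]
    if "".join(ud_norm) != "".join(pb_norm):
        raise ValueError("the two token lists do not spell the same text")

    mismatches = []
    i = j = 0
    while i < len(ud_tokens) and j < len(pb_tokens):
        ud_chunk, pb_chunk = [ud_tokens[i]], [pb_tokens[j]]
        ud_len, pb_len = len(ud_norm[i]), len(pb_norm[j])
        i += 1
        j += 1
        # Extend the shorter side until both chunks cover the same characters.
        while ud_len != pb_len:
            if ud_len < pb_len:
                ud_chunk.append(ud_tokens[i])
                ud_len += len(ud_norm[i])
                i += 1
            else:
                pb_chunk.append(pb_tokens[j])
                pb_len += len(pb_norm[j])
                j += 1
        # Anything other than a one-to-one chunk is a tokenization mismatch.
        if len(ud_chunk) != 1 or len(pb_chunk) != 1:
            mismatches.append((ud_chunk, pb_chunk))
    return mismatches


# Case 3 from this issue: UD keeps "4-ever" whole, PropBank splits it.
print(tokenization_mismatches(
    ["I", "will", "4-ever", "be"],
    ["I", "will", "4", "-", "ever", "be"],
))
# [(['4-ever'], ['4', '-', 'ever'])]
```

Reporting the original tokens rather than the normalized ones keeps the soft hyphen visible in the output for cases like the first one below.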

1) 
# sent_id = answers-20111107175720AAlb2TB_ans-0015
# text = The Irish weather works on the ‘four seasons in a day’ principle, which basic<U+00AD>ally means that you can’t predict a thing when it comes to the behaviour of the sky.
...
17      basic<U+00AD>ally       basically       ADV     RB      _       18      advmod  18:advmod       _
...

...
google/ewt/answers/00/20111107175720AAlb2TB_ans.xml  14   16        basic    GW            (S(ADVP*         -            -        *   (ARGM-ADV*             *             *
google/ewt/answers/00/20111107175720AAlb2TB_ans.xml  14   17         ally    RB                   *)        -            -        *            *)            *             *
...

2)
# sent_id = answers-20111108110012AAK8Azy_ans-0029
# newpar id = answers-20111108110012AAK8Azy_ans-p0004
# text = mm it depends on the size of his tank or cage or whatever you have him in..atleast as far as i'm concerned.
...
19      at      at      ADP     IN      ExtPos=ADV      22      advmod  22:advmod       CorrectSpaceAfter=Yes|SpaceAfter=No
20      least   least   ADJ     JJS     Degree=Sup      19      fixed   19:fixed        _
...

...
google/ewt/answers/00/20111108110012AAK8Azy_ans.xml  28   18      atleast     RB    (ADVP(ADVP*         -            -    (ARGM-ADV*         *         *         *
...

3)
# sent_id = reviews-042012-0006
# text = I will 4-ever be eternally grateful for their hospitality and luv that my Sicilian family showed me when I was there for 3 years.
1       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      6       nsubj   6:nsubj _
2       will    will    AUX     MD      VerbForm=Fin    6       aux     6:aux   _
3       4-ever  forever ADV     RB      Abbr=Yes        6       advmod  6:advmod        _
...

google/ewt/reviews/00/042012.xml  5    0              I    PRP      (TOP(S(NP*)         -             -        (ARG1*)       (ARG0*)           *             *
google/ewt/reviews/00/042012.xml  5    1           will     MD            (VP*          -             -    (ARGM-MOD*)   (ARGM-MOD*)           *             *
google/ewt/reviews/00/042012.xml  5    2             4      GW          (ADVP*          -             -    (ARGM-TMP*    (ARGM-TMP*            *             *
google/ewt/reviews/00/042012.xml  5    3             -    HYPH               *          -             -             *             *            *             *
google/ewt/reviews/00/042012.xml  5    4           ever     RB               *)         -             -             *)            *)           *             *
...

4)
# sent_id = answers-20111107094641AATmyb6_ans-0036
# text = Have a good trip where-ever you end up.
...
5       where-ever      wherever        ADV     WRB     Typo=Yes        7       advmod  7:advmod        CorrectForm=wherever
...

...
google/ewt/answers/00/20111107094641AATmyb6_ans.xml  35    4    where     GW  (SBAR(WHADVP*      -           -             *        (ARG1*    (ARG2*
google/ewt/answers/00/20111107094641AATmyb6_ans.xml  35    5       -    HYPH              *      -           -             *             *         *
google/ewt/answers/00/20111107094641AATmyb6_ans.xml  35    6     ever    WRB              *)     -           -             *             *         *)
...

5)
# sent_id = email-enronsent32_01-0050
# text = My new address will be: Mary Hain Senior Regulatory Counsel ISO New England Inc.One Sullivan Road Holyoke, MA 01040-2841 (413) 535-4000 mhain@ISO-NE.com
...
15      Inc.    Inc.    PROPN   NNP     Number=Sing     8       list    8:list  CorrectSpaceAfter=Yes|SpaceAfter=No
16      One     One     NUM     NNP     NumType=Card    17      nummod  17:nummod       _
...

...
google/ewt/email/00/enronsent32_01.xml  49   14             Inc.One     NNP           *)   -       -             *
...

For the next block, I just want to confirm the decisions in UD EWT:

6)
# sent_id = reviews-096340-0002
# text = Hancocks is one of four fabric stores in Fort Smith.
1-2     Hancocks        _       _       _       _       _       _       _       _
1       Hancock Hancock PROPN   NNP     Number=Sing     4       nsubj   4:nsubj _
2       s       's      PART    POS     Typo=Yes        1       case    1:case  CorrectForm='s
...

google/ewt/reviews/00/096340.xml  1    0    Hancocks   NNP   (TOP(S(NP*)   -       -    (ARG1*)
google/ewt/reviews/00/096340.xml  1    1          is   VBZ         (VP*    be  be.01       (V*)
...

7)
# sent_id = reviews-009389-0003
# text = it is now bislas.
1       it      it      PRON    PRP     Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  4       nsubj   4:nsubj _
2       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4       cop     4:cop   _
3       now     now     ADV     RB      _       4       advmod  4:advmod        _
4-5     bislas  _       _       _       _       _       _       _       SpaceAfter=No
4       bisla   bisla   PROPN   NNP     Number=Sing     0       root    0:root  _
5       s       's      PART    POS     Typo=Yes        4       case    4:case  CorrectForm='s
6       .       .       PUNCT   .       _       4       punct   4:punct _

google/ewt/reviews/00/009389.xml  2   0        it   PRP   (TOP(S(NP*)   -       -        (ARG1*)
google/ewt/reviews/00/009389.xml  2   1        is   VBZ         (VP*    be  be.01           (V*)
google/ewt/reviews/00/009389.xml  2   2       now    RB       (ADVP*)   -       -    (ARGM-TMP*)
google/ewt/reviews/00/009389.xml  2   3    bislas   NNP        (NP*))   -       -        (ARG2*)
google/ewt/reviews/00/009389.xml  2   4         .     .           *))   -       -             *

8)
# sent_id = reviews-209465-0001
# newpar id = reviews-209465-p0001
# text = Pho-nomenal!!
1       Pho-nomenal     phenomenal      ADJ     JJ      Degree=Pos|Style=Expr   0       root    0:root  CorrectForm=Phenomenal|SpaceAfter=No
2       !!      !!      PUNCT   .       _       1       punct   1:punct _

google/ewt/reviews/00/209465.xml  0   0        Pho     NN  (TOP(FRAG(ADJP*   -   -
google/ewt/reviews/00/209465.xml  0   1         -    HYPH                *   -   -
google/ewt/reviews/00/209465.xml  0   2    nomenal     JJ                *)  -   -
google/ewt/reviews/00/209465.xml  0   3         !!      .               *))  -   -

9)
# sent_id = answers-20111108104724AAuBUR7_ans-0032
# text = Worse comes to worse, I'm out of the $$ and have to put him down anyway or have a pretty pasture ornament.
...
11      $$      $$      SYM     $       _       0       root    0:root  _
...

...
google/ewt/answers/00/20111108104724AAuBUR7_ans.xml  31   10           $     $              *      -         -         *            *      *             *            *
google/ewt/answers/00/20111108104724AAuBUR7_ans.xml  31   11           $     $           *))))     -         -         *            *)     *             *            *
...

10)
# sent_id = answers-20111106213308AA5Nh2g_ans-0008
# text = You voted on the Dominion Posts website.
1       You     you     PRON    PRP     Case=Nom|Person=2|PronType=Prs  2       nsubj   2:nsubj _
2       voted   vote    VERB    VBD     Mood=Ind|Number=Sing|Person=2|Tense=Past|VerbForm=Fin   0       root    0:root  _
3       on      on      ADP     IN      _       8       case    8:case  _
4       the     the     DET     DT      Definite=Def|PronType=Art       8       det     8:det   _
5       Dominion        Dominion        PROPN   NNP     Number=Sing     6       compound        6:compound      _
6-7     Posts   _       _       _       _       _       _       _       _
6       Post    Post    PROPN   NNP     Number=Sing     8       compound        8:compound      _
7       s       's      PART    POS     Typo=Yes        6       case    6:case  CorrectForm='s
8       website website NOUN    NN      Number=Sing     2       obl     2:obl:on        SpaceAfter=No
9       .       .       PUNCT   .       _       2       punct   2:punct _

google/ewt/answers/00/20111106213308AA5Nh2g_ans.xml  7   0         You    PRP   (TOP(S(NP*)     -         -       (ARG0*)
google/ewt/answers/00/20111106213308AA5Nh2g_ans.xml  7   1       voted    VBD         (VP*    vote  vote.01          (V*)
google/ewt/answers/00/20111106213308AA5Nh2g_ans.xml  7   2          on     IN         (PP*      -         -   (ARGM-LOC*
google/ewt/answers/00/20111106213308AA5Nh2g_ans.xml  7   3         the     DT         (NP*      -         -            *
google/ewt/answers/00/20111106213308AA5Nh2g_ans.xml  7   4    Dominion    NNP        (NML*      -         -            *
google/ewt/answers/00/20111106213308AA5Nh2g_ans.xml  7   5       Posts   NNPS            *)     -         -            *
google/ewt/answers/00/20111106213308AA5Nh2g_ans.xml  7   6     website     NN          *)))     -         -            *)
google/ewt/answers/00/20111106213308AA5Nh2g_ans.xml  7   7           .      .           *))     -         -            *
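
In cases 6, 7 and 10 the difference is only at the level of syntactic words: the multiword token lines (`1-2`, `4-5`, `6-7`) carry the same surface form that PropBank has. A minimal sketch of reading surface tokens from CoNLL-U, assuming tab-separated columns and using a hypothetical helper name, could look like this:

```python
def surface_tokens(conllu_lines):
    """Collect surface tokens from the lines of one CoNLL-U sentence.

    Multiword token ranges (IDs such as "4-5") contribute their own form,
    and the syntactic words they cover are skipped. Empty nodes (IDs such
    as "8.1") are skipped as well.
    """
    tokens = []
    skip_until = 0  # last word ID covered by the current multiword token
    for line in conllu_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "." in tok_id:
            continue
        if "-" in tok_id:
            tokens.append(form)
            skip_until = int(tok_id.split("-")[1])
            continue
        if int(tok_id) <= skip_until:
            continue
        tokens.append(form)
    return tokens


# Case 7 from this issue, reduced to the ID and FORM columns for brevity.
sample = [
    "# text = it is now bislas.",
    "1\tit",
    "2\tis",
    "3\tnow",
    "4-5\tbislas",
    "4\tbisla",
    "5\ts",
    "6\t.",
]
print(surface_tokens(sample))
# ['it', 'is', 'now', 'bislas', '.']
```

Compared this way, case 7 matches the PropBank tokens one-to-one, so only the word-level split differs.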

nschneid commented 2 years ago

Seems like the issue is on the PropBank side.