UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Lemmas for typos #471

Closed by AngledLuffa 8 months ago

AngledLuffa commented 8 months ago

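Found some inconsistencies in the lemmas for typos. For example, definately gets the corrected lemma definitely (with Typo=Yes and CorrectForm) in only one of these three cases: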

# sent_id = weblog-blogspot.com_rigorousintuition_20060511134300_ENG_20060511_134300-0260
# newpar id = weblog-blogspot.com_rigorousintuition_20060511134300_ENG_20060511_134300-p0065
# text = Of the metal music farce, Ministry has definately made in-road into 'Loose Change"-"Alex Jones" territory with their new record, Rio Grande Blood.
9       definately      definately      ADV     RB      _       10      advmod  10:advmod       _

# sent_id = answers-20111106215236AAycANO_ans-0006
# text = You can definately plan on a great meal with reasonable prices.
3       definately      definately      ADV     RB      _       4       advmod  4:advmod        _

# sent_id = reviews-231203-0006
# text = However, the bartenders/waitresses definately need to be re-trained (if they ever had any to begin with) and learn two ...
7       definately      definitely      ADV     RB      Typo=Yes        8       advmod  8:advmod        CorrectForm=definitely

Similarly, recieved is sometimes corrected (lemma receive, Typo=Yes, CorrectForm=received) and sometimes not (lemma recieve):

25      recieved        recieve VERB    VBN     Tense=Past|VerbForm=Part        11      conj    11:conj|51:advcl:if     _
3       recieved        recieve VERB    VBD     Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   9       advcl   9:advcl:since   _
24      recieved        receive VERB    VBD     Mood=Ind|Number=Sing|Person=1|Tense=Past|Typo=Yes|VerbForm=Fin  22      acl:relcl       22:acl:relcl    CorrectForm=received
10      recieved        receive VERB    VBD     Mood=Ind|Number=Plur|Person=1|Tense=Past|Typo=Yes|VerbForm=Fin  3       advcl   3:advcl:when    CorrectForm=received
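
For anyone wanting to hunt for more cases like these, here is a minimal sketch (not the method used to find the examples above) that groups tokens by lowercased form and reports forms that carry more than one (lemma, Typo) annotation, at least one of them with Typo=Yes. The function names are hypothetical, and it assumes standard tab-separated 10-column CoNLL-U input.

```python
from collections import defaultdict


def read_tokens(lines):
    """Yield (sent_id, form, lemma, feats) for each regular token line."""
    sent_id = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):]
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed lines, multiword token ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        yield sent_id, cols[1], cols[2], cols[5]


def inconsistent_typo_lemmas(lines):
    """Return {form: {(lemma, has_typo): [sent_ids]}} for forms that have
    more than one distinct annotation, at least one with Typo=Yes."""
    seen = defaultdict(lambda: defaultdict(list))
    for sent_id, form, lemma, feats in read_tokens(lines):
        has_typo = "Typo=Yes" in feats.split("|")
        seen[form.lower()][(lemma, has_typo)].append(sent_id)
    return {
        form: dict(variants)
        for form, variants in seen.items()
        if len(variants) > 1 and any(typo for _, typo in variants)
    }
```

The function only iterates over lines, so an open file handle for a treebank file can be passed directly. Expect some legitimate hits too: a form that is a real word in one sentence and a typo for another word elsewhere will also be reported, so the output needs manual review.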

Oct as an abbreviation when not treated as a date (not sure why, honestly, usually lemmatized as October). Happens 3x:

# sent_id = email-enronsent24_01-0014
# text = Mary <<MEH-risk Oct 20>>
1       Mary    Mary    PROPN   NNP     Number=Sing     0       root    0:root  _
2       <<      <<      PUNCT   -LRB-   _       3       punct   3:punct SpaceAfter=No
3       MEH-risk        meh-riskoct20   NOUN    GW      Number=Sing|Typo=Yes    1       parataxis       1:parataxis     _
4       Oct     _       X       GW      _       3       goeswith        3:goeswith      _
5       20      _       X       NN      _       3       goeswith        3:goeswith      SpaceAfter=No
6       >>      >>      PUNCT   -RRB-   _       3       punct   3:punct _

# sent_id = email-enronsent24_01-0015
# newpar id = email-enronsent24_01-p0004
# text = - MEH-risk Oct 20.doc
1       -       -       PUNCT   NFP     _       2       punct   2:punct _
2       MEH-risk        meh-risk        NOUN    GW      _       0       root    0:root  _
3       Oct     oct     X       GW      _       2       flat    2:flat  _
4       20.doc  20.doc  X       NN      Number=Sing     2       flat    2:flat  _

This patients should probably be a typo for patience? Or maybe it's meant to be a pun, in which case the lemma should be patient (it is currently patients, tagged NN):

# sent_id = answers-20111108111031AARG57j_ans-0016
# text = Yes, lots of time and lots of patients...
1       Yes     yes     INTJ    UH      _       3       discourse       3:discourse     SpaceAfter=No
2       ,       ,       PUNCT   ,       _       1       punct   1:punct _
3       lots    lot     NOUN    NNS     Number=Plur     0       root    0:root  _
4       of      of      ADP     IN      _       5       case    5:case  _
5       time    time    NOUN    NN      Number=Sing     3       nmod    3:nmod:of       _
6       and     and     CCONJ   CC      _       7       cc      7:cc    _
7       lots    lot     NOUN    NNS     Number=Plur     3       conj    3:conj:and      _
8       of      of      ADP     IN      _       9       case    9:case  _
9       patients        patients        NOUN    NN      Number=Sing     7       nmod    7:nmod:of       SpaceAfter=No
10      ...     ...     PUNCT   .       _       3       punct   3:punct SpaceAfter=No

And then this enchiladas is really an NNS (it is tagged NN with Number=Sing here), and its lemma should be the singular enchilada:

# sent_id = reviews-150192-0003
# text = The sauce was dry and the enchiladas did not taste good.at all.
1       The     the     DET     DT      Definite=Def|PronType=Art       2       det     2:det   _
2       sauce   sauce   NOUN    NN      Number=Sing     4       nsubj   4:nsubj _
3       was     be      AUX     VBD     Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   4       cop     4:cop   _
4       dry     dry     ADJ     JJ      Degree=Pos      0       root    0:root  _
5       and     and     CCONJ   CC      _       10      cc      10:cc   _
6       the     the     DET     DT      Definite=Def|PronType=Art       7       det     7:det   _
7       enchiladas      enchiladas      NOUN    NN      Number=Sing     10      nsubj   10:nsubj|11:nsubj:xsubj _
8       did     do      AUX     VBD     Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   10      aux     10:aux  _
9       not     not     PART    RB      _       10      advmod  10:advmod       _
10      taste   taste   VERB    VB      VerbForm=Inf    4       conj    4:conj:and      _
11      good    good    ADJ     JJ      Degree=Pos      10      xcomp   10:xcomp        SpaceAfter=No
12      .       .       PUNCT   .       _       10      punct   10:punct        SpaceAfter=No
13      at      at      ADP     IN      _       14      case    14:case _
14      all     all     DET     DT      PronType=Tot    10      obl     10:obl:at       SpaceAfter=No
15      .       .       PUNCT   .       _       4       punct   4:punct _
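
A crude heuristic sketch for catching this kind of error: flag NOUN tokens whose lemma equals the form, ends in "s", and whose s-less version is attested as a NOUN lemma elsewhere in the same data. The function names are hypothetical, the input is assumed to be tab-separated CoNLL-U, and the rule is only a filter for manual review (words like means or species can trigger false positives).

```python
def read_nouns(lines):
    """Yield (sent_id, form, lemma, xpos) for each token with UPOS NOUN."""
    sent_id = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):]
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Keep only well-formed regular token lines.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[3] == "NOUN":
            yield sent_id, cols[1], cols[2], cols[4]


def suspicious_plural_lemmas(lines):
    """Return [(sent_id, form, xpos)] for nouns whose lemma looks plural."""
    rows = list(read_nouns(lines))
    noun_lemmas = {lemma.lower() for _, _, lemma, _ in rows}
    hits = []
    for sent_id, form, lemma, xpos in rows:
        low = lemma.lower()
        # Lemma identical to the form, ends in "s", and the form without
        # the final "s" is itself a known noun lemma.
        if low == form.lower() and low.endswith("s") and low[:-1] in noun_lemmas:
            hits.append((sent_id, form, xpos))
    return hits
```

This only finds a case when the singular lemma occurs somewhere else in the data, so a noun that never appears in the singular would be missed.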

I would send in a PR, but it's the data freeze, after all.

AngledLuffa commented 8 months ago

Also, in the test set, there's reciept (for receipt).

nschneid commented 8 months ago

Thanks for pointing these out. Let's fix them after the data freeze.

> Oct as an abbreviation when not treated as a date (not sure why, honestly, usually lemmatized as October). Happens 3x:

Just in filenames, when tagged as X, maybe? We should really figure out a consistent policy for filenames; I'm not sure we want to analyze them as containing individual "words" in the first place (UniversalDependencies/docs#666).

> this should probably be a typo for patience?

Yes, it is spelled correctly 3 sentences later

AngledLuffa commented 8 months ago

> > this should probably be a typo for patience?
>
> Yes, it is spelled correctly 3 sentences later

You could say I wasn't ... determined? ... enough to find that