UniversalDependencies / UD_English-GUM

Lemma for swing is swe #68

Closed AngledLuffa closed 10 months ago

AngledLuffa commented 12 months ago

There's a sentence with a lemma of swe for swing.

Makes me wonder if this originally used CoreNLP, since there was a bug where it was lemmatizing swing into swe. Now this is causing Stanza to do the same thing. A self-referential data error...

# sent_id = GUM_textbook_sociology-49
# s_prominence = 2
# s_type = decl
# transition = establishment
# text = Sometimes when people attempt to rectify feelings of ethnocentrism and develop cultural relativism, they swing too far to the other end of the spectrum.
# newpar
# newpar_block = p (4 s)
1       Sometimes       sometimes       ADV     RB      _       16      advmod  16:advmod       Discourse=context-background:128->132:2
2       when    when    ADV     WRB     PronType=Int    4       advmod  4:advmod        Discourse=context-circumstance:129->131:0
3       people  person  NOUN    NNS     Number=Plur     4       nsubj   4:nsubj|6:nsubj:xsubj|11:nsubj:xsubj    Entity=(225-person-new-cf1-1-coref)
4       attempt attempt VERB    VBP     Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin   16      advcl   16:advcl:when   _
5       to      to      PART    TO      _       6       mark    6:mark  _
6       rectify rectify VERB    VB      VerbForm=Inf    4       xcomp   4:xcomp _
7       feelings        feeling NOUN    NNS     Number=Plur     6       obj     6:obj   Entity=(226-abstract-new-cf3-1-sgl
8       of      of      ADP     IN      _       9       case    9:case  _
9       ethnocentrism   ethnocentrism   NOUN    NN      Number=Sing     7       nmod    7:nmod:of       Entity=(227-abstract-new-cf6-1-coref)226)
10      and     and     CCONJ   CC      _       11      cc      11:cc   Discourse=joint-list_m:130->129:0
11      develop develop VERB    VB      VerbForm=Inf    6       conj    4:xcomp|6:conj:and      _
12      cultural        cultural        ADJ     JJ      Degree=Pos      13      amod    13:amod Entity=(2-abstract-giv:act-cf2*-2-coref
13      relativism      relativism      NOUN    NN      Number=Sing     11      obj     11:obj  Entity=2)|SpaceAfter=No
14      ,       ,       PUNCT   ,       _       4       punct   4:punct _
15      they    they    PRON    PRP     Case=Nom|Number=Plur|Person=3|PronType=Prs      16      nsubj   16:nsubj        Discourse=same-unit_m:131->128:0|Entity=(225-person-giv:act-cf1-1-ana)
16      swing   swe     VERB    VBP     Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin   0       root    0:root  _
17      too     too     ADV     RB      Degree=Pos      18      advmod  18:advmod       _
18      far     far     ADV     RB      Degree=Pos      16      advmod  16:advmod       _
19      to      to      ADP     IN      _       22      case    22:case _
20      the     the     DET     DT      Definite=Def|PronType=Art       22      det     22:det  Entity=(228-abstract-new-cf4-3-sgl
21      other   other   ADJ     JJ      Degree=Pos      22      amod    22:amod _
22      end     end     NOUN    NN      Number=Sing     18      obl     18:obl:to       _
23      of      of      ADP     IN      _       25      case    25:case _
24      the     the     DET     DT      Definite=Def|PronType=Art       25      det     25:det  Entity=(229-abstract-new-cf5-2-sgl
25      spectrum        spectrum        NOUN    NN      Number=Sing     22      nmod    22:nmod:of      Entity=229)228)|SpaceAfter=No
26      .       .       PUNCT   .       _       16      punct   16:punct        _

amir-zeldes commented 11 months ago

Hehe, that's detailed knowledge there! But looking at the specific document, this can't be CoreNLP, which was indeed used to generate the base lemmatization before manual correction up through maybe GUM v3 or so. This is a textbook document, and the dateCollected shows it was only added in 2021, so this was almost certainly lemmatized by Stanza itself (could be that same self-referential error...)

Most of these kinds of errors get weeded out during annotation, or they're caught later when we compare the output of multiple taggers and adjudicate the disagreements, but this one was missed. Will fix!
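
For anyone who wants to scan for this class of error, here is a minimal sketch. It is not the GUM team's actual tooling, which the thread does not describe. It relies on a heuristic that is an assumption on my part: for English tokens tagged VB or VBP, the lemma should normally equal the lowercased form, apart from a handful of irregular or contracted forms. The function names and the `ALLOWED_FORMS` list are hypothetical and the list is certainly incomplete. The input is assumed to be standard tab-separated CoNLL-U with 10 columns.

```python
# Hypothetical allowlist (an assumption, not exhaustive): VB/VBP forms whose
# lemma legitimately differs from the lowercased surface form.
ALLOWED_FORMS = {"am", "are", "'m", "'re", "'ve", "ai"}


def iter_tokens(conllu_text):
    """Yield (sent_id, token_id, form, lemma, xpos) for each regular token."""
    sent_id = None
    for line in conllu_text.splitlines():
        if not line:
            continue
        if line.startswith("#"):
            if line.startswith("# sent_id"):
                sent_id = line.split("=", 1)[1].strip()
            continue
        cols = line.split("\t")
        # Skip malformed lines, multiword ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        yield sent_id, cols[0], cols[1], cols[2], cols[4]


def flag_suspicious_base_verbs(conllu_text, allowed=ALLOWED_FORMS):
    """Return VB/VBP tokens whose lemma differs from the lowercased form."""
    flagged = []
    for sent_id, tok_id, form, lemma, xpos in iter_tokens(conllu_text):
        if xpos not in {"VB", "VBP"}:
            continue
        if form.lower() != lemma and form.lower() not in allowed:
            flagged.append((sent_id, tok_id, form, lemma))
    return flagged
```

Run over a treebank file's contents, this would list candidates such as swing/swe for manual review. Expect some false positives, since the allowlist above is only illustrative.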