UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Inconsistent "about <time>" #515

Closed AngledLuffa closed 3 months ago

AngledLuffa commented 3 months ago

Came across the following:

# sent_id = email-enronsent17_01-0037
# text = What about 7:00 at the office or breakfast meeting at 7:00?
1       What    what    PRON    WP      PronType=Int    0       root    0:root  _
2       about   about   ADP     IN      _       3       case    3:case  _
3       7:00    7:00    NUM     CD      NumForm=Digit|NumType=Card      1       nmod    1:nmod:about    _
4       at      at      ADP     IN      _       6       case    6:case  _
5       the     the     DET     DT      Definite=Def|PronType=Art       6       det     6:det   _
6       office  office  NOUN    NN      Number=Sing     3       nmod    3:nmod:at       _
7       or      or      CCONJ   CC      _       9       cc      9:cc    _
8       breakfast       breakfast       NOUN    NN      Number=Sing     9       compound        9:compound      _
9       meeting meeting NOUN    NN      Number=Sing     3       conj    1:nmod:about|3:conj:or  _
10      at      at      ADP     IN      _       11      case    11:case _
11      7:00    7:00    NUM     CD      NumForm=Digit|NumType=Card      9       nmod    9:nmod:at       SpaceAfter=No
12      ?       ?       PUNCT   .       _       1       punct   1:punct _

vs

# sent_id = email-enronsent17_01-0043
7       at      at      ADP     IN      _       9       case    9:case  _
8       about   about   ADV     RB      _       9       advmod  9:advmod        _
9       945     945     NUM     CD      NumForm=Digit|NumType=Card      5       obl     5:obl:at        _
10      to      to      PART    TO      _       11      mark    11:mark _
11      catch   catch   VERB    VB      VerbForm=Inf    5       advcl   5:advcl:to      _
12      a       a       DET     DT      Definite=Ind|PronType=Art       13      det     13:det  _
13      plane   plane   NOUN    NN      Number=Sing     11      obj     11:obj  SpaceAfter=No

# sent_id = email-enronsent17_01-0052
# text = What about 10:30 my time?
1       What    what    PRON    WP      PronType=Int    0       root    0:root  _
2       about   about   ADP     IN      _       3       case    3:case  _
3       10:30   10:30   NUM     CD      NumForm=Digit|NumType=Card      1       nmod    1:nmod:about    _
4       my      my      PRON    PRP$    Case=Gen|Number=Sing|Person=1|Poss=Yes|PronType=Prs     5       nmod:poss       5:nmod:poss     _
5       time    time    NOUN    NN      Number=Sing     3       nmod:tmod       3:nmod:tmod     SpaceAfter=No
6       ?       ?       PUNCT   .       _       1       punct   1:punct _

This also looks different:

# sent_id = email-enronsent43_01-0017
22      about   about   ADV     RB      _       23      advmod  23:advmod       _
23      50      50      NUM     CD      NumForm=Digit|NumType=Card      24      nummod  24:nummod       SpaceAfter=No
24      ft      ft      NOUN    NNS     Number=Plur     21      obj     21:obj  _

# sent_id = newsgroup-groups.google.com_alt.animals_02c2d614bfbf6b20_ENG_20050223_232900-0054
1       There   there   PRON    EX      _       2       expl    2:expl  _
2       are     be      VERB    VBP     Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin   0       root    0:root  _
3       only    only    ADV     RB      _       5       advmod  5:advmod        _
4       about   about   ADV     RB      _       5       advmod  5:advmod        _
5       850     850     NUM     CD      NumForm=Digit|NumType=Card      6       nummod  6:nummod        _
6       wolves  wolf    NOUN    NNS     Number=Plur     2       nsubj   2:nsubj _
# sent_id = answers-20111107152509AA78ktV_ans-0009
19      I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      20      nsubj   20:nsubj        _
20      came    come    VERB    VBD     Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin   8       parataxis       8:parataxis     _
21      here    here    ADV     RB      PronType=Dem    20      advmod  20:advmod       _
22      about   about   ADV     IN      _       23      advmod  23:advmod       _
23      12      12      NUM     CD      NumForm=Digit|NumType=Card      24      nummod  24:nummod       _
24      years   year    NOUN    NNS     Number=Plur     25      obl:npmod       25:obl:npmod    _
25      ago     ago     ADV     RB      _       20      advmod  20:advmod       _
# sent_id = answers-20111108073322AA27tkh_ans-0003
28      from    from    ADP     IN      _       30      case    30:case _
29      the     the     DET     DT      Definite=Def|PronType=Art       30      det     30:det  _
30      end     end     NOUN    NN      Number=Sing     27      nmod    27:nmod:from    _
31      of      of      ADP     IN      _       34      case    34:case _
32      the     the     DET     DT      Definite=Def|PronType=Art       34      det     34:det  _
33      13th    13th    ADJ     JJ      Degree=Pos|NumForm=Combi|NumType=Ord    34      amod    34:amod _
34      century century NOUN    NN      Number=Sing     30      nmod    30:nmod:of      _
35      to      to      ADP     IN      _       37      case    37:case _
36      about   about   ADV     RB      _       37      advmod  37:advmod       _
37      1600    1600    NUM     CD      NumForm=Digit|NumType=Card      27      nmod    27:nmod:to      SpaceAfter=No

Mostly this came up because I was trying to figure out how to convert this constituent to dependencies:

                        (NP
                          (QP (IN about) (CD 900) )
                          (NN pence) )

but I can't work out a consistent pattern from flipping through EWT. It's possible, though, that the about_IN tagging in PTB doesn't follow the same standard as EWT.
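For illustration only, here is a minimal sketch of one possible conversion, assuming the approximator analysis seen in the "about 50 ft" and "about 850 wolves" examples above (about as advmod of the number, the number as nummod of the noun) also applies to this QP. The function name `qp_np_to_deps` and its input format are hypothetical, not part of CoreNLP or any UD tool, and the sketch does not decide what upos/xpos "about" should receive.

```python
def qp_np_to_deps(qp_tokens, head_noun):
    """Attach an approximator + number QP under the head noun of an NP.

    qp_tokens: list of (form, xpos) in surface order,
               e.g. [("about", "IN"), ("900", "CD")]
    head_noun: (form, xpos) of the NP head that follows the QP
    Returns one dict per token with 1-based ids. The noun gets head 0 and
    deprel "_" because its attachment depends on the context outside the NP.
    """
    tokens = list(qp_tokens) + [head_noun]
    noun_id = len(tokens)
    # Assumption: the last CD inside the QP is the number being approximated.
    # Raises ValueError if the QP contains no CD at all.
    num_id = max(i for i, (_, xpos) in enumerate(qp_tokens, start=1) if xpos == "CD")

    rows = []
    for i, (form, xpos) in enumerate(tokens, start=1):
        if i == noun_id:
            head, deprel = 0, "_"            # left for the caller to fill in
        elif i == num_id:
            head, deprel = noun_id, "nummod"  # 900 -> pence
        else:
            head, deprel = num_id, "advmod"   # about -> 900
        rows.append({"id": i, "form": form, "xpos": xpos,
                     "head": head, "deprel": deprel})
    return rows


rows = qp_np_to_deps([("about", "IN"), ("900", "CD")], ("pence", "NN"))
for row in rows:
    print(row["id"], row["form"], row["xpos"], row["head"], row["deprel"], sep="\t")
```

Whether this pattern is the right target for every PTB QP is exactly the open question here; the sketch only encodes the "about NUM NOUN" shape found in the EWT excerpts.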

nschneid commented 3 months ago

"What about X" is a way to ask a question, where X may be anything—no approximation involved. So it makes sense that that has a different structure from "at about TIME". Is that the main difference you noticed?

AngledLuffa commented 3 months ago

Ok, that makes sense. However, in "I came here about 12 years ago", "about" has xpos IN and upos ADV. Normally I would expect ADP as the upos for IN. There are 16 total cases between train/dev/test with IN and ADV, though.

Is this a case where it'd be fine for the CoreNLP converter to editorialize the tags to be ADV even though there's an IN xpos (https://github.com/UniversalDependencies/docs/issues/717) or would it make more sense to unify the tags in the EWT treebank? Personally I would go with the latter and accept that there will be a few unfixable validation errors in the converter.
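A rough way to list such upos/xpos combinations, assuming standard tab-separated CoNLL-U input, is sketched below. `find_tag_combos` is a hypothetical helper written for this example, not part of any UD tooling, and the sample rows are copied from the excerpt above with tabs restored.

```python
def find_tag_combos(conllu_text, upos, xpos):
    """Yield (sent_id, token_id, form) for every word with the given UPOS/XPOS pair."""
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):]
            continue
        if not line or line.startswith("#"):
            continue  # blank sentence separators and other comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        tok_id, form, _lemma, tok_upos, tok_xpos = cols[:5]
        if "-" in tok_id or "." in tok_id:
            continue  # skip multiword token ranges and empty nodes
        if tok_upos == upos and tok_xpos == xpos:
            yield sent_id, tok_id, form


sample = "\n".join([
    "# sent_id = answers-20111107152509AA78ktV_ans-0009",
    "\t".join(["21", "here", "here", "ADV", "RB", "PronType=Dem",
               "20", "advmod", "20:advmod", "_"]),
    "\t".join(["22", "about", "about", "ADV", "IN", "_",
               "23", "advmod", "23:advmod", "_"]),
    "\t".join(["23", "12", "12", "NUM", "CD", "NumForm=Digit|NumType=Card",
               "24", "nummod", "24:nummod", "_"]),
])

hits = list(find_tag_combos(sample, upos="ADV", xpos="IN"))
print(hits)
```

To scan a real treebank, the same function could be run over the text of each train/dev/test file in turn and the hits concatenated.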

nschneid commented 3 months ago

EWT has just 4 of these ADV/IN combos: https://universal.grew.fr/?custom=6609ff40eada0

I think they're errors, will fix.

nschneid commented 3 months ago

Made a more general issue: #516