UniversalDependencies / UD_English-GUM

NumType / NumForm for phone numbers #59

Closed AngledLuffa closed 1 year ago

AngledLuffa commented 2 years ago

Should phone numbers have their own NumType & NumForm?

For example:

# sent_id = GUM_voyage_fortlee-34
# s_prominence = 2
# s_type = frag
# transition = null
# text = Babe's 185 Bridge Plaza North, tel: 1 201 944-6800.
# newpar
# newpar_block = item (1 s)
1-2     Babe's  _       _       _       _       _       _       _       _
1       Babe    Babe    PROPN   NNP     Number=Sing     0       root    0:root  Bridge=106<110|Discourse=joint-list_m:71->69:1|Entity=(110-organization-acc:inf-cf2-1-sgl|XML=<hi rend:::"bold"><ref>
2       's      's      PART    POS     _       1       case    1:case  Entity=110)|XML=</ref></hi>
3       185     185     NUM     CD      NumForm=Digit|NumType=Card      4       nummod  4:nummod        Entity=(111-place-new-cf3-2,3,4-sgl
4       Bridge  Bridge  PROPN   NNP     Number=Sing     5       compound        5:compound      _
5       Plaza   Plaza   PROPN   NNP     Number=Sing     1       list    1:list  _
6       North   North   ADJ     NNP     Degree=Pos|Number=Sing  5       amod    5:amod  Entity=111)|SpaceAfter=No
7       ,       ,       PUNCT   ,       _       8       punct   8:punct _
8       tel     telephone       NOUN    NN      Number=Sing     10      nsubj   10:nsubj        Entity=(112-abstract-new-cf1-1-coref)|SpaceAfter=No
9       :       :       PUNCT   :       _       8       punct   8:punct _
10      1       1       NUM     CD      NumForm=Digit|NumType=Card      1       list    1:list  Discourse=elaboration-additional:72->71:0|Entity=(112-abstract-giv:act-cf1-1,2,3-coref
11      201     201     NUM     CD      NumForm=Digit|NumType=Card      10      flat    10:flat _
12      944-6800        944-6800        NUM     CD      NumForm=Word|NumType=Card       10      flat    10:flat Entity=112)|SpaceAfter=No
13      .       .       PUNCT   .       _       1       punct   1:punct _

nschneid commented 2 years ago

Looking at NumForm and NumType, NumType=Card explicitly mentions phone numbers, and NumForm appears to be about the character set used more than the grammatical status of the word. So I think NumForm=Digit|NumType=Card is correct for phone numbers. I don't think the hyphen should trigger NumForm=Word, because the token contains no alphabetic characters.
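
As a rough illustration of that point, the following Python sketch scans CoNLL-U text for NUM tokens that are tagged NumForm=Word even though the form contains no alphabetic characters. The function names are hypothetical, this is not part of any GUM or UD tooling, and the sample rows are abbreviated from the sentence above (the MISC column is replaced with `_`).

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'A=1|B=2' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def find_suspicious_numform(conllu_text):
    """Return (id, form) for NUM tokens tagged NumForm=Word with no letters in the form."""
    hits = []
    for line in conllu_text.splitlines():
        # Skip blank lines and sentence-level comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        tok_id, form, upos, feats = cols[0], cols[1], cols[3], cols[5]
        if upos != "NUM":
            continue
        has_letter = any(ch.isalpha() for ch in form)
        if parse_feats(feats).get("NumForm") == "Word" and not has_letter:
            hits.append((tok_id, form))
    return hits


# Abbreviated rows for tokens 10-12 of GUM_voyage_fortlee-34 (MISC set to "_").
rows = [
    ["10", "1", "1", "NUM", "CD", "NumForm=Digit|NumType=Card", "1", "list", "1:list", "_"],
    ["11", "201", "201", "NUM", "CD", "NumForm=Digit|NumType=Card", "10", "flat", "10:flat", "_"],
    ["12", "944-6800", "944-6800", "NUM", "CD", "NumForm=Word|NumType=Card", "10", "flat", "10:flat", "_"],
]
sample = "# text = tel: 1 201 944-6800\n" + "\n".join("\t".join(r) for r in rows)

print(find_suspicious_numform(sample))  # [('12', '944-6800')]
```

Only token 12 is flagged, which matches the annotation questioned in this issue.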

AngledLuffa commented 2 years ago

Neat. What about version numbers (apologies for overloading the issue), such as Adobe Acrobat 3.0?

Is a phone number written as 888.123.4567 still NumForm=Digit|NumType=Card?

nschneid commented 2 years ago

I think NumForm=Digit|NumType=Card for both. Separator/chunking characters do not change that the content of the number is expressed in digits, and it is not a fraction or ordinal.
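
A minimal sketch of that rule of thumb, assuming "expressed in digits" means the form has at least one digit and no alphabetic characters. The helper name is hypothetical, and the check deliberately ignores cases such as Roman numerals, which UD labels separately.

```python
def looks_like_digit_form(form):
    """Heuristic: True if the token has a digit and no alphabetic characters.

    Separators such as '-', '.', or spaces are neither digits nor letters,
    so they do not affect the result.
    """
    has_digit = any(ch.isdigit() for ch in form)
    has_letter = any(ch.isalpha() for ch in form)
    return has_digit and not has_letter


# Examples taken from this thread, plus a spelled-out number for contrast.
for form in ["944-6800", "888.123.4567", "3.0", "twenty"]:
    print(form, looks_like_digit_form(form))
```

Under this heuristic the phone numbers and the version number all count as digit forms, while a spelled-out number such as "twenty" does not.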

AngledLuffa commented 2 years ago

Section numbers are the same, or just leave those unlabeled?

Section 3.0

nschneid commented 2 years ago

Same. Don't overthink the semantics since NumType=Card explicitly includes entity-like numbers as well as quantities.