UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

NumType for a few random cases #462

Open AngledLuffa opened 11 months ago

AngledLuffa commented 11 months ago

This is written with European style . between thousands, which is different from the rest of EWT. Generally commas are removed in the lemmas, so I suppose this should have a lemma of 10000000

# sent_id = weblog-blogspot.com_marketview_20060625150800_ENG_20060625_150800-0010
# text = Because the 10.000.000 people dying from malaria will otherwise be dead.
3       10.000.000      10.000.000      NUM     CD      NumForm=Digit|NumType=Card      4       nummod  4:nummod        _
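
A minimal sketch of the lemma normalization suggested above, assuming thousands separators can be recognized purely from the shape of the form. The helper name `normalize_lemma` and both patterns are hypothetical, not part of any EWT tooling. A single dot group such as 1.000 is left alone because it is ambiguous with a decimal value.

```python
import re

# Hypothetical patterns: comma-grouped thousands (10,000) and
# dot-grouped thousands with at least two groups (10.000.000).
COMMA_THOUSANDS = re.compile(r"\d{1,3}(?:,\d{3})+")
DOT_THOUSANDS = re.compile(r"\d{1,3}(?:\.\d{3}){2,}")


def normalize_lemma(form: str) -> str:
    """Return the form with thousands separators removed, else the form unchanged."""
    if COMMA_THOUSANDS.fullmatch(form):
        return form.replace(",", "")
    if DOT_THOUSANDS.fullmatch(form):
        return form.replace(".", "")
    # Phone numbers, section numbers, and decimals fall through untouched.
    return form


print(normalize_lemma("10.000.000"))
print(normalize_lemma("713.853.3102"))
```

Under these assumptions the phone numbers and section numbers quoted in this thread do not match either pattern, so their lemmas stay identical to their forms.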

Phone numbers are unusual:

# sent_id = email-enronsent21_02-0054
# text = Chris Abel Manager, Risk Controls Global Risk Operations chris.abel@enron.com <mailto:chris.abel@enron.com> 713.853.3102
14      713.853.3102    713.853.3102    NUM     CD      NumForm=Digit|NumType=Card      1       list    1:list  _
# sent_id = email-enronsent12_01-0010
# newpar id = email-enronsent12_01-p0005
# text = Cindy Franklin Transportation Services Work:832.676.3177 Fax: 832.676.1329 Pager: 1.888.509.3736
7       832.676.3177    832.676.3177    NUM     CD      NumForm=Digit|NumType=Card      5       appos   5:appos _

There's also section numbers...

# sent_id = email-enronsent38_01-0009
# text = 7. Delete the definition of Costs as it is already defined in Section 6.2.1.
15      6.2.1   6.2.1   NUM     CD      NumForm=Digit|NumType=Card      14      nummod  14:nummod       SpaceAfter=No

whereas the change I just submitted to update NumType=Frac for a bunch of numbers changed section numbers with 2 numbers to Frac:

# sent_id = email-enronsent44_01-0080
# newpar id = email-enronsent44_01-p0036
# text = With reference to Article 5, section 5.1 (b), are we going to propose an alternative planned outage to TAU for next year?
8       5.1     5.1     NUM     CD      NumForm=Digit|NumType=Frac      7       nummod  7:nummod        _

so I think there's some room for improvement there
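
One way to survey such cases might be to group dotted NUM tokens by their NumType value. This is only a sketch: the function name `dotted_nums_by_numtype` is hypothetical, and it assumes standard CoNLL-U input with ten tab-separated columns (the examples pasted above show the columns separated by spaces).

```python
from collections import defaultdict


def dotted_nums_by_numtype(lines):
    """Group forms of NUM tokens that contain '.' by their NumType value."""
    found = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines such as "# sent_id = ...".
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        form, upos, feats = cols[1], cols[3], cols[5]
        if upos != "NUM" or "." not in form:
            continue
        numtype = "_"  # placeholder when no NumType feature is present
        for feat in feats.split("|"):
            if feat.startswith("NumType="):
                numtype = feat.split("=", 1)[1]
        found[numtype].append(form)
    return dict(found)


sample = [
    "# sent_id = email-enronsent38_01-0009",
    "15\t6.2.1\t6.2.1\tNUM\tCD\tNumForm=Digit|NumType=Card\t14\tnummod\t14:nummod\tSpaceAfter=No",
    "8\t5.1\t5.1\tNUM\tCD\tNumForm=Digit|NumType=Frac\t7\tnummod\t7:nummod\t_",
]
print(dotted_nums_by_numtype(sample))
```

To run it over a treebank file, pass an open file handle instead of the `sample` list.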

nschneid commented 11 months ago

This looks to me like more evidence that maybe we don't need NumType=Frac and it should just be merged with NumType=Card. Trying to discern whether "." separates sections and subsections versus whole numbers and fractional parts seems like overkill for a morphological representation.

AngledLuffa commented 11 months ago

@amir-zeldes @rhdunn thoughts on this? I believe GUM has NumType=Frac

rhdunn commented 11 months ago

These are not all strictly cardinal numbers, which are defined as the counting/natural numbers [1] [2]. The ordinal numbers are equivalent for positions/ordering.

We have 3 general types/groups here:

  1. cardinal numbers (including dotted European style and phone numbers, and comma-separated) -- a downstream processor just needs to remove the dots and commas (or get that from the lemma) to get the cardinal number value.
  2. fractional/decimal numbers -- a downstream processor would need to convert these to a decimal/double value.
  3. section numbers (e.g. email-enronsent44_01-0080) -- a downstream processor would need to split these into separate cardinal numbers to get the section hierarchy to navigate.
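
The three kinds of downstream processing listed above could be sketched roughly as follows. The function `interpret` and the labels "card", "frac", and "section" are hypothetical names chosen for illustration ("section" is not an existing NumType value). The label has to be supplied by the annotation, since a form such as 5.1 does not reveal on its own which reading applies.

```python
def interpret(form: str, kind: str):
    """Convert a numeric form to a value according to a hypothetical kind label."""
    if kind == "card":
        # Dots and commas are only separators: strip them and read an integer.
        return int(form.replace(".", "").replace(",", ""))
    if kind == "frac":
        # The dot is a decimal point: read a floating-point value.
        return float(form)
    if kind == "section":
        # Each dot-separated part is its own number in the hierarchy.
        return tuple(int(part) for part in form.split("."))
    raise ValueError(f"unknown kind: {kind}")


print(interpret("10.000.000", "card"))
print(interpret("5.1", "frac"))
print(interpret("5.1", "section"))
```

The same string yields different values depending on the label, which is the distinction being asked for here.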

Note also that one of the treebanks -- I can't recall which -- has a case of hundredth marked as NumType=Frac instead of NumType=Ord.

It would be helpful if the data could differentiate these types of number. They are separate morphological features. For example, the cardinals would have a lemma that removes the dots and commas from their form, whereas the fractional and section numbers would keep them, since in the fractional case the dot is significant.

With the section case, I've also been meaning to raise an issue around the grouping of single letter abbreviations, like in A.A. Milne. -- It would be helpful if these (like the section case) are separate tokens, as they are separate words, or in the section case separate numbers.

[1] https://www.merriam-webster.com/dictionary/cardinal%20number [2] https://en.wikipedia.org/wiki/Cardinal_number

nschneid commented 11 months ago

https://github.com/UniversalDependencies/docs/issues would be a good place for discussion of the inventory of NumType values (they are not specific to English).

Tokenization tends to be language-specific. In general, for English, I would expect to separate things that are either (1) sentence-organizing punctuation (commas, quotation marks, etc.), (2) clitics, (3) hyphenated linguistic words, or (4) units often/usually written with a space between them. (Or other things where there's a well-established history of separating them in tokenizers, such as currency symbols with numbers.) Of course there will be difficult cases, but in general I do not see tokenization or UD syntax as a way to express the full "grammar" of subsystems like numerical dates or numerical section-subsection notation.

amir-zeldes commented 11 months ago

> @amir-zeldes @rhdunn thoughts on this? I believe GUM has NumType=Frac

It's a long-standing UD feature, so I would keep it. I don't think it's very difficult to recognize in practice. Even done fully automatically it would have fewer errors than many other things we have going on.

nschneid commented 11 months ago

@amir-zeldes are you saying "713.853.3102" as a telephone number and "5.1" as a subsection number should be Card but "5.1" as a value should be Frac, even though they're pronounced the same? We still run into problems with tokenization ("1 / 2" or "3 %": Frac or Card? UniversalDependencies/UD_English-EWT#337). But for now I will try to implement the least disruptive policy.

amir-zeldes commented 11 months ago

Yes, that sounds right to me. If UD has a "Frac" number type, then it should apply only to things that are actually fractions. Section numbers can have even more hierarchy, and I think we'd all agree that "5.1.1" is not a fraction. It's a coincidence that section numbers can be homographs of fractions, but there are all sorts of homographs out there that have to be tagged differently, and we still stick to the same basic distinctions the tagset makes, so I don't see why this would be different.