Open AngledLuffa opened 11 months ago
This looks to me like more evidence that maybe we don't need NumType=Frac and it should just be merged with NumType=Card. Trying to discern whether "." separates sections and subsections versus whole numbers and fractional parts seems like overkill for a morphological representation.
@amir-zeldes @rhdunn thoughts on this? I believe GUM has NumType=Frac
These are not all strictly cardinal numbers, which are defined as the counting/natural numbers [1] [2]. The ordinal numbers are equivalent for positions/ordering.
We have 3 general types/groups here:
Note also that one of the treebanks -- I can't recall which -- has a case of "hundredth" marked as NumType=Frac instead of NumType=Ord.
It would be helpful if the data could differentiate these types of number. They are separate morphological features. For example, the cardinals would have a lemma that removes the dots and commas from their form, whereas the fractional and section numbers would keep them, since in the fractional case the dot is important.
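To make the lemma proposal concrete, here is a minimal sketch in Python. The function name `number_lemma` and the `kind` labels are hypothetical, invented for illustration; this is not an existing UD tool or an agreed policy, just one reading of the suggestion above.

```python
def number_lemma(form: str, kind: str) -> str:
    """Return a lemma for a numeric token under the proposal above.

    `kind` is a hypothetical label: "cardinal", "fraction", or "section".
    """
    if kind == "cardinal":
        # In a cardinal, dots and commas are only grouping marks, so drop them.
        return form.replace(",", "").replace(".", "")
    if kind in ("fraction", "section"):
        # Here the dot carries meaning, so the form is kept unchanged.
        return form
    raise ValueError(f"unknown kind: {kind!r}")
```

Under these assumptions, `number_lemma("10,000,000", "cardinal")` gives `"10000000"`, while `number_lemma("5.1", "fraction")` stays `"5.1"`. The sketch takes the number type as input, which is exactly the distinction the annotation would need to supply.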
With the section case, I've also been meaning to raise an issue around the grouping of single-letter abbreviations, like in A.A. Milne. It would be helpful if these (like the section case) were separate tokens, as they are separate words, or in the section case separate numbers.
[1] https://www.merriam-webster.com/dictionary/cardinal%20number
[2] https://en.wikipedia.org/wiki/Cardinal_number
https://github.com/UniversalDependencies/docs/issues would be a good place for discussion of the inventory of NumType values (they are not specific to English).
Tokenization tends to be language-specific. In general, for English, I would expect to separate things that are either (1) sentence-organizing punctuation (commas, quotation marks, etc.), (2) clitics, (3) hyphenated linguistic words, or (4) units often/usually written with a space between them. (Or other things where there's a well-established history of separating them in tokenizers, such as currency symbols with numbers.) Of course there will be difficult cases, but in general I do not see tokenization or UD syntax as a way to express the full "grammar" of subsystems like numerical dates or numerical section-subsection notation.
@amir-zeldes @rhdunn thoughts on this? I believe GUM has NumType=Frac
It's a long-standing UD feature, so I would keep it. I don't think it's very difficult to recognize in practice. Even done fully automatically it would have fewer errors than many other things we have going on.
@amir-zeldes are you saying "713.853.3102" as a telephone number and "5.1" as a subsection number should be Card but "5.1" as a value should be Frac, even though they're pronounced the same? We still run into problems with tokenization ("1 / 2" or "3 %": Frac or Card? UniversalDependencies/UD_English-EWT#337). But for now I will try to implement the least disruptive policy.
Yes, that sounds right to me. If UD has a "Frac" number type, then it should apply only to things that are actually fractions. Section numbers can have even more hierarchy, and I think we'd all agree that "5.1.1" is not a fraction. It's a coincidence that section numbers can be homographs of fractions, but there are all sorts of homographs out there that have to be tagged differently, and we still stick to the same basic distinctions the tagset makes, so I don't see why this would be different.
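As a rough illustration of where the homograph problem sits, the following Python sketch sorts numeric strings by surface shape only. The function name `numtype_by_shape` and the label "ambiguous" are hypothetical, not part of UD. It assumes the policy described above, in which section and phone numbers are Card and only true fractional values are Frac.

```python
import re

def numtype_by_shape(form: str) -> str:
    """Guess a NumType from the surface shape of a numeric string alone."""
    if re.fullmatch(r"\d+", form):
        # Plain digits: an ordinary cardinal.
        return "Card"
    if re.fullmatch(r"\d+(\.\d+){2,}", form):
        # Two or more dots ("5.1.1", "713.853.3102"): not a fraction.
        return "Card"
    if re.fullmatch(r"\d+\.\d+", form):
        # One dot ("5.1"): a fractional value or a section number.
        # Shape cannot decide; the surrounding context has to.
        return "ambiguous"
    return "other"
```

Only the single-dot case is left open, so a context-based decision (automatic or manual) would be needed just for strings like "5.1".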
This is written with European style "." between thousands, which is different from the rest of EWT. Generally commas are removed in the lemmas, so I suppose this should have a lemma of 10000000
Phone numbers are unusual:
There's also section numbers...
whereas the change I just submitted to update NumType=Frac for a bunch of numbers changed section numbers with 2 numbers to Frac, so I think there's some room for improvement there