Closed AngledLuffa closed 2 years ago
I forget... what did we decide for things like 1960s, 1990s, etc? Leave them as NOUN/NNS? Do anything with NumForm and NumType? (Currently they are Number=Plur and no other features)
I think NOUN/NNS with NumForm=Digit|NumType=Card
How about M5J 1S9? Ignore for now?
Yeah for now
Catch - 22s with 22s as its own token?
Ooh. Tricky because the meaning is so non-compositional. I guess the policy is to tokenize hyphens except for productive prefixes and suffixes, which this isn't...so I guess "22s" should be treated like "1960s".
Alright, did decades and catch-22s
Just a remark (I didn't follow all the discussion): Quite often numerals are used as PROPNs. One solution would be to have upos=NUM, but to use ExtPos=PROPN. It is a way to indicate that the numeral is a NUM but it behaves as a PROPN and for its governor it appears as a PROPN. It is what ExtPos means (ExtPos = external POS).
@sylvainkahane I think UD's PROPN category is problematic; it is easier to make the common vs. proper distinction at the phrase level, so we could consider ExtPos=PROPN in the future. For now, we are limiting PROPN to words in a proper name that would otherwise be NOUN, while any word consisting of digits is automatically NUM.
@AngledLuffa How hard would it be to add a feature in the MISC column to flag years, so that in the future we could perform additional processing? E.g. Year=Yes (both for full years and decades, whether NUM or NOUN).
I think NOUN/NNS with NumForm=Digit|NumType=Card
Agreed. I wanted to add that this is how LDC corpora behave as well, and it looks like GUM also follows this without exception.
How hard would it be to add a feature in the MISC column to flag years
If you need test data for a heuristic, note that GUM has absolute time expression resolution, so if something is a year you can see that in the <date when="yyyy(-mm-dd)?"> XML annotations, which also appear in MISC.
Also if you want to do this automatically using NLP, @nitinvwaran built a nice ensemble tool over HeidelTime and a couple of other resolvers to tag the AMALGUM corpus, which you can find here
@amir-zeldes thanks, I am not proposing we do full date tagging right now, just that if @AngledLuffa has already identified the years in the course of adding numeric features, we might as well mark them as years for future reference
I am not proposing we do full date tagging right now
No, I didn't think that - I just meant you could do a sanity check (for precision) and catch missing cases by running an NLP tool (for recall), or test the heuristic on GUM if you like.
I am not proposing we do full date tagging right now, just that if @AngledLuffa has already identified the years in the course of adding numeric features, we might as well mark them as years for future reference
That is not what I did, though... years which were entirely digits just got caught up in the regular expression to label all tokens which were entirely digits. The ones with apostrophes or s at the end are separate, but easy enough to find again
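For readers following along, the split described here could be sketched roughly as below. This is not the script that was actually run: the function names, the exact patterns, and the choice of NUM for all-digit tokens (taken from the policy stated earlier in the thread) are assumptions for illustration. All-digit tokens get NumForm=Digit|NumType=Card, while tokens with an apostrophe or a trailing s (1960s, '60s, 22s) are matched separately and keep Number=Plur as NOUN/NNS.

```python
import re

# Hypothetical patterns; the real regular expressions are not shown in the thread.
ALL_DIGITS = re.compile(r"[0-9]+")          # e.g. 1984
DIGITS_PLUS_S = re.compile(r"'?[0-9]+'?s")  # e.g. 1960s, '60s, 1960's, 22s


def numeric_features(token):
    """Return (upos, feats) for digit-based tokens, or None for anything else."""
    if ALL_DIGITS.fullmatch(token):
        return "NUM", {"NumForm": "Digit", "NumType": "Card"}
    if DIGITS_PLUS_S.fullmatch(token):
        # Decades and forms like "22s": plural nouns carrying the numeric features.
        return "NOUN", {"Number": "Plur", "NumForm": "Digit", "NumType": "Card"}
    return None


def format_feats(feats):
    """Serialize features for the FEATS column, sorted case-insensitively by name."""
    pairs = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in pairs)


if __name__ == "__main__":
    for tok in ["1984", "1960s", "'60s", "22s", "M5J"]:
        result = numeric_features(tok)
        if result is None:
            print(tok, "-> left unchanged")
        else:
            upos, feats = result
            print(tok, "->", upos, format_feats(feats))
```

Mixed tokens such as M5J fall through both patterns and are left alone, which matches the "ignore for now" decision above.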
OK then never mind about a MISC feature.
Is this ready to merge? Can I squash?
Suggest not squashing, but good to merge