UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Num years #344

Closed AngledLuffa closed 2 years ago

AngledLuffa commented 2 years ago

I forget... what did we decide for things like 1960s, 1990s, etc? Leave them as NOUN/NNS? Do anything with NumForm and NumType? (Currently they are Number=Plur and no other features)

How about M5J 1S9? Ignore for now?

Catch - 22s with 22s as its own token?

nschneid commented 2 years ago

I forget... what did we decide for things like 1960s, 1990s, etc? Leave them as NOUN/NNS? Do anything with NumForm and NumType? (Currently they are Number=Plur and no other features)

I think NOUN/NNS with NumForm=Digit|NumType=Card

How about M5J 1S9? Ignore for now?

Yeah for now

Catch - 22s with 22s as its own token?

Ooh. Tricky because the meaning is so non-compositional. I guess the policy is to tokenize hyphens except for productive prefixes and suffixes, which this isn't...so I guess "22s" should be treated like "1960s".
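
For illustration only, the decision above (keep NOUN/NNS, add NumForm=Digit|NumType=Card next to the existing Number=Plur) could be sketched as follows. This is not the script used for the actual change: the helper names `merge_feats` and `tag_decade`, and the pattern `\d+s`, are assumptions made for the example.

```python
import re

# Assumption: decade-like tokens are digits followed by "s" (1960s, 22s).
DECADE_RE = re.compile(r"\d+s")

def merge_feats(feats, extra):
    """Merge extra features into a CoNLL-U FEATS string.

    Features are re-sorted case-insensitively by name, as CoNLL-U expects.
    """
    pairs = {} if feats == "_" else dict(f.split("=", 1) for f in feats.split("|"))
    pairs.update(extra)
    if not pairs:
        return "_"
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in ordered)

def tag_decade(line):
    """Add NumForm/NumType to a decade-like NOUN/NNS token line."""
    cols = line.split("\t")
    if len(cols) != 10:  # comment lines and blank lines pass through
        return line
    form, upos, xpos = cols[1], cols[3], cols[4]
    if DECADE_RE.fullmatch(form) and upos == "NOUN" and xpos == "NNS":
        cols[5] = merge_feats(cols[5], {"NumForm": "Digit", "NumType": "Card"})
    return "\t".join(cols)
```

With this sketch, a token whose FEATS column is `Number=Plur` ends up with `Number=Plur|NumForm=Digit|NumType=Card`; tokens that are not NOUN/NNS are left untouched.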

AngledLuffa commented 2 years ago

Alright, did decades and catch-22s

sylvainkahane commented 2 years ago

Just a remark (I didn't follow the whole discussion): quite often numerals are used as PROPNs. One solution would be to have upos=NUM but use ExtPos=PROPN. It is a way to indicate that the numeral is a NUM but behaves as a PROPN, and appears as a PROPN to its governor. That is what ExtPos means (ExtPos = external POS).

nschneid commented 2 years ago

@sylvainkahane I think UD's PROPN category is problematic; it is easier to make the common vs. proper distinction at the phrase level, so we could consider ExtPos=PROPN in the future. For now, we are limiting PROPN to words in a proper name that would otherwise be NOUN, while any word consisting of digits is automatically NUM.

@AngledLuffa How hard would it be to add a feature in the MISC column to flag years, so that in the future we could perform additional processing? E.g. Year=Yes (both for full years and decades, whether NUM or NOUN).
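
As a rough idea of what such a flag would involve on the file side, here is a minimal sketch of setting a key in the MISC column. `Year=Yes` is the name proposed above; the helper `add_misc` is hypothetical, and deciding which tokens actually count as years is a separate problem that the sketch does not address.

```python
def add_misc(misc, key, value):
    """Set key=value in a CoNLL-U MISC string, replacing any earlier value."""
    items = [] if misc == "_" else misc.split("|")
    # Drop an existing entry for the same key so it is not duplicated.
    items = [item for item in items if item.split("=", 1)[0] != key]
    items.append(f"{key}={value}")
    return "|".join(items)

print(add_misc("_", "Year", "Yes"))              # Year=Yes
print(add_misc("SpaceAfter=No", "Year", "Yes"))  # SpaceAfter=No|Year=Yes
```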

amir-zeldes commented 2 years ago

I think NOUN/NNS with NumForm=Digit|NumType=Card

Agreed. I wanted to add that this is how LDC corpora behave as well, and it looks like GUM also follows this without exception.

How hard would it be to add a feature in the MISC column to flag years

If you need test data for a heuristic, note that GUM has absolute time expression resolution, so if something is a year you can see that in the <date when="yyyy(-mm-dd)?"> XML annotations, which also appear in MISC.

Also if you want to do this automatically using NLP, @nitinvwaran built a nice ensemble tool over HeidelTime and a couple of other resolvers to tag the AMALGUM corpus, which you can find here

nschneid commented 2 years ago

@amir-zeldes thanks, I am not proposing we do full date tagging right now, just that if @AngledLuffa has already identified the years in the course of adding numeric features, we might as well mark them as years for future reference

amir-zeldes commented 2 years ago

I am not proposing we do full date tagging right now

No, I didn't think that - I just meant you could do a sanity check (for precision) and catch missing cases by running an NLP tool (for recall), or test the heuristic on GUM if you like.
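
The kind of check meant here could look roughly like the sketch below: compare the set of tokens a heuristic flags as years with the set flagged by a reference (an NLP tool's output, or GUM's date annotations). How tokens are identified, for example as (sentence id, token id) pairs, is an assumption of the example.

```python
def precision_recall(predicted, reference):
    """Compare two sets of token identifiers flagged as years."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

def missing_cases(predicted, reference):
    """Tokens the reference marks as years but the heuristic missed."""
    return sorted(reference - predicted)
```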

AngledLuffa commented 2 years ago

I am not proposing we do full date tagging right now, just that if @AngledLuffa has already identified the years in the course of adding numeric features, we might as well mark them as years for future reference

That is not what I did, though... years which were entirely digits just got caught up in the regular expression to label all tokens which were entirely digits. The ones with apostrophes or s at the end are separate, but easy enough to find again
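
To make the distinction concrete, here is a sketch of the two kinds of pattern described. The regular expressions actually used are not shown in this thread, so both patterns and the function name `classify` are assumptions.

```python
import re

# Assumption: "entirely digits" means one or more ASCII digits and nothing else.
ALL_DIGITS = re.compile(r"\d+")
# Assumption: decade-like forms such as 1960s, '90s, 1960's, 22s.
DECADE_LIKE = re.compile(r"'?\d+'?s")

def classify(form):
    """Return a coarse label for a token form."""
    if ALL_DIGITS.fullmatch(form):
        return "digits"       # years land here together with every other number
    if DECADE_LIKE.fullmatch(form):
        return "decade_like"  # easy to find again with a separate pattern
    return "other"
```

Under these patterns a form like 1999 gets the same label as 42 or 10000, which is why an all-digit match alone does not identify years.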

nschneid commented 2 years ago

OK then never mind about a MISC feature.

Is this ready to merge? Can I squash?

AngledLuffa commented 2 years ago

Suggest not squashing, but good to merge