UniversalDependencies / UD_English-GUMReddit

CD X,000 is not really NumForm=Word #14

Closed: rhdunn closed this issue 12 months ago

rhdunn commented 12 months ago

NumForm=Word is used for words like "five". The following token looks more like NumForm=Combi, since it combines letters and digits:

ERROR: Sentence GUM_reddit_callout-10 token 11 -- CD/NumForm=Word lemma 'X,000' does not match lowercase-form applied to form 'X,000', expected 'x,000'
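
The rule behind this message can be sketched roughly as follows. This is a minimal reconstruction from the error text only, not the validator's actual code; the function names `expected_word_lemma` and `check_word_lemma` are hypothetical.

```python
from typing import Optional


def expected_word_lemma(form: str) -> str:
    """Assumed rule for NumForm=Word: the lemma is the lowercased form."""
    return form.lower()


def check_word_lemma(form: str, lemma: str) -> Optional[str]:
    """Return an error message if the lemma breaks the assumed rule, else None."""
    expected = expected_word_lemma(form)
    if lemma != expected:
        return (
            f"lemma '{lemma}' does not match lowercase-form "
            f"applied to form '{form}', expected '{expected}'"
        )
    return None


# The token from this issue: form and lemma are both 'X,000'.
print(check_word_lemma("X,000", "X,000"))
```

Under this assumed rule, a token such as form "Five" with lemma "five" passes, while "X,000" fails because the uppercase placeholder survives in the lemma.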

amir-zeldes commented 12 months ago

Not sure about this one... I mean, X is a variable here so it's more math than letters. But it would be simpler to just go by orthographic criteria...

On the other hand, and this is the reason for the error, Combi is currently only used for ordinal numbers. If we change this, it would be the only Card + Combi in the entire corpus, and I'm not sure we want to allow it just for one case. Thoughts, @nschneid?

nschneid commented 12 months ago

It's a weird, nonstandard case - not a standard combined form like ordinals. I'd probably call it Card because X is a placeholder for a digit.

amir-zeldes commented 12 months ago

Card is fine, but the question is whether it's Combi. That would then be the only Combi Card in the corpus, and I'm not sure this really justifies that. (It really behaves as if X is a number; it doesn't spell a number using letters.)

nschneid commented 12 months ago

Yeah I'd prefer the simpler value (just Card). Let's not create a whole new class of numbers just for this one weird example.

rhdunn commented 12 months ago

So would it make sense for the lemma to be X000, i.e. remove the comma from the form, as is done with the other NumType=Card|NumForm=Digit tokens? In that case, it would be a special, exceptional variant of Digit, since X is being used as a placeholder for any digit value.
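
As a rough illustration of that lemma proposal (an assumption based only on the description above, with a hypothetical helper name `digit_lemma`):

```python
def digit_lemma(form: str) -> str:
    """Assumed convention for NumForm=Digit: drop thousands separators."""
    return form.replace(",", "")


print(digit_lemma("10,000"))  # an ordinary Digit token
print(digit_lemma("X,000"))   # the placeholder token from this issue
```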

nschneid commented 12 months ago

Sorry, I always get NumForm and NumType confused. I thought the proposal was NumType=Card,Combi, a new hybrid value of one feature. I think NumForm tends to be a pretty superficial feature defined in terms of character sets, not what the characters represent, so NumType=Card|NumForm=Combi would be fine.
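
A purely orthographic criterion of this kind might look like the sketch below. It is an illustration under stated assumptions, not the corpus's actual procedure: the function name `guess_num_form` is hypothetical, and Roman numerals and other special cases are deliberately ignored.

```python
from typing import Optional


def guess_num_form(form: str) -> Optional[str]:
    """Guess NumForm from character classes alone (simplified assumption)."""
    has_digit = any(ch.isdigit() for ch in form)
    has_letter = any(ch.isalpha() for ch in form)
    if has_digit and has_letter:
        return "Combi"  # mixed letters and digits, e.g. "4th" or "X,000"
    if has_digit:
        return "Digit"  # digits only, punctuation ignored, e.g. "1,000"
    if has_letter:
        return "Word"   # letters only, e.g. "five"
    return None         # no letters or digits at all


for token in ["five", "1,000", "4th", "X,000"]:
    print(token, guess_num_form(token))
```

On this criterion, "X,000" lands in Combi regardless of whether X stands for a digit, which matches the superficial reading of the feature described above.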

amir-zeldes commented 12 months ago

Hm, fine - done