UniversalDependencies / UD_German-GSD

Lemmatization of ordinal numbers #24

Closed yolpsoftware closed 2 years ago

yolpsoftware commented 2 years ago

There seem to be some inconsistencies in the handling of ordinal numbers. Some are tagged as adverbs with the period kept in the lemma (word="21.", lemma="21.", pos=ADV), some are tagged as adverbs with the period dropped from the lemma (word="21.", lemma="21", pos=ADV), and some are split into the number and the period, which are then tagged NUM and PUNCT. Just looking at the "dev" dataset:

Furthermore, the days of months situation is IMHO very similar to the English case:

Am 27. Mai
On May 27th

so I would expect them to have the same treatment. Both are dates, and in both cases, the day is an ordinal number meaning "the 27th day of May".

However, in the following English dataset, days of months seem to be tagged consistently as NOUN (just search the dataset for "1st", "2nd", "3rd" etc.):

https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-dev.conllu

Examples:

I was on vacation from October 4th to October 19th and I didn't submit my timesheet yet.
My boyfriend's birthday is November 22nd and we are going to Del Frisco's for dinner.
Just to let you all know Matt has confirmed the booking for 3rd Dec i s OK.

Shouldn't the German day-of-month ordinal numbers be treated as nouns too? What's the difference to the English case?
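
For anyone who wants to reproduce the comparison, here is a rough sketch of how such a tally could be made from a CoNLL-U file. It is not the script used for this report. The function name, the two regular expressions, and the sample rows are assumptions made up for illustration, and the sample is not taken from either treebank. Note that it only sees ordinals kept as a single token, so the NUM + PUNCT split mentioned above would need a separate check.

```python
import re
from collections import Counter

# Assumed patterns (not from the treebank documentation):
# German-style "21." and English-style "4th" / "22nd".
GERMAN_ORDINAL = re.compile(r"^\d+\.$")
ENGLISH_ORDINAL = re.compile(r"^\d+(st|nd|rd|th)$", re.IGNORECASE)


def tally_ordinals(conllu_text, pattern):
    """Count (lemma style, UPOS) pairs for tokens whose FORM matches pattern."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank separator or sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        form, lemma, upos = cols[1], cols[2], cols[3]
        if pattern.match(form):
            style = "lemma=form" if lemma == form else "lemma differs"
            counts[(style, upos)] += 1
    return counts


def make_row(token_id, form, lemma, upos):
    """Build a 10-column CoNLL-U row, leaving the unused columns as '_'."""
    return "\t".join([token_id, form, lemma, upos] + ["_"] * 6)


# Invented sample rows, only to show the shape of the input and output.
SAMPLE = "\n".join([
    "# text = (invented sample)",
    make_row("1", "21.", "21.", "ADV"),
    make_row("2", "21.", "21", "ADV"),
    make_row("3", "Mai", "Mai", "NOUN"),
    make_row("4", "4th", "4th", "NOUN"),
])

german_counts = tally_ordinals(SAMPLE, GERMAN_ORDINAL)
english_counts = tally_ordinals(SAMPLE, ENGLISH_ORDINAL)
```

On a real file, one would read the text with `open(path, encoding="utf-8").read()` and pass it in place of `SAMPLE`. More than one key in the resulting counter points at the kind of inconsistency described here.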

dan-zeman commented 2 years ago

Shouldn't the German day-of-month ordinal numbers be treated as nouns too? What's the difference to the English case?

They should be treated similarly. However, the current annotation in English is wrong. They are definitely not nouns. Ordinal numerals are generally tagged as adjectives, with the additional feature NumType=Ord (see ADJ).
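
As a small illustration of that convention, the check below tests whether a token's UPOS and FEATS columns carry ADJ together with NumType=Ord. It is only a sketch: the helper names are hypothetical and this is not part of any UD validation tool.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Case=Dat|NumType=Ord' into a dict."""
    if feats == "_":
        return {}  # '_' marks an empty feature set
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def is_ordinal_adj(upos, feats):
    """True if the token is tagged ADJ and carries the feature NumType=Ord."""
    return upos == "ADJ" and parse_feats(feats).get("NumType") == "Ord"
```

For example, `is_ordinal_adj("ADJ", "NumType=Ord")` holds, while a token tagged NOUN or ADV does not pass, whatever its features.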

The inconsistencies in German should now be fixed in the dev branch. The fixes will be propagated to the next UD release.