Closed rhdunn closed 6 months ago
This could be standardized across (at least) European languages, but a possible obstacle is that language-specific lemmatization customs differ.
The lemma definitely does not have to be all-lowercase if the base form is normally written capitalized. For example, London should not be lemmatized london. But I know that some treebanks do even this. And if we accept that the canonical form (lemma) can contain uppercase characters, one may still question whether uppercase or lowercase is the prototypical form of Roman numerals. For me it is uppercase.
There are several examples in EWT where the Roman numeral form is lower case and the lemma is the same. Those and the PUD cases would need to be changed to use an uppercase lemma.
I would be open to normalizing to one form, and would have also expected all caps. Actually looking at GUM and GENTLE, they only have uppercase occurrences, so we wouldn't need to change anything there. Did you see a lower case one?
As far as I can tell, only EWT has lower case roman numeral forms in sentences. The others all have upper case. EWT is keeping the case between the upper/lowercase forms and lemmas unchanged, while PUD is normalizing uppercase forms to lowercase.
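To check this kind of claim across treebanks, a rough sketch like the following could list tokens whose form looks like a Roman numeral and compare the casing of form and lemma. It assumes standard 10-column, tab-separated CoNLL-U lines. The regular expression and the function name `roman_case_report` are illustrative assumptions, and the loose pattern also matches ordinary words such as the pronoun "I" or "mix", so the output would need manual filtering.

```python
import re

# Assumption: a deliberately loose pattern for Roman numerals. It also
# matches ordinary words such as "I" (pronoun), so results need review.
ROMAN_RE = re.compile(r"^[IVXLCDM]+$", re.IGNORECASE)


def roman_case_report(conllu_lines):
    """Yield (form, lemma, status) for tokens whose form looks like a Roman numeral."""
    for line in conllu_lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        tok_id, form, lemma = cols[0], cols[1], cols[2]
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in tok_id or "." in tok_id:
            continue
        if not ROMAN_RE.match(form):
            continue
        if lemma == form:
            status = "case-preserved"
        elif lemma == form.lower():
            status = "lowercased"
        elif lemma == form.upper():
            status = "uppercased"
        else:
            status = "other"
        yield form, lemma, status
```

Feeding it the lines of a `.conllu` file and counting the `status` values per treebank would show whether a treebank preserves or normalizes the case.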
I can certainly do whatever to PUD. No strong preference on the result, but I do like having lemmas be consistent between the treebanks.
Because casing sometimes disambiguates referents (an outline or legal document might use "III" for a large section and "iii" for a subsection), I would suggest not normalizing Roman numeral case in the lemma, except to make it consistent across the characters ("IIi" is probably a typo meaning "III", and "charles ii" is presumably a nonstandard way of writing "Charles II"). If we wanted to normalize in general, maybe a MISC feature with the decimal form of the number?
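As a minimal sketch of that idea, the code below makes the casing consistent across the characters of a numeral and computes a decimal value that could go into MISC. The attribute name `NumValue` is purely hypothetical (not an established UD feature), the helper names are made up for illustration, and treating any mixed-case numeral as uppercase is an assumption based on the "IIi" example.

```python
import re

# Strict pattern for standard Roman numerals from 1 to 3999 (uppercase only).
STRICT_ROMAN = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(form):
    """Return the decimal value of a Roman numeral, or None if it is not one."""
    upper = form.upper()
    if not upper or not STRICT_ROMAN.match(upper):
        return None
    total = 0
    for i, ch in enumerate(upper):
        value = VALUES[ch]
        # A smaller value before a larger one is subtracted (IV, IX, XC, ...).
        if i + 1 < len(upper) and VALUES[upper[i + 1]] > value:
            total -= value
        else:
            total += value
    return total


def consistent_case_lemma(form):
    """Keep all-upper or all-lower forms; assume mixed case is a typo for uppercase."""
    if form.isupper() or form.islower():
        return form
    return form.upper()


def misc_entry(form):
    """Build a hypothetical MISC attribute with the decimal value, or None."""
    value = roman_to_int(form)
    return None if value is None else f"NumValue={value}"
```

With this, "III" and "iii" keep distinct lemmas but share the same decimal value, while "IIi" gets the lemma "III".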
Lemma definitely does not have to be all-lowercase if the base form is normally written capitalized. For example, London should not be lemmatized london. But I know that some treebanks do even this.
I think there are extremely good reasons for this, better than those for keeping the capitalisation, and incidentally it is what is done in (at least three) Latin treebanks.
But Roman numerals are probably slightly different, in that they are in fact symbols. We should just choose whether to keep them lower- or uppercase. Since the former form is used, I would probably lean towards that for more uniformity.
Because casing sometimes disambiguates referents (an outline or legal document might use "III" for a large section and "iii"
I do not think there is any difference at all between those "3s". I am not convinced we should let "orthographic tricks" percolate into lemmatisation; that is the major point.
in general, maybe a MISC feature with the decimal form of the number
I think this is needed in general for numeric values, and to cover all possible values, the only current solution seems to be to put it into MISC, as discussed recently.
Looking at the English treebanks for Roman numerals like I and IX, EWT and GUM preserve the casing while PUD normalizes the forms to lower case. What should be used here? I would have thought that a lower case lemma would be used, as is done for nouns and other tokens.