UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Annotations for adjectives referring to proper nouns vs common nouns #994

Closed rhdunn closed 1 week ago

rhdunn commented 1 year ago

Currently, there is no way in the UD English treebanks to differentiate between adjectives that refer to common nouns and those that refer to proper nouns -- both are annotated as ADJ+JJ.

This makes the lemmatization rules inconsistent, because exceptions need to be defined for every form derived from a proper noun. This covers things like:

  1. references to places -- e.g. "the French revolution";
  2. references to religions -- e.g. "the Jewish community";
  3. references to political parties/alignments -- e.g. "the Republican party";
  4. references to people (esp. scientists) -- e.g. "in Lagrangian mechanics".

It would be helpful for things like validating lemmas to have common and proper noun adjectives differentiated, e.g. in a feature annotation. I'm not sure of any cases where the distinction is necessary (i.e. cases where a common noun adjective and a proper noun adjective share the same form). Therefore, this annotation would be lexical in the same way that NumForm is for numbers.

nschneid commented 1 year ago

Not sure I follow—could you elaborate on what this has to do with lemmatization? Are you referring to capitalization or something else?

sylvainkahane commented 1 year ago

The remark is that the distinction between proper and common nouns also exists for adjectives. But it is essentially semantic, because I don't think that adjectives referring to a "proper noun" have different syntactic properties. By the way, it is also not very clear that the distinction between proper and common nouns is syntactically motivated in most languages.

rhdunn commented 1 year ago

Yes, I'm referring to whether or not the lemma is capitalized. In common noun adjectives (west, etc.) the lemma should be lower case, whereas for proper noun adjectives (e.g. Saxon) the lemma should be capitalized.

For the other parts of speech and features it is possible to determine the casing that the lemma has. For adjectives, I'm needing to build a list of the adjectives that are capitalized/proper noun based.

nschneid commented 1 year ago

From a theoretical perspective in terms of relevance to morphosyntax, I don't see a strong need for the distinction (as @sylvainkahane points out it is semantic).

From a practical perspective, if it would essentially be redundant with the lemma being capitalized, I'm not sure what it buys us. It is easy enough for a script to check whether the lemma is capitalized. Building the list of capitalized adjectives would be necessary anyway to implement the new feature.
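A minimal sketch of the kind of script meant here, assuming standard 10-column tab-separated CoNLL-U input. The function name `capitalized_adj_lemmas` and the sample sentences are hypothetical and only serve as an illustration:

```python
from collections import Counter


def capitalized_adj_lemmas(conllu_text):
    """Count ADJ lemmas whose first character is upper case."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        token_id, lemma, upos = cols[0], cols[2], cols[3]
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if not token_id.isdigit():
            continue
        if upos == "ADJ" and lemma[:1].isupper():
            counts[lemma] += 1
    return counts


# Hypothetical sample data, not taken from any treebank.
SAMPLE = "\n".join([
    "# text = the French revolution",
    "1\tthe\tthe\tDET\tDT\t_\t3\tdet\t_\t_",
    "2\tFrench\tFrench\tADJ\tJJ\tDegree=Pos\t3\tamod\t_\t_",
    "3\trevolution\trevolution\tNOUN\tNN\tNumber=Sing\t0\troot\t_\t_",
    "",
    "# text = Western countries",
    "1\tWestern\twestern\tADJ\tJJ\tDegree=Pos\t2\tamod\t_\t_",
    "2\tcountries\tcountry\tNOUN\tNNS\tNumber=Plur\t0\troot\t_\t_",
])

print(capitalized_adj_lemmas(SAMPLE))
```

Run over a whole treebank, the resulting counts would give a first draft of the list of capitalized adjectives, which would still need manual review.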

rhdunn commented 1 year ago

So IIUC for cases where the adjective is a common noun or based on a common noun (White, West, Eastern, etc.) the lemma should always be the lower case form, and when it is a proper noun then it should always be capitalized?

leky40 commented 1 year ago

What about languages that do not have capital letters?

amir-zeldes commented 1 year ago

I think in practice the lemma of adjectives is capitalized where the normative, prescriptive orthography of the language would capitalize them. We capitalize "French" in "the French language" because most people spell it that way, and an average English tutor would probably mark it an error if we wrote "the french language" - so "French" is considered the canonical form in this context. I don't think it corresponds to a strongly motivated syntactic distinction, it's just a lexicographic distinction.

nschneid commented 1 year ago

If we were doing semantic annotation it would be reasonable to have an entity type feature. Some UD treebanks have this in MISC. There are certainly orthographic differences between languages that this would normalize across (e.g. in French I believe capitalization is used for names of persons but not languages, which would standardly be capitalized in English).

If there is a morphosyntactic category of proper adjective in some language, e.g. different affixes, then it would make sense to have a morphological feature (but maybe it wouldn't apply to English).

> So IIUC for cases where the adjective is a common noun or based on a common noun (White, West, Eastern, etc.) the lemma should always be the lower case form, and when it is a proper noun then it should always be capitalized?

See https://github.com/UniversalDependencies/UD_English-EWT/issues/131#issuecomment-787093974. We haven't developed 100% complete guidelines for English lemmatization but that is where we currently are.

rhdunn commented 1 year ago

I'm working on a lemma validator (https://github.com/rhdunn/conllu-en-validator/blob/master/validator/lemma.py) for my English validation script. The adjective question came up as a result of that -- I'll make use of the lemma exceptions to handle capitalization.
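The exception-based approach can be sketched roughly as follows. This is not the code from the linked validator; the names `CAPITALIZED_ADJ_LEMMAS` and `check_adj_lemma` are hypothetical, and the exception list only contains adjectives mentioned in this thread:

```python
# Hypothetical exception list: adjectives whose lemma keeps its capital letter.
CAPITALIZED_ADJ_LEMMAS = {"French", "Jewish", "Republican", "Lagrangian", "Saxon"}


def check_adj_lemma(lemma):
    """Return an error message for an ADJ lemma, or None if its casing looks fine."""
    if lemma in CAPITALIZED_ADJ_LEMMAS:
        return None
    # A listed adjective written with the wrong casing, e.g. "french".
    if lemma.capitalize() in CAPITALIZED_ADJ_LEMMAS:
        return f"lemma '{lemma}' should be capitalized as '{lemma.capitalize()}'"
    # Every other adjective lemma is expected to be lower case.
    if lemma != lemma.lower():
        return f"lemma '{lemma}' should be lower case: '{lemma.lower()}'"
    return None


for lemma in ["French", "french", "Western", "western"]:
    print(lemma, "->", check_adj_lemma(lemma))
```

The design choice here is that lower case is the default and capitalization is the listed exception, which matches the idea of handling capitalized adjectives through lemma exceptions.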

I'm going to raise the issues identified by this script once I have a full set of lemma validators, so I can group the issues by things like missing CorrectForm annotations, POS tagging errors, etc.

rhdunn commented 1 year ago

I'll take that issue into consideration for my validator.

dan-zeman commented 1 year ago

I agree that the distinction is mostly semantic and it is definitely not syntactic, although proper capitalization is part of the grammar, so the semantics actually has some impact on the surface form. The situation is the same with PROPN vs. NOUN in Czech: as the two tags are defined in UD, we do distinguish proper and common nouns to increase parallelism with other languages, but there is hardly much more to it than semantics and capitalization.

If there were a feature to distinguish "proper adjectives" and we wanted to use it, we would face the problem of how to delimit it. In Czech, except for possessive adjectives (e.g. Havlíčkův "Havlíček's", derived from a surname), adjectives derived from proper nouns are not capitalized (and I believe this is the case in many other languages): francouzský "French", židovský "Jewish" etc.

Note that morphological derivation can be optionally captured in MISC and we do it in Czech, using the LDeriv MISC attribute:

6  Havlíčkově  Havlíčkův  ADJ  _  Case=Loc|Gender=Fem|Gender[psor]=Masc|Number=Sing|Poss=Yes … LDeriv=Havlíček

jnivre commented 1 year ago

For me this clearly goes beyond the morphosyntax that UD can be expected to cover. This doesn't mean that it may not be worth annotating, but I then think it should go into an additional annotation layer on top of UD (like existing efforts to annotate multiword expressions and co-reference), or possibly into the (notorious) MISC column. It should definitely not go into the basic part-of-speech tag distinctions.

amir-zeldes commented 1 year ago

Agreed, I don't think this is really a universal feature - each language has different conventions for what exactly is capitalized.

@rhdunn you could try using a lexicographic resource to establish whether a lemma normally has/allows capitalization. Both dictionary.com and WordNet, for example, capitalize the entry for French, including as an adjective.

Stormur commented 11 months ago

In Latin we experimented with NamedEntity, assigning it also to ADJs, but this feature will be discontinued and revived in other forms. This technique exploits a kind of "tag polysemy", in that NamedEntity tied to referential POSs (NOUN/PROPN) means that they are words referring to such entities, while tied to relational POSs (ADJ) it means that they refer to some property of those entities.

I agree with many of the comments above that no particular annotation is needed at the morphosyntactic level, and that spelling rules might be "clues but not leads". HOWEVER, I would like to point out, as one possible reason for the original poster's perplexity, that a subclass of these adjectives is indeed a straining point for the usual distinction between inflection and derivation. All these formations are so regular that one wonders if we could just do something like

Are possessive forms in Czech, e.g. karlův/karlova/... from Karl, different in any way from the English adjectiviser -an? How do they not belong to the same lexeme?

I understand the caution used by UD for these formations, but using MISC with LDeriv or similar just seems to me like pushing very relevant paradigmatic information to the margin.

Yes, this really opens up a true can of worms, but maybe we could start experimenting at least from this subclass of "proper adjectives"?

martinpopel commented 11 months ago

> In Latin we experimented with NamedEntity

I would suggest using the Entity annotations instead (coreference can be added here as well; it is supported by the validator using --coref and by Udapi).

> Are possessive forms in Czech, e.g. karlův / karlova /... from Karl different in any way from the English adjectiviser -an?

These possessive adjectives derived from proper nouns (personal names) are capitalized in Czech, i.e. Karlův / Karlova (and the lemma Karlův is thus capitalized as well). The English translation is Karel's (for the Czech name, or Karl's for the German name). The English suffix -ian (of Latin origin, e.g. Charlesian) would rather correspond to the Czech suffix -ovský (e.g. karlovský); this kind of adjective is not capitalized in Czech and is more commonly derived from surnames, e.g. čapkovský styl means the style of Karel Čapek.

> How they do not belong to the same lexeme?

That's the tradition of keeping separate lemmas for adjectives and nouns even if the derivation is very regular. That said, this tradition has been broken on the tectogrammatical layer of annotation (of PDT and PCEDT), where it has been decided that Karlův will have the t-lemma Karel. However, it is not the goal of UD to go as deep as the t-layer.

> Lagrangian NOUN (or PROPN) with lemma lagrange

Lagrangian is perhaps not the best example because it can actually be a noun, in which case I think it should have the lemma Lagrangian. (BTW: the Czech translation is lagrangián or even lagranžián.)

> I understand the caution used by UD for these formation, but using MISC with LDeriv or similar just seems to me like pushing very relevant paradigmatic information to the margin.

I consider LDeriv quite appropriate. Of course, it cannot cover all the peculiarities of derivational morphology, but that goes clearly beyond the UD project, and there are related projects for this, e.g. UDer.
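For anyone who wants to use LDeriv in a script, here is a minimal sketch of reading it from the MISC column, assuming the usual CoNLL-U convention of pipe-separated Key=Value pairs with `_` for an empty column. The helper names `parse_misc` and `derived_from_capitalized` are hypothetical, and combining LDeriv with SpaceAfter=No in the example is only for illustration:

```python
def parse_misc(misc):
    """Parse a CoNLL-U MISC column into a dict; '_' means the column is empty."""
    if misc == "_":
        return {}
    attrs = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        # Keep bare keys without '=' as well, mapped to None.
        attrs[key] = value if sep else None
    return attrs


def derived_from_capitalized(misc):
    """True if LDeriv is present and its value starts with an upper-case letter."""
    base = parse_misc(misc).get("LDeriv")
    return bool(base) and base[:1].isupper()


# Illustrative MISC values; only LDeriv=Havlíček comes from the example in this thread.
print(parse_misc("LDeriv=Havlíček|SpaceAfter=No"))
print(derived_from_capitalized("LDeriv=Havlíček"))
print(derived_from_capitalized("_"))
```

A check like this only says that the base lemma is capitalized, which is an orthographic clue and not a definition of a "proper adjective".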