UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Inflectional classes: which labels? #748

Open Stormur opened 3 years ago

Stormur commented 3 years ago

I'm probably preceding @dan-zeman in the quest for standardising features, and I hope not to duplicate older issues (I couldn't find any by searching for the inflection keyword).

So, I was wondering which (language-specific) features could be used to mark the inflectional classes of words, and whether any such feature already exists.

It is true that such information may not be purely morphological, but rather lexical, as it is often arbitrary from a synchronic point of view: it does not necessarily depend on a given part of speech or gender, even though there are more or less strong correlations. This orthogonality is what would motivate such a feature. A further, and not secondary, reason is that we often find this information in the Latin treebanks we maintain, and it would be a pity to lose it during conversion to the UD standard.

We were experimenting with the already existing NounClass, envisioning a corresponding VerbClass (not yet attested), in parallel to Variant or Form, both already existing, mainly for Czech and Irish.

One thing that drew us towards NounClass (for now Bantu-only) is that, according to its description: 1) we are in the same realm of lexical properties expressed through morphology; 2) a language like Latin already has the independent Gender feature; 3) there is synchronic unpredictability (as explained for Wolof).

What sets Latin classes apart from Bantu ones is that they are not involved in agreement phenomena... I don't know how relevant this is. I have to admit that I am still a little confused about the distinction between NounClass and Gender, but if they are indeed considered separate phenomena, our first concern was to reuse something that was already there. If, on the contrary, they are not, how feasible would it be to let Gender become just a subtype of NounClass?

Other considered features: NounType and VerbType are already taken and are meant for different things; there is a mysterious Uninflect for uk (Ukrainian?).

I am curious to know what the UD community thinks about this, if there are any suggestions or proposals, and if some treebank has already dealt with such issues! :slightly_smiling_face:

dan-zeman commented 3 years ago

The noun classes in Niger-Congo languages, as I understand them, are something else than inflectional classes in Indo-European languages. While they do correspond to nominal inflection at least in Bantu, they are also a property of nouns that can be cross-referenced by other words, most notably verbs. That makes them important in the language system, beyond morphology.

BTW we have copied the values Bantu1-Bantu23 from the specification of UniMorph but they have not been used in a UD treebank yet (we still don't have Bantu languages). In contrast, there are also 12 classes for Wolof (Wol1-Wol12), and these have been applied to real data.
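As a rough illustration of the two value inventories just mentioned, the sketch below builds them programmatically and checks a value against them. Only the value names (Bantu1-Bantu23, Wol1-Wol12) come from this thread; the dictionary keys and the helper `is_known_noun_class` are hypothetical names chosen for the example.

```python
# Value inventories as described above: Bantu1-Bantu23 and Wol1-Wol12.
# The keys "bantu" and "wolof" are labels invented for this sketch.
NOUN_CLASS_VALUES = {
    "bantu": [f"Bantu{i}" for i in range(1, 24)],
    "wolof": [f"Wol{i}" for i in range(1, 13)],
}


def is_known_noun_class(value: str) -> bool:
    """Return True if `value` appears in either inventory (hypothetical helper)."""
    return any(value in values for values in NOUN_CLASS_VALUES.values())


print(len(NOUN_CLASS_VALUES["bantu"]), len(NOUN_CLASS_VALUES["wolof"]))
print(is_known_noun_class("Wol7"), is_known_noun_class("Bantu24"))
```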

Stormur commented 3 years ago

OK. This is somewhat unfortunate, because class is really the right generic term for inflectional paradigms.

Do you have other naming suggestions? What do you think of:

Neither exists at the moment.

dan-zeman commented 3 years ago

Why not simply InflClass?

But I'm a bit skeptical on maintaining one set of values across a family. Perhaps it would be enough to keep the name of the feature cross-linguistically, while the values would always be language-specific. Anyways, we have yet to see how many other people will actually want to use it. For instance, the feature would be perfectly relevant for Czech but I don't have the data in the original annotation and I'm not going to try to obtain it.

Stormur commented 3 years ago

> Why not simply InflClass?

There should be a distinction between verbal and nominal inflections.

> But I'm a bit skeptical on maintaining one set of values across a family. Perhaps it would be enough to keep the name of the feature cross-linguistically, while the values would always be language-specific. Anyways, we have yet to see how many other people will actually want to use it. For instance, the feature would be perfectly relevant for Czech but I don't have the data in the original annotation and I'm not going to try to obtain it.

I think it is better to open it to the widest possible applications! :slightly_smiling_face: We have this difficulty for some Latin treebanks, too, but I feel that at least the potential of having such a feature is a good thing.

PS: I think the universal label is still valid!

dan-zeman commented 3 years ago

> > Why not simply InflClass?
>
> There should be a distinction between verbal and nominal inflections.

Why? Do you intend to combine both on one word? And what would you do with inflections that are neither nominal nor verbal? If the distinction is necessary, it is also possible to start values of nominal inflection with "N" and verbal with "V".
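To make the prefix idea concrete, here is a minimal sketch that reads a CoNLL-U FEATS string and groups the values of a single InflClass feature by their first letter. It relies only on the standard FEATS syntax (`Name=Value` pairs joined by `|`, multiple values joined by `,`). The feature InflClass itself and the values `N3` and `V1` are assumptions made up for illustration; nothing like them exists in UD at this point.

```python
def parse_feats(feats: str) -> dict[str, list[str]]:
    """Parse a CoNLL-U FEATS string into {feature name: [values]}."""
    if feats == "_":  # "_" marks an empty FEATS column
        return {}
    parsed = {}
    for pair in feats.split("|"):
        name, _, values = pair.partition("=")
        parsed[name] = values.split(",")
    return parsed


def split_infl_classes(feats: str) -> dict[str, list[str]]:
    """Group values of the hypothetical InflClass feature by N/V prefix."""
    groups = {"nominal": [], "verbal": [], "other": []}
    for value in parse_feats(feats).get("InflClass", []):
        if value.startswith("N"):
            groups["nominal"].append(value)
        elif value.startswith("V"):
            groups["verbal"].append(value)
        else:
            groups["other"].append(value)
    return groups


# Hypothetical annotation: one word carrying a nominal and a verbal class.
example = "Case=Nom|InflClass=N3,V1|VerbForm=Part"
print(split_infl_classes(example))
```

Under this scheme one feature can carry several classes on the same word, at the cost of encoding the nominal/verbal distinction inside the value names.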

Stormur commented 3 years ago

Yes, as I explain in the first post: inflectional classes might stack! In Latin, this happens for participles:

The fact is that both pieces of information are relevant from an inflectional point of view.

I can imagine that other kinds of stacking can happen in ways that I cannot foresee, and that they are of course not limited to verbal conjugation + nominal declension. Multiple nominal inflectional paradigms can probably also apply at the same time. This was my original motivation for having two features. But it has occurred to me that we may solve this, while allowing for as much flexibility as possible, in the following way, starting from your proposals: