UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

many, few: DET vs. ADJ #786

Open nschneid opened 3 years ago

nschneid commented 3 years ago

I am hesitant to bring this up after the long discussion at UniversalDependencies/UD_English-EWT#170, but I just noticed that the guidelines stipulate that quantifiers many and few should be tagged as DET, whereas the English corpora (EWT and GUM, at least) use ADJ.

In English, at least, various distinctions can be drawn between quantifiers:

amir-zeldes commented 3 years ago

Good point. I agree with your arguments why they should be ADJ, and I imagine that's why they are JJ in PTB tags. Modification with "very/so" is also a strong piece of evidence. And you can also see the typical PDT-DT-JJ pattern in NPs like "all the many reasons". Additionally, there are constructions where another adjective may precede "many" & co, which is rare but also suggests they are also adjectives, e.g.:

And so on. And occasionally they are repeatable like adjectives:

So I think these really are better described as adjectives, and the guidelines should be revised.

nschneid commented 3 years ago

Additionally, there are constructions where another adjective may precede "many" & co, which is rare but also suggests they are also adjectives, e.g.:

  • For the next several days , she looked up all the scriptures she could find that used the word study or meditate

Good point. However, it may be worth noting that cardinal numbers can also go in the quantity slot—"the next 3 days"—and I can't think of canonical adjectives that work in this slot (*the next numerous days). So this may just be a special construction which puts an ADJ before a narrow subset of quantity modifiers, not ADJs in general.

And occasionally they are repeatable like adjectives:

"Much" also fits here, even though unlike "many", "few", and "several", it cannot combine with the definite article: *the much sugar.

This repetition construction seems to be a form of intensification, so I suspect that it goes with the degree semantics of these words, and would give the same predictions as the degree modifier test.

Obligatory consultation of CGEL, pp. 539-540:

[image: excerpt from CGEL, pp. 539-540]

So, heading partitives is considered an essential test for determinatives. The so-called "degree determinatives" are modifiable with "very" etc., like adjectives, but unlike normal attributive adjectives also license "so" modification:

[images: two further CGEL excerpts on degree determinatives]

(For completeness I included the bits about "sufficient" and exclamative "what". The DET/ADJ split for "sufficient" to me seems overly subtle for UD; I would keep it just ADJ. The part about exclamative "what" pertains to UniversalDependencies/UD_English-EWT#103.)

But back to the question in this issue: for "many", "few", "much" (ADJ/ADV currently), and "several", do we go with the guidelines and CGEL or with the status quo annotations? I suppose it is just a question of which tests to prioritize from a UD perspective.

If UD had a hard rule that DETs cannot be modified, it would be one thing; but we have advmod for "nearly/ADV all/DET", so allowing degree-modifying adverbs of DETs ("very many", "so few") wouldn't be a huge departure.

Is there another UD principle that should apply here?

sylvainkahane commented 3 years ago

OK guys, here we face a big problem. The current definition of DET on the UD page is completely semantic. It is even said that languages without determiners such as Czech must use the DET tag for words which are tagged DET in other languages "in order to annotate the same thing the same way across languages".

I really dislike this page on DET. I think that POS should be defined as distributional classes and not as semantic classes. I agree with you that many and few are distributionally adjectives. But if you want to tag them ADJ you must completely change the current definition of DET on the UD page and you must verify that the owners of the 180 UD treebanks agree.

nschneid commented 3 years ago

Pronominal words are pronouns, determiners (articles and pronominal adjectives), pronominal adverbs (where, when, how), and in traditional grammars of some languages also pronominal numerals (how much)....We understand the DET class as pro-adjectives, which is a slightly broader sense than what is usually regarded as determiners in English. In particular, it is possible that one nominal is modified by more than one determiner.

I gather that the criteria for DET are closely related to the PronType feature. @dan-zeman, is the idea that every DET token should have a PronType?

https://universaldependencies.org/u/overview/morphology.html#pronominal-words notes

the borderline between indefinite determiners and adjectives is slightly fuzzy. Related languages should synchronize the lists of words they treat as pronominal.

So, in these guidelines I don't see a syntactic justification for why "many", "few", "several" should be considered pro-adjectives/determiners rather than full adjectives. By the tests laid out earlier in the thread they seem borderline. I think it may be necessary to develop and document language-specific criteria.

nschneid commented 3 years ago

Let me try to formulate the two positions:

Position 1:

Position 2:

What about "hardly any" and "nearly all"? "Any" and "all" are currently tagged as DET, but does the fact that they can be modified contradict Position 1?

amir-zeldes commented 3 years ago

I personally think the arguments for ADJ are stronger, but I also think the guidelines don't directly predict what should happen in any given language, since they say "...quantifiers (words like many, few, several), which are included among determiners in some languages".

So I think the only problem here is that the universal guidelines are referring to specific English lexical items, but I don't think the intention is to regulate English guidelines in the universal ones, no? In other words, we could say "quantifiers with senses like 'many, several' may be DET in some languages", and we can decide that these specific items in English are actually ADJ, due to the various tests we've been referencing (not to mention maintaining consistency with PTB if there's no overwhelming need to diverge).

Stormur commented 3 years ago

I think on the contrary that many, few and so on are clear examples of DET. The key point is that they are functional words not dissimilar from that, both and so on. They just determine an indefinite (hence, I think, PronType=Ind) quantity (hence I think they would receive NumType=Card). There's nothing else to them. The possibility of repetition and the distribution with respect to other elements like very and so on are probably determined by semantic factors: there's really no way that some can have a higher degree (but can we not say only some?), it is "sparse" by its own nature; whereas even an indefinite quantity lends itself well to grading: the same way I can say 10, 1 000, 1 000 000, I can grade the intensity of a many.

I really dislike this page on DET. I think that POS should be defined as distributional classes and not as semantic classes. I agree with you that many and few are distributionally adjectives. But if you want to tag them ADJ you must completely change the current definition of DET on the UD page and you must verify that the owners of the 180 UD treebanks agree.

But the problem is that the distinction between ADJ and DET is and needs to be mostly semantic, and personally I have actually been finding it clearer and clearer the more I have been annotating. You can see both as a same class: then, the lexical ones therein that we want as heads are labeled ADJ, the functional ones with only deictic qualities or similar that will depend as leaves are DETs. Of course everything has a meaning, but there is still a difference between these two facets. And naturally some border cases can occur, especially when a lexical ADJ slowly becomes grammaticalised as a mere determiner, where uncertainty may arise. But I think this is not the case for many or few.

The problem is that a pure distributional approach to ADJ/DET/... labelling would retrieve lots more classes, based on small, rather arbitrary variations, like the occurrence with a specific adverbial modifier, and so on. Because in the end ADJs and DETs as a whole are substantially distributed the same way. Another problem of a purely distributional approach is that it risks becoming purely contextual: for example, I do not think it makes sense labelling an adjective as a noun only because it appears without head noun, if it has not been fixed into a particular sense (e.g. it is always meant with a particular gender, and so on).

sylvainkahane commented 3 years ago

In some languages, we have quite clear distributional classes. In French, a common noun cannot appear as a subject without some particular elements (contrary to English where a plural noun can appear alone). It is such particular elements that I would call determiners and I would like to tag DET. I don't need any semantic criterion to do that.

More exactly there are four distributional classes in French:

The distributional class of quasi-determiners is the most tricky class. Except for the numerals, it includes very few elements: quelques 'some/few', différent 'various' and divers 'various'. These elements are sometimes tagged ADJ when they are preceded by a definite determiner, but they are mostly tagged DET in the UD treebanks. (Note that différent can also be used post-nominally with a different meaning: ces différentes tables 'these various tables', ces tables différentes 'these different tables (different from one another)'.)

Anyway, in French we have a quite clear distributional criterion to cut between DET and ADJ: a DET is required and cannot be erased. This criterion must be refined for quasi-determiners to decide whether quelques is DET or ADJ in ces quelques tables: If we consider that both ces and quelques can be suppressed, but not both of them simultaneously, both ces and quelques must be tagged DET in ces quelques tables. Not a word of semantics is needed!

amir-zeldes commented 3 years ago

I think the situation in English is fairly similar, even if there are bare plurals in the language: determiners are special in appearing before adjectives (green the grass), being unrepeatable (green green grass/the the grass) and of course they are not modifiable by the typical intensifiers (so/very). I think words like "many, few" are less prototypical as adjectives and don't pass every ADJ test, but they certainly don't pass all DET tests either, so I think there is some room to apply judgment here, and the criteria should be morphosyntactic, not semantic. For "few" in particular, the presence of a morphological comparative "fewer" should be added to the list of indications that it is better classified as an ADJ than a DET.

Because there are many more morphosyntactic categories than UPOS labels, I think it's normal for us to lump some things together that don't pass each and every test in the same way, and some might argue that almost every word and construction is a sui generis at some level of resolution (e.g. Bill Croft's Radical Construction Grammar, which argues very strongly that POS categories don't really generalize, and definitely not across languages).

nschneid commented 3 years ago

Right, some language-specific criteria are inevitable because categories are never clear-cut. I assume that the borderline cases we've discussed are evidence of grammaticalization. Probably "few" started out as a normal adjective and has acquired some properties of determiners. I wonder if there was ever a "manyer".

amir-zeldes commented 3 years ago

I wonder if there was ever a "manyer"

No, AFAIK there wasn't, though one could say that "more" is a suppletive comparative to "many". Etymologically this root has a long history of both adjectival and determiner uses: in Russian it produces what looks like an adjective morphologically "mnogij" 'many', while in modern German, the cognate "manch(e)" 'some' is more unambiguously a determiner, with xpos "PIAT" for "pronoun, indefinite, attributive". I was curious what they did with that for upos and discovered that while it's universally det, it is upos PRON in GSD, but upos DET in HDT!

So it looks like these items create headaches in multiple languages, but I definitely think the decision should be made per language, and ideally based on morphosyntax, not semantics. Even if we don't think "more" supplies the comparative, "fewer/fewest" does exist, and I think the evidence for ADJ is stronger (plus it's the status quo, so I feel like there should be a strong reason to change this)

dan-zeman commented 3 years ago

is the idea that every DET token should have a PronType?

Yes.
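
The requirement that every DET token carry a PronType can be checked mechanically on CoNLL-U data. The following is only a minimal sketch, not the official UD validator: the function name `det_without_prontype` and the two-token sample are invented for illustration, and the sample annotation is hypothetical rather than taken from any treebank.

```python
def det_without_prontype(conllu_text):
    """Return (form, feats) pairs for DET tokens whose FEATS lack PronType."""
    missing = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Keep only regular token lines: 10 columns and a plain integer ID.
        # This skips comments, blank lines, multiword ranges (1-2) and
        # empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        form, upos, feats = cols[1], cols[3], cols[5]
        names = set() if feats == "_" else {f.split("=")[0] for f in feats.split("|")}
        if upos == "DET" and "PronType" not in names:
            missing.append((form, feats))
    return missing


# Hypothetical fragment, invented for illustration only.
SAMPLE = "\n".join([
    "# text = the many",
    "1\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_",
    "2\tmany\tmany\tDET\tJJ\tDegree=Pos\t0\troot\t_\t_",
])

print(det_without_prontype(SAMPLE))
```

On this sample the only token reported is "many", because its invented FEATS value has Degree but no PronType.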

dan-zeman commented 3 years ago

the universal guidelines are referring to specific English lexical items, but I don't think the intention is to regulate English guidelines in the universal ones

Yes, basically. I think I personally like the English words many, few etc. being something else than ADJ (which would lead to DET), but I don't think it is a requirement. The English-specific guidelines can say that these words are not DET but then the universal guidelines should warn about it and contrast them with examples from other languages where the classification of lexically equivalent words is different. For example, Czech mnoho “many/much” is definitely not an adjective. The Czech grammar classifies it as an indefinite numeral but we cannot use NUM in UD because that UPOS tag is specifically reserved for definite quantities. It is morphologically and distributionally close to numerals, with the exception that it has irregular comparative více “more” and superlative nejvíce “most”; these are homonymous with adverbs (but gradation of quantity is not the same thing as gradation of quality). They ended up as DET in UD, although otherwise DET is used for traditional pronouns that behave like adjectives... and as I said, these words don't. Interestingly, Czech also has the related word mnohý (cognate with Russian mnogij, mentioned by Amir above), and this one does behave like a regular adjective in most respects, except that it does not have a comparative and a superlative form. This word will actually be tagged ADJ in the Czech UD data.

sylvainkahane commented 3 years ago

@dan-zeman You said that Czech mnoho “many/much” is definitely not an adjective. What would be the criteria you use to distinguish ADJ and DET in Czech? Can you avoid a semantic criterion?

dan-zeman commented 3 years ago

@sylvainkahane The word mnoho (and a few similar words) is a special case. Its morphology and distribution differ significantly from ADJ, as well as from most other DET.

Otherwise, this is an interesting question. One thing is what we really do, the other thing is whether we can redefine the criteria along the lines you suggest. What we really do in Czech UD is we take the words that the traditional grammar classifies as pronouns and then ask whether they have adjective-like morphology. If they do, they are classified DET in UD. So the borderline between DET and ADJ is inherited from the traditional grammar and I cannot remember whether I saw a general enough explanation how the traditional grammar got it (the pronouns are often simply enumerated; I remember I had to memorize the list in elementary school).

There are certainly some specifics in the distribution of the words that we tag as DET, when contrasted with ADJ. The DET can modify a nominal (like adjectives) but some of them are more readily than adjectives used standalone, replacing a nominal (which is why they are a subclass of pronouns). DET can freely co-occur with an ADJ in the same nominal, but it is less common to have two DETs in the same nominal (still, many such combinations are perfectly grammatical). If a NOUN is modified by a DET and an ADJ, the DET will precede the ADJ. Unfortunately, all these are tendencies rather than strict tests.

Some of the subclasses (PronType) of DET are defined semantically (e.g., possessives or negatives), others have special syntactic properties as well (e.g., interrogatives and relatives). The most vague subclass seems to be the indefinites, at least when trying to compare its members with classification of their counterparts in other languages. For example, the word jiný “other” is an adjective in Czech, but I have seen its counterparts in other languages tagged as PRON or DET and it makes sense to me, but I suppose the intuition is again based on semantics (more specifically on the way other refers to an entity rather than saying anything concrete about it).

Stormur commented 3 years ago

@nschneid :

Right, some language-specific criteria are inevitable because categories are never clear-cut. I assume that the borderline cases we've discussed are evidence of grammaticalization. Probably "few" started out as a normal adjective and has acquired some properties of determiners. I wonder if there was ever a "manyer".

As it has already been pointed out, there are indeed more and most (and in Latin we have the equivalent multus, plus and plurimus series)!

Even if they were not etymologically related to many, they still show that degree (in this case in a quantitative sense) can be associated with elements we call determiners, albeit marginally because of the very nature of determiners (i.e. deictic features cannot really be graded). Conversely, we can ask how far degree can be associated with undisputed adjectives like wooden or wonderful: how acceptable are more wooden or most wonderful, and how much does their acceptability depend on (a very specific) context and their meaning?

While we observe degree as one of the prototypical features of what we call adjectives, even when its expression is morphologically engrained in the language and thus more or less mechanically applicable in a grammatical way to the members of a given class, we might wonder how really sensible are things like Lat. petrinior (Cmp) and petrinissimus (Abs), from petrinus 'made of stone'. While conversely we observe that forms like ipsissimus 'the very oneself (and no one else)' (Abs) have developed from the determiner (traditionally "pronoun", indeed historically veering towards a pure 3rd person pronoun) ipse '(one)self, same'. This is just to argue about the decisiveness of single features: there is a permeability between classes which seems to be semantically, and not merely distributionally, determined.

@sylvainkahane :

[...] while in modern German, the cognate "manch(e)" 'some' is more unambiguously a determiner, with xpos "PIAT" for "pronoun, indefinite, attributive". I was curious what they did with that for upos and discovered that while it's universally det, it is upos PRON in GSD, but upos DET in HDT!

@dan-zeman :

The most vague subclass seems to be the indefinites, at least when trying to compare its members with classification of their counterparts in other languages. For example, the word jiný “other” is an adjective in Czech, but I have seen its counterparts in other languages tagged as PRON or DET and it makes sense to me, but I suppose the intuition is again based on semantics (more specifically on the way other refers to an entity rather than saying anything concrete about it).

I am pretty sure the cases of manch and others follow the typical oscillations we observe in different annotations, and that it is indeed a case of application of purely distributional criteria:

But the problem is that, in my opinion (more and more reinforced after long runs of annotation and database searches) a similar approach risks being too mechanical and as a consequence, more importantly, is not at all informative. We really do not gain any insight apart from stating the obvious: in some cases we will find it depend as a modifier, in other cases it won't. But this is already shown by the syntactic level.

If we were to pursue this approach to what may be seen by many as an absurdum, but is just a logical consequence, then we should also label many traditionally undisputed adjectives as (pro)nouns, especially in languages which allow them to stand as heads. So:

I don't know how pursuable this approach is. For example, from a research perspective, this clouds the detection of the so-called substantival use of an adjective, which would correspond to a general query like "POS ADJ, nominal relation (nsubj, obj,...)". One could argue that one might directly search e.g. for "lemma grande, POS NOUN", but this is circular and undesirable, because: A) I have to know already the lemmas I want to look for, whereas I might be interested in the general phenomenon; B) once I have obtained some occurrences, there is nothing that distinguishes a list of such forms from "true" NOUNs, but there is a difference.

The point is that this difference cannot be detected exclusively by means of morphological and distributional features. Please don't get me wrong, I am all for using them and am not advocating for a pure semantical approach; but they cannot be the only ones, else we would end up with many more word classes (for example, the aforementioned wooden and wonderful should not be counted the same as great or hot...) and non-informative labels, because in the end these would tell nothing more than it is already represented by syntax. I think that morphological and, to a lesser extent, distributional features are helpful but just a part of the puzzle, a kind of emanation of abstract word classes which can assist us in telling whether we are on the right way or not.

Let's just look at adjectives and nouns in Latin: they substantially share identical morphological and distributional patterns. They're very often indistinguishable from these points of view: other "impartial" criteria have to be taken into account. @amir-zeldes cited Croft's Radical construction grammar, and I think that it is indeed a very useful reference and source of inspiration. But I would like to point out that in RCG it is still argued that universal word classes do exist, and the author goes on to define some of them; what he dismisses is the application of misleading language-specific terminologies in a typological study and the possibility to apply the same (morphological, distributional...) criteria to all languages - but there are universal word classes (for which we might use traditional labels like "adjective", "noun"... after we agree on a universal way to identify them).


Going back to the case of few and many, from a very, very pragmatic point of view I would argue that just the fact of associating them with PronType=Ind is very strong evidence in favour of the label DET (because it is a sign that their meaning is functional rather than lexical).

Stormur commented 3 years ago

Anyway, in French we have a quite clear distributional criterion to cut between DET and ADJ: a DET is required and cannot be erased. This criterion must be refined for quasi-determiners to decide whether quelques is DET or ADJ in ces quelques tables: If we consider that both ces and quelques can be suppressed, but not both of them simultaneously, both ces and quelques must be tagged DET in ces quelques tables. Not a word of semantics is needed!

I agree that this is very clear-cut and identifiable for French. But I mean that a problem arises when we are widening the perspective to other languages: already with the closely related Italian, this criterion does not work anymore. And more so, these criteria overlap even less with those in English. So in the end we have to resort to something different when we want to define a universal DET class, while more "mechanical" criteria might be fine inside one specific language, when a link has been established between them and such a universal word class. And, as shown by various examples in this discussion, more often than not there is a remarkable morphodistributional variation internal to (traditional) word classes even inside single languages, so that such "third party" criteria can help. And it is according to them that I see many and few as very clear specimens of UD's DET class, despite some idiosyncrasies they might show inside the English system.

nschneid commented 3 years ago

Analysis of ADJ, DET in core argument positions (EWT)

I decided to look at the distribution of words tagged as ADJ or DET that serve as subjects or objects in EWT.

The following ADJ lemmas account for 195/275 of the matches:

The remaining 80 ADJ tokens include attribute-based category references like the poor and the British; verbs which select for an adjective argument (or an adjective is mentioned/quoted: MAD means crazy); idioms like the inevitable, the obvious, the unexpected, an original (copy); ordinals; superlatives; and annotation errors.

The following DET lemmas account for 206/230 matches:

(Most of the remaining 24 matches are annotation errors.)

TL;DR There is a fairly clear closed set of modifiers with quantity semantics (the ones bolded above) that can stand alone as well as occurring prenominally, in contrast to most lexical adjectives. So I don't think it would be crazy to lump them under DET on that basis.
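
A count of this kind can be approximated with a few lines over a CoNLL-U file. The sketch below is not the query actually used for the numbers above: the function name `core_argument_lemmas`, the choice of nsubj/obj/iobj (with subtypes such as nsubj:pass folded in) as "core argument positions", and the two sample sentences are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: "subjects or objects" is approximated by these relations,
# ignoring any subtype after the colon (e.g. nsubj:pass counts as nsubj).
CORE_ARGS = {"nsubj", "obj", "iobj"}


def core_argument_lemmas(conllu_text, upos_tags=("ADJ", "DET")):
    """Count lemmas per UPOS tag among tokens attached as core arguments."""
    counts = {tag: Counter() for tag in upos_tags}
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, multiword ranges and empty nodes.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        lemma, upos, deprel = cols[2], cols[3], cols[7]
        if upos in counts and deprel.split(":")[0] in CORE_ARGS:
            counts[upos][lemma] += 1
    return counts


# Hypothetical annotations, invented for illustration only.
SAMPLE = "\n".join([
    "# text = Many do that",
    "1\tMany\tmany\tADJ\tJJ\tDegree=Pos\t2\tnsubj\t_\t_",
    "2\tdo\tdo\tVERB\tVBP\t_\t0\troot\t_\t_",
    "3\tthat\tthat\tPRON\tDT\t_\t2\tobj\t_\t_",
    "",
    "# text = Some do that",
    "1\tSome\tsome\tDET\tDT\t_\t2\tnsubj\t_\t_",
    "2\tdo\tdo\tVERB\tVBP\t_\t0\troot\t_\t_",
    "3\tthat\tthat\tPRON\tDT\t_\t2\tobj\t_\t_",
])

for tag, counter in core_argument_lemmas(SAMPLE).items():
    print(tag, counter.most_common())
```

Run over a real treebank file, `most_common()` would give a ranked lemma list per tag that can be compared against the lists discussed in this comment.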

amir-zeldes commented 3 years ago

For the ADJ cases, I think "the latter" is syntactically doing the same thing as "the poor" - it's an ADJ standing in as the head of an NP, so it can take regular argstr functions. Since we retain ADJ for "the poor" and since they are all tagged xpos=JJ.?, I would just leave them as ADJ - I don't think there's any principle preventing ADJ from serving an argstr role and I think in a lot of languages it's quite common (it's just English that really likes sticking a "one" as the head of the NP)

nschneid commented 11 months ago

Revisiting this thread: is anybody satisfied with the universal DET guidelines vis-a-vis the boundary with ADJ? Whatever the policy for "many" and "few" (currently the status quo of ADJ is winning), I suspect the DET guidelines should explain that specific tests need to be developed on a language-specific basis and that modifiability by an intensifier, for example, could weigh in favor of ADJ.

Regarding English specifically, it says "Determiners under this definition include both articles and pro-adjectives (pronominal adjectives), which is a slightly broader sense than what is usually regarded as determiners in English." It's hard to square this statement with what we are actually doing in English, which is following PTB's narrower conception.

jnivre commented 11 months ago

I think the guidelines are pretty okay. This is a thorny area with lots of borderline cases, and every language has its special quirks. The fact that “many” and “few” can take the intensifier “very” makes them special, but with respect to their distribution otherwise they seem very similar to words like “some” and “all”:

  1. a. Some students do that. b. Some of the students do that. c. Some do that.

  2. a. Many students do that. b. Many of the students do that. c. Many do that.

I think it would be natural to analyze these constructions in a parallel fashion (with “nmod” for the “of”-construction as discussed yesterday). I also think it would be natural if “some”/“many” was tagged DET in the a-examples but PRON in the b- and c-examples, as in many other languages, but I realize that it is another PTB-inherited quirk that “some” is currently tagged DET in all three cases.

In Bill Croft’s taxonomy of constructions, these are all selective modifiers (as opposed to subcategorizing modifiers, which is the canonical type for adjectives).

LarsAhrenberg commented 11 months ago

To @nschneid's question: I think the guidelines on DET, and actually ADJ also, lack examples of how to distinguish the two. Moreover, I find that the sentence Unlike in UD v1 it is no longer required that they are told apart solely on the base of the context. The words can be pre-classified in the dictionary as either PRON or DET, based on their typical syntactic distribution (and morphology, when applicable). could be extended with examples of what is regarded as 'typical syntactic distribution'.

As for DET vs PRON I agree with @jnivre. Although "it is no longer required that they are told apart solely on the base of the context." it is still allowed, isn't it?

As for DET vs ADJ it is typical for adjectives to allow adverbial modification but non-typical to take part in partitive constructions:

  1. a. Clever students do that. b. Clever of the students do that. c. Clever do that.

So it would be good to extend the guidelines with more examples and perhaps a recommendation on how to decide on typical syntactic distributions.

dan-zeman commented 11 months ago

The sentence about "typical syntactic distribution" on that page is about the borderline between DET and PRON, not between DET and ADJ. There are some example rules/tests that can be used for it but there are not many example words from concrete languages. I could add some.

Although "it is no longer required that they are told apart solely on the base of the context." it is still allowed, isn't it?

I believe it is allowed, although I prefer to avoid it in my data. But it can hardly be banned – both because some people/datasets want to make the context-based distinction and because a ban probably still could not be absolute ("same string cannot be annotated sometimes DET and sometimes PRON"), it would have to define how precisely we recognize ambiguities which should be exempt from the ban, so it would be difficult to formulate.

Regarding examples – I am all for adding examples from as many (and diverse) languages as we can collect. I expect quite an interesting mix :-) I could start with what I said about Czech in my post above from 2 years ago: mnoho "many" is not an adjective because it cannot show agreement with a noun. It even requires the counted noun to be in the genitive form (as if the partitive, "many of X", was the only option). On the other hand, it has (irregular) comparative and superlative forms více "more" and nejvíce "most", which makes it closer to adverbs and adjectives, while most determiners cannot be compared. The comparative and superlative forms are actually ambiguous because the same forms are also used as the comparative/superlative of the adverb velmi "very"; but velmi cannot denote quantity, and mnoho is unlikely to occur as a degree adverb. The school grammar classifies it as an indefinite cardinal numeral but it cannot be NUM in UD because it is indefinite. As an indefinite quantifier, the UD guidelines place it in the DET category. It is not perfect and the word differs in many aspects from the other Czech words that are tagged DET, but it is the closest match I can see. Interestingly, there is also a related word mnohý "numerous", which does behave like an adjective and the school grammar says it is an adjective. The Czech treebanks follow the school grammar and tag it ADJ. But it is an indefinite quantifier, similar to některý "some", which we tag DET. So I'm starting to think it should be DET, too.

nschneid commented 11 months ago

Shall we make a table of quantifiers in various languages and their properties?

| Language | Lexemes | Substitutive | Partitive | Gradable | Degree Mod | Post-Article | Pred | Quantity Agreement | Tag |
|---|---|---|---|---|---|---|---|---|---|
| English | all, both, some, none | + | + | - | ~ (almost all/none) | - | ~ (That is all) | N/A | DET |
| English | much, many, more | + | + | - | very much/many, much more | the many rooms | ~ (The exceptions/?books are many.) | N/A | ? |
| English | few | + | + | fewer, fewest | very few | the few rooms | ~ (The exceptions/?books are few.) | N/A | ? |
| English | numerous | (-) | - | more/most numerous | very numerous | the numerous rooms | + | N/A | ADJ |

sylvainkahane commented 11 months ago

In French, we have a small class of lexemes which are also between DET and ADJ. Unlike English, French has an indefinite plural article des which is compulsory if there is no other determiner (in many syntactic positions, including the subject position): des beaux chevaux 'beautiful horses'. But this article cannot cooccur with numerals (deux chevaux 'two horses'), nor with a very small class of lexemes: quelques 'some', différents 'different', and diverses 'various'. But numerals, as well as these lexemes, can cooccur with the definite determiners (les 'the', ces 'these', and the possessives, such as mes 'my'). We don't know whether these lexemes must be tagged DET or ADJ, and the traditional (and unsatisfying) analysis is to tag them DET in positions where a determiner is required and absent, and ADJ elsewhere: see https://universal.grew.fr/?custom=656b3ba50d52c. I don't have a clearly better solution. In fact, they belong to the distributional class of numerals, but I suppose that it would not be UD-acceptable to put them in the NUM class. Any suggestion is welcome. Here is a paper in French I wrote a long time ago on this.

nschneid commented 11 months ago

Let me see if I understand: a small class of words like différent can follow a definite article (les), but alternate with the indefinite plural article des that normally appears where there is no other determiner?

Can these words be modified when there is no separate determiner, e.g. Très différent chevaux sont...?

My interpretation of UPOS is that it tries to be fairly lexical, i.e. if a word has a primary function and a related/extended secondary function without a sharp change in meaning, we generally apply the tag that is suggested by the primary function, and allow the deprel to distinguish the two functions. So the solution here MIGHT be to say that différent is always an ADJ but it can be either amod or det depending on whether it is preceded by an article. This would mean allowing a small class of ADJs to serve as det when there is no other determiner. (Does the validator allow this? It would parallel what we do for case, e.g. VERBs like "given" can serve as case in English.)

Another option would be to make an exception to the general rule about where a determiner is required. (With current annotations, I see several cases where a plural noun functioning as subject or object lacks a det/nummod/case. Some are names or metalinguistic mentions. My French is not good enough to explain why some of the others have no determiner.)
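
To make the first option concrete, here is a minimal sketch of the rule: the word keeps UPOS ADJ everywhere, and only the deprel varies between det and amod. Everything in it is an assumption for illustration. The lexical class `DET_LIKE_ADJ_LEMMAS`, the function `assign_deprels`, the helper `make`, and the two toy sentences are hypothetical and are not taken from any treebank or from the validator. Tokens are plain dicts whose keys mirror the CoNLL-U columns ID, FORM, LEMMA, UPOS, HEAD, and DEPREL.

```python
# Hypothetical lexical class for illustration only; not an official UD list.
DET_LIKE_ADJ_LEMMAS = {"différent", "divers"}


def assign_deprels(tokens):
    """Return a copy of the sentence with det/amod chosen for the lexical class.

    A pre-nominal ADJ from the class is attached as det if its head has no
    other det dependent, and as amod otherwise. UPOS is never changed.
    """
    result = [dict(tok) for tok in tokens]  # do not mutate the input
    for tok in result:
        in_class = tok["upos"] == "ADJ" and tok["lemma"] in DET_LIKE_ADJ_LEMMAS
        prenominal = tok["id"] < tok["head"]
        if not (in_class and prenominal):
            continue
        has_other_det = any(
            other is not tok
            and other["head"] == tok["head"]
            and other["deprel"].split(":")[0] == "det"
            for other in result
        )
        tok["deprel"] = "amod" if has_other_det else "det"
    return result


def make(rows):
    """Build token dicts from (id, form, lemma, upos, head, deprel) tuples."""
    keys = ("id", "form", "lemma", "upos", "head", "deprel")
    return [dict(zip(keys, row)) for row in rows]


# Toy sentences (invented): "Différents chevaux courent" / "Les différents chevaux courent"
bare = make([
    (1, "Différents", "différent", "ADJ", 2, "amod"),
    (2, "chevaux", "cheval", "NOUN", 3, "nsubj"),
    (3, "courent", "courir", "VERB", 0, "root"),
])
with_article = make([
    (1, "Les", "le", "DET", 3, "det"),
    (2, "différents", "différent", "ADJ", 3, "amod"),
    (3, "chevaux", "cheval", "NOUN", 4, "nsubj"),
    (4, "courent", "courir", "VERB", 0, "root"),
])

for sent in (bare, with_article):
    print([(t["form"], t["upos"], t["deprel"]) for t in assign_deprels(sent)])
```

In the first sentence the adjective ends up as ADJ with deprel det; in the second it stays ADJ with deprel amod because les already fills the determiner slot. Whether the validator accepts an ADJ attached as det is exactly the open question above, so this only illustrates the annotation pattern, not a tested policy.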

sylvainkahane commented 11 months ago

Let me see if I understand: a small class of words like différent can follow a definite article (les), but alternate with the indefinite plural article des that normally appears where there is no other determiner?

Yes.

Can these words be modified when there is no separate determiner, e.g. Très différent chevaux sont...?

No. By the way, I made a translation mistake: différents, when it is pre-nominal, is better translated as 'various'. And it can only be plural. There is also a post-nominal différent, which can be singular or plural, meaning 'different'.

My interpretation of UPOS is that it tries to be fairly lexical, i.e. if a word has a primary function and a related/extended secondary function without a sharp change in meaning, we generally apply the tag that is suggested by the primary function, and allow the deprel to distinguish the two functions. So the solution here MIGHT be to say that différent is always an ADJ but it can be either amod or det depending on whether it is preceded by an article. This would mean allowing a small class of ADJs to serve as det when there is no other determiner. (Does the validator allow this? It would parallel what we do for case, e.g. VERBs like "given" can serve as case in English.)

I agree that there is one and only one pre-nominal différents. I am OK with tagging it as ADJ every time. I also have a question about the lemma, which is related to issue #999: does the pre-nominal différents have a lemma différent or différents? French dictionaries only have one entry différent for both senses.

Another option would be to make an exception to the general rule about where a determiner is required. (With current annotations, I see several cases where a plural noun functioning as subject or object lacks a det/nummod/case. Some are names or metalinguistic mentions. My French is not good enough to explain why some of the others have no determiner.)

No, this is not a good idea. There are indeed syntactic constructions where the lack of a determiner is possible (coordinations for instance), but the case of différents is purely lexical.

nschneid commented 11 months ago

does the pre-nominal différents have a lemma différent or différents? French dictionaries only have one entry différent for both senses.

Lemma decisions can be fuzzy, but if it is always tagged ADJ, then I would lean toward saying the lemma should always be différent. It's just that the singular form cannot appear in a particular syntactic context where the plural can.