UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Suggestion: DET table for English #971

Closed AngledLuffa closed 1 year ago

AngledLuffa commented 1 year ago

As discussed here, I suggest we create a DET table with the known determiners and their features for English. That way, we can unify that across treebanks.

https://github.com/UniversalDependencies/UD_English-EWT/issues/416#issuecomment-1685437948

As part of this process, I also suggest including features for words such as another, possibly by inventing a new feature which matches. Currently EWT has no features on another, whereas GUM has PronType=Art.

So, for example, we have

a        Definite=Ind|PronType=Art
an       Definite=Ind|PronType=Art
the      Definite=Def|PronType=Art

this     Number=Sing|PronType=Dem
that     Number=Sing|PronType=Dem
these    Number=Plur|PronType=Dem
those    Number=Plur|PronType=Dem

any      ???  labeled PronType=Ind in GUM
both     ???  labeled PronType=Art in GUM
each     ExtPos=PRON|PronType=Rcp, or PronType=Tot, or nothing
every    ???  labeled PronType=Tot in GUM

all      ???  labeled PronType=Tot in GUM
another  ???  labeled PronType=Art in GUM
some     ???  labeled PronType=Ind in GUM
no       ???  labeled Polarity=Neg|PronType=Art in GUM

either   ???  labeled PronType=Art in GUM
neither  ???  labeled PronType=Art in GUM

such - there is a case where it has the xpos DT instead of PDT, which seems weird
yonder PronType=Art in GUM

furthermore, there are a couple instances of a typo not getting the proper features (in EWT, I believe): his/this, Thi$/this

and then there's PUD which is labeling "those" with the lemma "those" as opposed to "these"

there's also other DETs with the xpos PDT, such as all, quite, half, both, such, nary, many

nschneid commented 1 year ago

Here is a table from Quirk et al. 1985:

[image: table from Quirk et al. 1985]

The ASSERTIVE group, discussed starting on p. 383, also includes many, a few, much, little, less, least, fewer, fewest, numerical one, half, several, enough, other, others, another.

So it seems reasonable to classify these as PronType=Ind unless a more specific feature applies (e.g. PronType=Tot).

I take it for a given word, the PronType feature should be the same regardless of whether it is tagged as DET or PRON?

Stormur commented 1 year ago

For words like another, there exists the feature PronType=Con, for "contrastive", which we are using in Latin. I think it can encompass (n)either (equivalent to Latin (ne)uter, as it were), and maybe some other elements, too.

For both I can suggest a treatment as Tot paired with a NumType. This is what we are doing for the equivalent ambo in Latin: this is also what comes out of the table with the label "count".

I think that the label PronType=Ind is overused at the moment: the fact is that more or less everything pointed to deictically could be seen as "indefinite" in a sense, but this is misleading.

dan-zeman commented 1 year ago

> I take it for a given word, the PronType feature should be the same regardless of whether it is tagged as DET or PRON?

I believe it is generally true, but sometimes in some languages people treat one string as homonymous even within the PRON category, and give it two different values of PronType based on context. (Personally I prefer to avoid this approach unless it's really coincidental string identity of words that also have a different morphological paradigm.) If that happens, then you may want to project the distinction also across the PRON-DET boundary.

nschneid commented 1 year ago

> For both I can suggest a treatment as Tot paired with a NumType. This is what we are doing for the equivalent ambo in Latin: this is also what comes out of the table with the label "count".

NumType=Card because it implies exactly 2? I like that semantically. Currently the only DET receiving a NumType in English is half/NumType=Frac I believe.

> For words like another, there exists the feature PronType=Con, for "contrastive", which we are using in Latin. I think it can encompass (n)either (equivalent to Latin (ne)uter, as it were), and maybe some other elements, too.

Will this be added to the universal documentation as a possible value? If most treebanks don't use it it may be easier to stick with Ind for now than to have to make decisions about what Con means in English.

AngledLuffa commented 1 year ago

Are we any closer to a definitive answer about what to do with another? That word in particular is different between the two biggest English treebanks, and because it's so evenly balanced, the models I train from the two keep switching between whether they give another a feature or not
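A disagreement like the one described above can be surfaced mechanically by tallying the FEATS values seen for each DET lemma in two treebanks. The following is only a minimal sketch: the function names and the two tiny inline samples are hypothetical, built to mirror the `another` mismatch mentioned in this thread rather than copied from EWT or GUM.

```python
from collections import Counter, defaultdict


def det_feature_counts(conllu_text):
    """Map each DET lemma to a Counter of the FEATS strings seen with it."""
    counts = defaultdict(Counter)
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        lemma, upos, feats = cols[2].lower(), cols[3], cols[5]
        if upos == "DET":
            counts[lemma][feats] += 1
    return counts


def disagreements(counts_a, counts_b):
    """Shared lemmas whose sets of observed FEATS differ between treebanks."""
    shared = counts_a.keys() & counts_b.keys()
    return {
        lemma: (set(counts_a[lemma]), set(counts_b[lemma]))
        for lemma in sorted(shared)
        if set(counts_a[lemma]) != set(counts_b[lemma])
    }


def row(idx, form, lemma, upos, xpos, feats):
    """Build one 10-column CoNLL-U token line (illustrative helper)."""
    return "\t".join([str(idx), form, lemma, upos, xpos, feats, "_", "_", "_", "_"])


# Made-up samples, not real corpus rows.
ewt_like = "\n".join([
    row(1, "another", "another", "DET", "DT", "_"),
    row(2, "the", "the", "DET", "DT", "Definite=Def|PronType=Art"),
])
gum_like = "\n".join([
    row(1, "another", "another", "DET", "DT", "PronType=Art"),
    row(2, "the", "the", "DET", "DT", "Definite=Def|PronType=Art"),
])

diff = disagreements(det_feature_counts(ewt_like), det_feature_counts(gum_like))
print(diff)
```

Running the same two functions over the real `.conllu` files of each treebank would list every DET lemma that needs harmonizing.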

nschneid commented 1 year ago

It appears that PronType=Con is only implemented for Latin: https://universaldependencies.org/ext-feat-index.html#prontype

I'd rather not start using it for English unless other languages are planning to start using it. My familiarity with other languages is limited but it appears that some French treebanks use PronType=Ind for "autre" ('other'): https://universal.grew.fr/?custom=650642a6bf400 I would go with that for now in English and if there's a wider decision to adopt Con it would be easy enough to revise as it's lexical.

AngledLuffa commented 1 year ago

Makes sense. @amir-zeldes ?

amir-zeldes commented 1 year ago

Fine by me, I can add it in GUM/GENTLE. So just PronType=Ind for lemma="another" right? Any other changes needed in GUM?

AngledLuffa commented 1 year ago

Several others marked with ??? above are labeled differently between GUM and EWT: any, both, every, no, some, all, either, neither

but if we're happy with the current labels in GUM, then I suppose the change needed would be to add those features to EWT, not alter them in GUM

nschneid commented 1 year ago

Art should just be a/an/the. The rest should be Tot, Ind, or Rcp I think. Plus PronType=Neg or NumType=Card as appropriate.


nschneid commented 1 year ago

How about the following as the features for DET (xpos DT or PDT or WDT):

Lexemes                       PronType      Other feats
a, an                         Art           Definite=Ind
the                           Art           Definite=Def
this, that                    Dem           Number=Sing
these, those                  Dem           Number=Plur
yonder                        Dem
all, each**, every            Tot
both                          Tot
any, some, another, either    Ind
such*, quite*, many*          Ind
half*                         Ind           NumType=Frac
no, neither, nary*            Neg
which, what, whatever         Int or Rel

* Only DET as a predeterminer (N.B. I am not thrilled with the current practice of tagging "such", "quite", and "many" as DET in their predeterminer uses as it seems like unnecessary multiplication of tag ambiguity, but it follows from treating every PDT as DET: UniversalDependencies/UD_English-EWT#412)
** Except reciprocal each other: see PRON
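For anyone who wants to check a treebank against the table above, here is a rough sketch of it as a Python lookup with a small checker. The names `DET_TABLE`, `parse_feats`, and `check_det` are made up for illustration, and the footnote conditions (predeterminer-only uses, reciprocal each other) are deliberately not modeled.

```python
def _rows(lemmas, pron_types, **other_feats):
    """Expand one table row into per-lemma entries."""
    return {lemma: (set(pron_types), other_feats) for lemma in lemmas}


# lemma -> (allowed PronType values, other required features)
DET_TABLE = {
    **_rows(["a", "an"], ["Art"], Definite="Ind"),
    **_rows(["the"], ["Art"], Definite="Def"),
    **_rows(["this", "that"], ["Dem"], Number="Sing"),
    **_rows(["these", "those"], ["Dem"], Number="Plur"),
    **_rows(["yonder"], ["Dem"]),
    **_rows(["all", "each", "every", "both"], ["Tot"]),
    **_rows(["any", "some", "another", "either"], ["Ind"]),
    **_rows(["such", "quite", "many"], ["Ind"]),
    **_rows(["half"], ["Ind"], NumType="Frac"),
    **_rows(["no", "neither", "nary"], ["Neg"]),
    **_rows(["which", "what", "whatever"], ["Int", "Rel"]),
}


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'A=x|B=y' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_det(lemma, feats):
    """Return a list of mismatches between a DET token and the table."""
    entry = DET_TABLE.get(lemma.lower())
    if entry is None:
        return [f"lemma '{lemma}' is not in the table"]
    allowed_types, other_feats = entry
    actual = parse_feats(feats)
    problems = []
    if actual.get("PronType") not in allowed_types:
        expected = " or ".join(sorted(allowed_types))
        problems.append(
            f"{lemma}: PronType={actual.get('PronType')}, expected {expected}"
        )
    for name, value in other_feats.items():
        if actual.get(name) != value:
            problems.append(
                f"{lemma}: {name}={actual.get(name)}, expected {value}"
            )
    return problems


print(check_det("another", "PronType=Ind"))          # no mismatches
print(check_det("no", "Polarity=Neg|PronType=Art"))  # one PronType mismatch
```

The checker only flags missing or different values; extra features on a token (such as Polarity=Neg) are ignored, which may or may not be what you want.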

> such - there is a case where it has the xpos DT instead of PDT, which seems weird

@amir-zeldes I think this one is an error - should be JJ/ADJ

bguil commented 1 year ago

I've written a Grew-match request which "implements" @nschneid's table. It might help to have a global view of the DET that do not follow the table. For instance:

nschneid commented 1 year ago

Thanks @bguil. Looking at the queries made me realize there were issues with the last 2 rows of the table.

nschneid commented 1 year ago

Because no indicates a negation of all items in a set, should it be PronType=Neg,Tot? Or would that be confusing?

nschneid commented 1 year ago

Actually, I've changed my mind: let's go with plain PronType=Neg for no, neither, nary. Simpler is better I think (though an argument could be made that Tot and Neg are subtypes of Ind).

dan-zeman commented 1 year ago

> Because no indicates a negation of all items in a set, should it be PronType=Neg,Tot? Or would that be confusing?

That would be confusing. Neg is for no, Tot is for all.

Stormur commented 1 year ago

I find PronType=Neg extremely problematic, if not misguided. I remember staring for a long time at the guidelines and comparing it to Polarity=Neg without understanding the rationale. Because what I observe is that all "negative pronouns" are pronouns/etc. of some kind with negative polarity: if we use PronType=Neg, we obscure the type. I would suggest ditching PronType=Neg and just using Polarity=Neg. So we will have indefinite negative pronouns/etc. as nobody = not any, contrastive negative pronouns/etc. as Lat. neuter = ne+uter "neither", and so on.

> > For both I can suggest a treatment as Tot paired with a NumType. This is what we are doing for the equivalent ambo in Latin: this is also what comes out of the table with the label "count".

> NumType=Card because it implies exactly 2? I like that semantically. Currently the only DET receiving a NumType in English is half/NumType=Frac I believe.

NumType=Card because it involves a specific (discrete) quantity, and then we also add NumValue=2. But it is clear that ambo/both are not numerals in the sense duo/two are. Then there are also indefinite cardinal quantities like many ("count" in the table), which also should get NumType=Card, but PronType=Ind. Then, something that we find in English and other languages, but not e.g. in Latin, are terms for non-discrete quantities ("non-count" in the table), for which probably we need a new value like NumType=Quant, or else.

> > For words like another, there exists the feature PronType=Con, for "contrastive", which we are using in Latin. I think it can encompass (n)either (equivalent to Latin (ne)uter, as it were), and maybe some other elements, too.

> Will this be added to the universal documentation as a possible value? If most treebanks don't use it it may be easier to stick with Ind for now than to have to make decisions about what Con means in English.

Why not start using Con anyway? Hoping that it will be "promoted" to universal status as I think it has a place there.

nschneid commented 1 year ago

At the end of the day, at least for English, the Ind/Tot/Neg distinction strikes me as more semantic than morphological. (There is nothing in common between the forms all, each, and every that suggests they belong together.) But the features are not capturing the full semantics: all and every are not interchangeable, yet they have the same features—there doesn't seem to be a goal of developing fine-grained features such that every DET or PRON with a different meaning receives a different combination of features.

From my perspective the best we can do now is converge on a table for English that aligns with https://universaldependencies.org/u/feat/PronType.html. A finer-grained universal theory of these features might be worth developing in the future, but I don't see English morphology as providing much guidance beyond what is already in the universal guidelines (the main opportunity I do see would be to add a feature to group together the -ever items).

Stormur commented 1 year ago

> At the end of the day, at least for English, the Ind/Tot/Neg distinction strikes me as more semantic than morphological.

Well, of course it is. All PronTypes are by nature!

> But the features are not capturing the full semantics: all and every are not interchangeable, yet they have the same features—there doesn't seem to be a goal of developing fine-grained features such that every DET or PRON with a different meaning receives a different combination of features.

Probably we will always have some kind of indeterminacy: else, at one extreme, we should have a specific tag for each form. But probably we can already do much with what we already have: for example, I was considering that every might get a NumType=Dist 🤔 I wonder how much one can proceed in this direction


With regard to another, has a multiword treatment as an+other never been considered? It seems rather straightforward, it is just a contrastive element with a mark for indefiniteness (an). Maybe this would avoid being forced to choose a single label which will always be a "short blanket".

nschneid commented 1 year ago

> With regard to another, has a multiword treatment as an+other never been considered? It seems rather straightforward, it is just a contrastive element with a mark for indefiniteness (an).

Historically yes, synchronically no. It would be an error to say/write "I have an other idea". Note the syllabification of /əˈnʌð.ɚ/ (not /ən ˈʌð.ɚ/).

Stormur commented 1 year ago

> > With regard to another, has a multiword treatment as an+other never been considered? It seems rather straightforward, it is just a contrastive element with a mark for indefiniteness (an).

> Historically yes, synchronically no. It would be an error to say/write "I have an other idea". Note the syllabification of /əˈnʌð.ɚ/ (not /ən ˈʌð.ɚ/).

The orthography might mirror the phonetic coalescence of the two elements, but couldn't they still be considered "syntactically active"? Besides, how semantically different is another from an+other?

Now this might be a Pindaric flight, but if I think of Italian un altro / un'altra, it is written separately though phonetically it is just one unit, and there is absolutely no way to distinguish any different use as supposedly an other vs. another. Also, please correct me if I am wrong, but is it at all possible to use an other instead of the univerbation?

My point is that we are probably (once again) misled by orthographic conventions here.

nschneid commented 1 year ago

I'm not sure how to test for a notion of "syntactically active" but my sense is that "another" is extremely well established as a word of English. It is listed in dictionaries (and not as a spelling variant of "an other"). As far as I am aware there is no tradition of tokenizing it as two words. (Just like we don't split "without" or "spreadsheet" or what have you.) So splitting it in UD would cause confusion IMO.

(There are probably more syntactic arguments that can be made here: e.g. stranding is possible, unlike semantically similar adjectives: I ate one muffin and now I want another/*an additional. But the bottom line is that another is normally regarded/tokenized as one word so it would require an extremely compelling reason to change that.)

Similar expressions in other languages may not be as far along in grammaticalization; we cannot expect the UD tree to be exactly parallel across languages.

Stormur commented 1 year ago

Here we are getting back to the issue of "what is a word"... I am just perplexed that a string which is by all means transparent in all its components and behaviour is kept together just for (motivated) orthographic conventions (as a note: in my opinion other should be considered a determiner rather than an adjective). The same cannot be said for without, I think. It may be just me, but I see much more confusion in keeping another together, while having an as a separate element in all other cases. But I fear the discussion would grow much more over the topic of DETs features here.

amir-zeldes commented 1 year ago

Splitting "another" would also lead to inconsistency with the tokenization in non-UD corpora, which I am very happy to say we have not diverged from so far. It's quite nice that tokenizers trained on the ~3M tokens in OntoNotes work quite well for UD data, since it's the same standard for what counts as a token.

Stormur commented 1 year ago

In fact, after some thought I recognise that splitting another might not be the ideal choice, principally because both split elements would be functional, and since it is ultimately irrelevant if we just consider that features like Definite=Ind can simply be annotated on it together with others. Also, what I substantially would vie for is to consider another as part of the paradigm of other, so maybe to lemmatise both the same way. But again, here the discussion goes beyond the topic of this issue, sorry.

nschneid commented 1 year ago

@amir-zeldes are you OK with the table in https://github.com/UniversalDependencies/docs/issues/971#issuecomment-1724761488?

sylvainkahane commented 1 year ago

Note that "lemma=other" exists with various DETs. In GUM: https://universal.grew.fr/?custom=650c7d874b8ac In EWT: https://universal.grew.fr/?custom=650c7cefaf4f0 By the way, it is annotated with upos=NOUN in GUM and upos=ADJ in EWT. What is remarkable is that this lemma can never have the DET "an". It is a very strange property. As a non-native speaker of English, I don't understand why "an other" is not possible and why it could be different from "another". So my intuition is the same as @Stormur and I don't see how you prove that "another" is something else than an orthographic quirk. @nschneid's answer was not sufficient.

nschneid commented 1 year ago

You are quite right that the blocking of the indefinite determiner is a strange property. It can be explained historically. But I have never seen an English dictionary, tokenizer, or reference grammar that treats "another" as an orthographic quirk. Here is CGEL for example:

[image: excerpt from CGEL]

amir-zeldes commented 1 year ago

> are you OK with the table

Sorry, probably, I need to find a free moment to go over it in detail - will post here again once I've had a chance

amir-zeldes commented 1 year ago

OK, I've had a closer look now - the only thing I would change there is not treating "both" as a cardinal number. I agree it normally implies that there are two things, but it's still different from a cardinal number which gets a NumType IMO. We also don't assign these features to words like "pair" or "decade", even though they imply a count. Half seems OK though, since that's actually the English name of that number.

nschneid commented 1 year ago

OK, removed NumType=Card for "both"

Stormur commented 1 year ago

I'd just like to note that NumType does not imply that a word is a cardinal number: that depends on the part of speech NUM, and if a word has one, it necessarily needs a NumType, while the converse is not true. The difference of both from two is given both by its part of speech and the presence of PronType.

As for considering a NumType for words like pair or decade, well... it actually does not sound too much an adventurous idea now that I hear it 🤔
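The one-way implication described in this comment (NUM requires NumType, but NumType does not require NUM) can be written down as a tiny check. This is only a sketch that takes the claim at face value; `numtype_ok` is a hypothetical helper, not part of any UD validation tool.

```python
def numtype_ok(upos, feats):
    """NUM must carry a NumType; any other UPOS may carry one optionally."""
    has_numtype = any(f.startswith("NumType=") for f in feats.split("|"))
    return has_numtype if upos == "NUM" else True


print(numtype_ok("NUM", "NumType=Card"))               # NUM with NumType
print(numtype_ok("NUM", "_"))                          # NUM without NumType
print(numtype_ok("DET", "NumType=Frac|PronType=Ind"))  # half as DET
print(numtype_ok("DET", "PronType=Tot"))               # both as DET
```

Under this reading, both/DET is acceptable with or without a NumType, so the check says nothing about which choice is better.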

amir-zeldes commented 1 year ago

OK, the table is now implemented in GUM as well (see the UD dev branch for results). Let me know if you notice anything off!