Closed AngledLuffa closed 1 year ago
Here is a table from Quirk et al. 1985:
The ASSERTIVE group, discussed starting on p. 383, also includes many, a few, much, little, less, least, fewer, fewest, numerical one, half, several, enough, other, others, another.
So it seems reasonable to classify these as PronType=Ind unless a more specific feature applies (e.g. PronType=Tot).
I take it for a given word, the PronType feature should be the same regardless of whether it is tagged as DET or PRON?
For words like another, there exists the feature PronType=Con, for "contrastive", which we are using in Latin. I think it can encompass (n)either (equivalent to Latin (ne)uter, as it were), and maybe some other elements, too.
For both I can suggest a treatment as Tot paired with a NumType. This is what we are doing for the equivalent ambo in Latin: this is also what comes out of the table with the label "count".
I think that the label PronType=Ind is overused at the moment: the fact is that more or less everything pointed to deictically could be seen as "indefinite" in a sense, but this is misleading.
I take it for a given word, the PronType feature should be the same regardless of whether it is tagged as DET or PRON?
I believe it is generally true, but sometimes in some languages people treat one string as homonymous even within the PRON category, and give it two different values of PronType based on context. (Personally I prefer to avoid this approach unless it's really coincidental string identity of words that also have different morphological paradigms.) If that happens, then you may want to project the distinction also across the PRON-DET boundary.
For both I can suggest a treatment as Tot paired with a NumType. This is what we are doing for the equivalent ambo in Latin: this is also what comes out of the table with the label "count".
NumType=Card because it implies exactly 2? I like that semantically. Currently the only DET receiving a NumType in English is half/NumType=Frac, I believe.
For words like another, there exists the feature PronType=Con, for "contrastive", which we are using in Latin. I think it can encompass (n)either (equivalent to Latin (ne)uter, as it were), and maybe some other elements, too.
Will this be added to the universal documentation as a possible value? If most treebanks don't use it, it may be easier to stick with Ind for now than to have to make decisions about what Con means in English.
Are we any closer to a definitive answer about what to do with another? That word in particular is different between the two biggest English treebanks, and because it's so evenly balanced, the models I train from the two keep switching between whether they give another a feature or not.
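To see how evenly split the annotation of a word is, one could tally the FEATS column per treebank. This is only a rough sketch: `feats_for_lemma` is a hypothetical helper, not part of any UD tool, and the two one-token samples are invented to show the shape of the comparison, not real treebank data.

```python
from collections import Counter

def feats_for_lemma(conllu_text, lemma, upos="DET"):
    """Count the FEATS values seen on tokens with the given lemma and UPOS."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Token lines have exactly 10 tab-separated columns; comments and blanks do not.
        if len(cols) == 10 and cols[2] == lemma and cols[3] == upos:
            counts[cols[5]] += 1
    return counts

# Invented one-token samples (assumption), standing in for two treebanks.
treebank_a = "1\tanother\tanother\tDET\tDT\t_\t2\tdet\t_\t_\n"
treebank_b = "1\tanother\tanother\tDET\tDT\tPronType=Ind\t2\tdet\t_\t_\n"

for name, text in [("A", treebank_a), ("B", treebank_b)]:
    print(name, dict(feats_for_lemma(text, "another")))
```

Running the same function over the real `.conllu` files would show how often each treebank leaves the FEATS column empty (`_`) for the word.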
It appears that PronType=Con is only implemented for Latin: https://universaldependencies.org/ext-feat-index.html#prontype
I'd rather not start using it for English unless other languages are planning to start using it. My familiarity with other languages is limited but it appears that some French treebanks use PronType=Ind for "autre" ('other'): https://universal.grew.fr/?custom=650642a6bf400 I would go with that for now in English and if there's a wider decision to adopt Con it would be easy enough to revise as it's lexical.
Makes sense. @amir-zeldes ?
Fine by me, I can add it in GUM/GENTLE. So just PronType=Ind for lemma="another", right? Any other changes needed in GUM?
Several others marked with ??? above are labeled differently between GUM and EWT: any, both, every, no, some, all, either, neither
but if we're happy with the current labels in GUM, then I suppose the change needed would be to add those features to EWT, not alter them in GUM
Art should just be a/an/the. The rest should be Tot, Ind, or Rcp I think. Plus PronType=Neg or NumType=Card as appropriate.
On Mon, Sep 18, 2023, 3:52 PM John Bauer @.***> wrote:
Several others marked with ??? above are labeled differently between GUM and EWT: any, both, every, no, some, all, either, neither
but if we're happy with the current labels in GUM, then I suppose the change needed would be to add those features to EWT, not alter them in GUM
How about the following as the features for DET (xpos DT or PDT or WDT):
Lexemes | PronType | Other feats |
---|---|---|
a, an | Art | Definite=Ind |
the | Art | Definite=Def |
this, that | Dem | Number=Sing |
these, those | Dem | Number=Plur |
yonder | Dem | |
all, each**, every | Tot | |
both | Tot | |
any, some, another, either | Ind | |
such*, quite*, many* | Ind | |
half* | Ind | NumType=Frac |
no, neither, nary* | Neg | |
which, what, whatever | Int or Rel | |

\* Only DET as a predeterminer (N.B. I am not thrilled with the current practice of tagging "such", "quite", and "many" as DET in their predeterminer uses as it seems like unnecessary multiplication of tag ambiguity, but it follows from treating every PDT as DET: UniversalDependencies/UD_English-EWT#412)

\*\* Except reciprocal each other: see PRON
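In case it helps anyone applying the table in a script, here is a minimal sketch of it as a Python lookup keyed by lowercased word form. The names `DET_FEATS`, `_add`, and `expected_feats` are hypothetical and not part of any UD tooling. The "Int or Rel" row is encoded as two acceptable alternatives, which is my reading of that row, and the footnoted restrictions (predeterminer-only uses, reciprocal each other) are not modelled.

```python
# Hypothetical lookup built from the table above; not an official UD resource.
# Each form maps to a list of acceptable feature sets.
DET_FEATS = {}

def _add(forms, *options):
    """Register one or more acceptable feature dicts for each form."""
    for form in forms:
        DET_FEATS[form] = [dict(option) for option in options]

_add(["a", "an"], {"PronType": "Art", "Definite": "Ind"})
_add(["the"], {"PronType": "Art", "Definite": "Def"})
_add(["this", "that"], {"PronType": "Dem", "Number": "Sing"})
_add(["these", "those"], {"PronType": "Dem", "Number": "Plur"})
_add(["yonder"], {"PronType": "Dem"})
_add(["all", "each", "every", "both"], {"PronType": "Tot"})
_add(["any", "some", "another", "either", "such", "quite", "many"],
     {"PronType": "Ind"})
_add(["half"], {"PronType": "Ind", "NumType": "Frac"})
_add(["no", "neither", "nary"], {"PronType": "Neg"})
# "Int or Rel": either value is accepted (assumption about how to read the row).
_add(["which", "what", "whatever"], {"PronType": "Int"}, {"PronType": "Rel"})

def expected_feats(form):
    """Return the acceptable feature sets for a determiner form, or None if unlisted."""
    return DET_FEATS.get(form.lower())
```

A form missing from the table, such as several, returns `None`, so callers can tell "unlisted" apart from "listed with a given set of features".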
such - there is a case where it has the xpos DT instead of PDT, which seems weird
@amir-zeldes I think this one is an error - should be JJ/ADJ
I've written a Grew-match request which "implements" @nschneid's table.
It might help to have a global view of the DETs that do not follow the table.
For instance:
Thanks @bguil. Looking at the queries made me realize there were issues with the last 2 rows of the table.
Because no indicates a negation of all items in a set, should it be PronType=Neg,Tot? Or would that be confusing?
Actually, I've changed my mind: let's go with plain PronType=Neg for no, neither, nary. Simpler is better I think (though an argument could be made that Tot and Neg are subtypes of Ind).
Because no indicates a negation of all items in a set, should it be PronType=Neg,Tot? Or would that be confusing?
That would be confusing. Neg is for no, Tot is for all.
I find PronType=Neg extremely problematic, if not misguided. I remember staring for a long time at the guidelines and comparing it to Polarity=Neg without understanding the rationale. Because what I observe is that all "negative pronouns" are pronouns/etc. of some kind with negative polarity: if we use PronType=Neg, we obscure the type. I would suggest ditching PronType=Neg and just using Polarity=Neg. So we will have indefinite negative pronouns/etc. as nobody = not any, contrastive negative pronouns/etc. as Lat. neuter = ne+uter "neither", and so on.
For both I can suggest a treatment as Tot paired with a NumType. This is what we are doing for the equivalent ambo in Latin: this is also what comes out of the table with the label "count".
NumType=Card because it implies exactly 2? I like that semantically. Currently the only DET receiving a NumType in English is half/NumType=Frac, I believe.
NumType=Card because it involves a specific (discrete) quantity, and then we also add NumValue=2. But it is clear that ambo/both are not numerals in the sense duo/two are. Then there are also indefinite cardinal quantities like many ("count" in the table), which also should get NumType=Card, but PronType=Ind. Then, something that we find in English and other languages, but not e.g. in Latin, are terms for non-discrete quantities ("non-count" in the table), for which we probably need a new value like NumType=Quant, or something similar.
For words like another, there exists the feature PronType=Con, for "contrastive", which we are using in Latin. I think it can encompass (n)either (equivalent to Latin (ne)uter, as it were), and maybe some other elements, too.
Will this be added to the universal documentation as a possible value? If most treebanks don't use it, it may be easier to stick with Ind for now than to have to make decisions about what Con means in English.
Why not start using Con anyway? Hoping that it will be "promoted" to universal status as I think it has a place there.
At the end of the day, at least for English, the Ind/Tot/Neg distinction strikes me as more semantic than morphological. (There is nothing in common between the forms all, each, and every that suggests they belong together.) But the features are not capturing the full semantics: all and every are not interchangeable, yet they have the same features. There doesn't seem to be a goal of developing fine-grained features such that every DET or PRON with a different meaning receives a different combination of features.
From my perspective the best we can do now is converge on a table for English that aligns with https://universaldependencies.org/u/feat/PronType.html. A finer-grained universal theory of these features might be worth developing in the future, but I don't see English morphology as providing much guidance beyond what is already in the universal guidelines (the main opportunity I do see would be to add a feature to group together the -ever items).
At the end of the day, at least for English, the Ind/Tot/Neg distinction strikes me as more semantic than morphological.
Well, of course it is. All PronTypes are by nature!
But the features are not capturing the full semantics: all and every are not interchangeable, yet they have the same features—there doesn't seem to be a goal of developing fine-grained features such that every DET or PRON with a different meaning receives a different combination of features.
Probably we will always have some kind of indeterminacy: else, at one extreme, we should have a specific tag for each form. But probably we can already do much with what we already have: for example, I was considering that every might get a NumType=Dist :thinking: I wonder how much one can proceed in this direction.
With regard to another, has a multiword treatment as an+other never been considered? It seems rather straightforward, it is just a contrastive element with a mark for indefiniteness (an). Maybe this would avoid being forced to choose a single label which will always be a "short blanket".
With regard to another, has a multiword treatment as an+other never been considered? It seems rather straightforward, it is just a contrastive element with a mark for indefiniteness (an).
Historically yes, synchronically no. It would be an error to say/write "I have an other idea". Note the syllabification of /əˈnʌð.ɚ/ (not /ən ˈʌð.ɚ/).
With regard to another, has a multiword treatment as an+other never been considered? It seems rather straightforward, it is just a contrastive element with a mark for indefiniteness (an).
Historically yes, synchronically no. It would be an error to say/write "I have an other idea". Note the syllabification of /əˈnʌð.ɚ/ (not /ən ˈʌð.ɚ/).
The orthography might mirror the phonetic coalescence of the two elements, but couldn't they still be considered "syntactically active"? Besides, how semantically different is another from an+other?
Now this might be a Pindaric flight, but if I think of Italian un altro / un'altra, it is written separately though phonetically it is just one unit, and there is absolutely no way to distinguish any different use as supposedly an other vs. another. Also, please correct me if I am wrong, but is it at all possible to use an other instead of the univerbation?
My point is that we are probably (once again) misled by orthographic conventions here.
I'm not sure how to test for a notion of "syntactically active" but my sense is that "another" is extremely well established as a word of English. It is listed in dictionaries (and not as a spelling variant of "an other"). As far as I am aware there is no tradition of tokenizing it as two words. (Just like we don't split "without" or "spreadsheet" or what have you.) So splitting it in UD would cause confusion IMO.
(There are probably more syntactic arguments that can be made here: e.g. stranding is possible, unlike semantically similar adjectives: I ate one muffin and now I want another/*an additional. But the bottom line is that another is normally regarded/tokenized as one word so it would require an extremely compelling reason to change that.)
Similar expressions in other languages may not be as far along in grammaticalization; we cannot expect the UD tree to be exactly parallel across languages.
Here we are getting back to the issue of "what is a word"... I am just perplexed that a string which is by all means transparent in all its components and behaviour is kept together just for (motivated) orthographic conventions (as a note: in my opinion other should be considered a determiner rather than an adjective). The same cannot be said for without, I think. It may be just me, but I see much more confusion in keeping another together, while having an as a separate element in all other cases. But I fear the discussion would grow far beyond the topic of DET features here.
Splitting "another" would also lead to inconsistency with the tokenization in non-UD corpora, which I am very happy to say we have not diverged from so far. It's quite nice that tokenizers trained on the ~3M tokens in OntoNotes work quite well for UD data, since it's the same standard for what counts as a token.
In fact, after some thought I recognise that splitting another might not be the ideal choice, principally because both split elements would be functional, and since it is ultimately irrelevant if we just consider that features like Definite=Ind can simply be annotated on it together with others. Also, what I would substantially argue for is to consider another as part of the paradigm of other, so maybe to lemmatise both the same way. But again, here the discussion goes beyond the topic of this issue, sorry.
@amir-zeldes are you OK with the table in https://github.com/UniversalDependencies/docs/issues/971#issuecomment-1724761488?
Note that "lemma=other" exists with various DETs. In GUM: https://universal.grew.fr/?custom=650c7d874b8ac In EWT: https://universal.grew.fr/?custom=650c7cefaf4f0 By the way, it is annotated with upos=NOUN in GUM and upos=ADJ in EWT. What is remarkable is that this lemma can never have the DET "an". It is a very strange property. As a non-native speaker of English, I don't understand why "an other" is not possible and why it could be different from "another". So my intuition is the same as @Stormur and I don't see how you prove that "another" is something else than an orthographic quirk. @nschneid's answer was not sufficient.
You are quite right that the blocking of the indefinite determiner is a strange property. It can be explained historically. But I have never seen an English dictionary, tokenizer, or reference grammar that treats "another" as an orthographic quirk. Here is CGEL for example:
are you OK with the table
Sorry, probably, I need to find a free moment to go over it in detail - will post here again once I've had a chance
OK, I've had a closer look now - the only thing I would change there is not treating "both" as a cardinal number. I agree it normally implies that there are two things, but it's still different from a cardinal number which gets a NumType IMO. We also don't assign these features to words like "pair" or "decade", even though they imply a count. Half seems OK though, since that's actually the English name of that number.
OK, removed NumType=Card for "both"
I'd just like to note that NumType does not imply that a word is a cardinal number: that depends on the part of speech NUM, and if a word has that part of speech, it necessarily needs a NumType, while the converse is not true. The difference of both from two is given both by its part of speech and the presence of PronType.
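The one-directional relation described here can be written as a tiny check. This is only a sketch of the claim as stated in this comment, not a reproduction of any official UD validation rule, and `numtype_violation` is a hypothetical name.

```python
def numtype_violation(upos, feats):
    """Return True only when a NUM token lacks NumType.

    A non-NUM token carrying NumType is not flagged: the implication
    runs from NUM to NumType, not the other way round.
    """
    return upos == "NUM" and "NumType" not in feats

# two as a numeral: NUM with NumType, fine.
print(numtype_violation("NUM", {"NumType": "Card"}))
# half as a determiner: NumType without NUM, also fine.
print(numtype_violation("DET", {"PronType": "Ind", "NumType": "Frac"}))
# A NUM with no NumType would be the only flagged case.
print(numtype_violation("NUM", {}))
```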
As for considering a NumType for words like pair or decade, well... it actually does not sound like too adventurous an idea now that I hear it :thinking:
OK, the table is now implemented in GUM as well (see the UD dev branch for results). Let me know if you notice anything off!
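For anyone who wants to look for deviations without going through Grew-match, a rough offline check over CoNLL-U text could look like the sketch below. Everything here is an assumption for illustration: `EXPECTED` holds only three rows of the table, `parse_feats` and `mismatched_dets` are hypothetical helpers, and matching is done on the lowercased word form.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string like 'Definite=Def|PronType=Art' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# Hypothetical excerpt of the agreed table; extend with the remaining rows as needed.
EXPECTED = {
    "the": {"PronType": "Art", "Definite": "Def"},
    "another": {"PronType": "Ind"},
    "no": {"PronType": "Neg"},
}

def mismatched_dets(conllu_text):
    """Yield (form, actual_feats) for DET tokens whose features differ from EXPECTED."""
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or cols[3] != "DET":
            continue  # not a well-formed token line, or not a determiner
        form = cols[1].lower()
        feats = parse_feats(cols[5])
        if form in EXPECTED and feats != EXPECTED[form]:
            yield cols[1], feats
```

Forms that are not in `EXPECTED` are silently skipped, so an empty result only means that the listed forms match, not that every DET in the file is covered.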
As discussed here, I suggest we create a DET table with the known determiners and their features for English. That way, we can unify that across treebanks.
https://github.com/UniversalDependencies/UD_English-EWT/issues/416#issuecomment-1685437948
As part of this process, I also suggest including features for words such as another, possibly by inventing a new feature which matches. Currently EWT has no features on another, whereas GUM has PronType=Art.
So, for example, we have
furthermore, there are a couple instances of a typo not getting the proper features (in EWT, I believe): his/this, Thi$/this
and then there's PUD which is labeling "those" with the lemma "those" as opposed to "these"
there's also other DETs with the xpos PDT, such as all, quite, half, both, such, nary, many