nschneid closed this issue 11 months ago
For the illiterate among us like me, I assume the unlinked CGEL reference is The Cambridge Grammar of the English Language? https://www.cambridge.org/features/linguistics/cgel/default.htm
My first remark is that UD features are defined as pertaining to individual syntactic words, not to larger units such as clauses. So, for example, we have a feature for past perfect (Tense=Pqp), which is used in Portuguese, where one of the morphological forms of the verb expresses this tense. It is never used in English, which also has the past perfect tense but expresses it periphrastically, so none of the participating words is specifically past perfect: I had.Tense=Past|VerbForm=Fin seen.Tense=Past|VerbForm=Part it. Along the same lines, you can have an interrogative clause, but in English the mood of the verbs inside it is indicative. Clause-level features would be an interesting enhancement but they should not appear in the FEATS column; MISC would be appropriate.
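To make the word-level vs. clause-level split concrete, here is a minimal Python sketch (not an official UD tool). It parses the FEATS and MISC columns of a hand-written CoNLL-U fragment for "I had seen it". The Tense and VerbForm values on "had" and "seen" follow the comment above; the other FEATS values are illustrative, and the MISC key ClTense is a made-up name used only to show where a clause-level value could live.

```python
def parse_feats(column: str) -> dict:
    """Parse a CoNLL-U FEATS or MISC column; "_" means the column is empty."""
    feats = {}
    if column == "_":
        return feats
    for pair in column.split("|"):
        key, _, value = pair.partition("=")
        feats[key] = value
    return feats


# Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
ROWS = [
    "1\tI\tI\tPRON\t_\tCase=Nom|Number=Sing|Person=1|PronType=Prs\t3\tnsubj\t_\t_",
    "2\thad\thave\tAUX\t_\tMood=Ind|Tense=Past|VerbForm=Fin\t3\taux\t_\t_",
    # ClTense is a hypothetical MISC key, not an established UD attribute.
    "3\tseen\tsee\tVERB\t_\tTense=Past|VerbForm=Part\t0\troot\t_\tClTense=Pqp",
    "4\tit\tit\tPRON\t_\tCase=Acc|Number=Sing|Person=3|PronType=Prs\t3\tobj\t_\t_",
]
TOKENS = [row.split("\t") for row in ROWS]

# No single word carries the pluperfect in FEATS ...
word_tenses = [parse_feats(tok[5]).get("Tense") for tok in TOKENS]
# ... but the clause head can carry a clause-level value in MISC.
clause_misc = parse_feats(TOKENS[2][9])

print(word_tenses)
print(clause_misc)
```

The point of the layout is that a search over FEATS for Tense=Pqp finds nothing in this sentence, while the clause-level information remains available to anyone who chooses to read MISC.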
Other comments:
Mood=Int already exists as a language-specific value because some languages have a morpheme that is added to the verb when it should be a yes-no question. It is currently used in Irish, Scottish Gaelic, Uyghur, and Yupik.
Annotating the comparative on English than would be IMHO OK: the same feature value is often used for morphological forms in one language, and for function words with the same function in other languages. If you want to do it, then it is definitely Case=Cmp, not Degree=Cmp. And as always, it would be nice if the maintainers of all the 9 English treebanks were persuaded to do the same.
> Mood=Int already exists as a language-specific value because some languages have a morpheme that is added to the verb when it should be a yes-no question. It is currently used in Irish, Scottish Gaelic, Uyghur, and Yupik.
Oh, good—should it be added to the universal guidelines page then?
> Along the same lines, you can have an interrogative clause, but in English the mood of the verbs inside it is indicative.
English (EWT & GUM) uses Mood=Ind on all VERBs with VerbForm=Fin as well as non-modal AUXes with VerbForm=Fin. But I'm not exactly sure of the rationale there (it basically means "finite non-modal"?). Note that a clause with a modal verb will not have a Mood feature on any word, because then the verb is not finite. Is the solution to label the modal AUXes with Mood=Pot ("can", "might", ...) and Mood=Nec ("should", "must")? What about future "will"?
> MISC would be appropriate.
Aha, now I see https://universaldependencies.org/misc#stype
> > Mood=Int already exists as a language-specific value ...
>
> Oh, good: should it be added to the universal guidelines page then?
I would be for adding it. It was among the candidates for extension of the feature-value space already in UD v2 but in the end it was not included. Now that it is actually attested in four UD languages (and possibly in some others that have the phenomenon but have not defined the feature value), it would make sense to me to promote it to one of the universal feature values. A few similar additions have silently happened since UD v2, although I try to be conservative and not to add everything I stumble upon.
> English (EWT & GUM) uses Mood=Ind on all VERBs with VerbForm=Fin as well as non-modal AUXes with VerbForm=Fin. But I'm not exactly sure of the rationale there (it basically means "finite non-modal"?). Note that a clause with a modal verb will not have a Mood feature on any word, because then the verb is not finite. Is the solution to label the modal AUXes with Mood=Pot ("can", "might", ...) and Mood=Nec ("should", "must")? What about future "will"?
I was not the one to design the English-specific guidelines, so I cannot explain the rationale behind them. One might argue that English (almost) does not have morphological mood, but there are some traces preserved: for example, 2nd person of to be is are/were in the indicative, but be in the imperative, and the same form can also be used in the subjunctive (see here for relevant statistics from EWT). Labeling the appropriate auxiliaries with Mood=Pot and Mood=Nec sounds good to me. As for will, I don't know why the English grammar classifies it as a modal. For me, it was always simply the future auxiliary. If I wanted to assign a Mood value to every AUX in English, then I would give it Mood=Ind.
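The labeling floated in this exchange can be written down as a small lookup. This is only a sketch of the proposal as discussed, not an adopted guideline: the table covers just the auxiliaries named in the thread, keying on the lemma is an assumption, and the function name proposed_mood is made up for the example.

```python
from typing import Optional

# Hypothetical lemma-to-Mood table, limited to the auxiliaries named above.
MODAL_MOOD = {
    "can": "Pot",
    "might": "Pot",
    "should": "Nec",
    "must": "Nec",
    "will": "Ind",  # treated as a plain future auxiliary in the reply above
}


def proposed_mood(lemma: str, upos: str, feats: dict) -> Optional[str]:
    """Return the Mood value this sketch would assign, or None for no Mood."""
    if upos == "AUX" and lemma in MODAL_MOOD:
        return MODAL_MOOD[lemma]
    if feats.get("VerbForm") == "Fin":
        # Keep an existing value (e.g. Imp); otherwise default to indicative.
        return feats.get("Mood", "Ind")
    return None


print(proposed_mood("might", "AUX", {"VerbForm": "Fin"}))
print(proposed_mood("be", "AUX", {"VerbForm": "Fin", "Mood": "Imp"}))
print(proposed_mood("go", "VERB", {"VerbForm": "Inf"}))
```

Under this sketch a clause with a modal would carry a Mood value on the modal itself, which addresses the gap noted above where no word in such a clause has Mood.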
I am with Dan on most of the issues about Mood. It is just not so clear to me why:
> Annotating the comparative on English than would be IMHO OK: the same feature value is often used for morphological forms in one language, and for function words with the same function in other languages. If you want to do it, then it is definitely Case=Cmp, not Degree=Cmp. And as always, it would be nice if the maintainers of all the 9 English treebanks are persuaded to do the same.
That is, why use Case instead of Degree? I, too, have been musing for some time on how to annotate the roughly equivalent tantus/tantum & co. with a value for degree, but I have not yet been able to make up my mind or formalise it well enough. But I was thinking of Degree, since it is an absolute parallel to an adjective bearing degree and "endowing" it to the phrase it belongs to; it just happens to be a DET/ADV. But maybe than has a very different function; is it not an SCONJ?
2. Exclamatives
Just my 2 cents on this: I feel the "exclamative" features are kind of ghost features, i.e. they do not truly represent anything. All these sentences like "What big eyes you have!" and their innumerable equivalents in other languages are all based on questions and question words which only pragmatically become exclamations (they are a sort of rhetorical question). So their being exclamative happens on a different annotation layer which is not in morphosyntax, where we can content ourselves with PronType=Int. It is also better for not further fragmenting the PronType space in my opinion.
> why using Case instead of Degree?
Degree is a feature of the quality being compared/graded (that is, of the adjective, in some languages adverb), while Case marks the standard of comparison. The English conjunction than marks the standard of comparison.
> I feel the "exclamative" features are kind of ghost features
Agreed.
> > why using Case instead of Degree?
>
> Degree is a feature of the quality being compared/graded (that is, of the adjective, in some languages adverb), while Case marks the standard of comparison. The English conjunction than marks the standard of comparison.
Oops, I got confused on this, probably because in the original post than is together with as, which appears in constructions like as good as... So do you think that a Degree=Equ would make sense for tantum 'so much' in Latin (which is a transparently derived adverbial form from tantus) & co.? It is used to prepare the ground for a comparison of equivalence; in Latin it's just not mandatory, probably this is because this kind of degree has never been widely considered.
I knowingly avoided reacting to the Latin thing, of which I know very little; but since you insist :-)
Assuming that it works similarly to Spanish tanto … como …, Degree=Equ probably isn't wrong. But the potential benefit of using the feature is not clear to me. It isn't a function word that provides another word with the equative degree while by default that word would be Degree=Pos, or is it? Also, tantum itself probably is not an equative inflection of a lemma that also has forms for other degrees.
This is a really complex issue... In general I agree that this is interesting information to encode somewhere, but I'm hesitant to call a lot of these things Mood. Maybe it's just the stuffy Indo-European tradition, but also in Afro-Asiatic the term mood (unlike 'modality') is generally used to apply to morphological forms of verbs (subjunctive, optative, imperative, etc.)
Looking at it from the UD perspective, as @dan-zeman pointed out, the Mood feature is word-level, as it has to be for inclusion in FEATS. Exclamatives and questions (which can be expressed using inversion, or just intonation) are more constructional features, and not really moods in the traditional sense, though I think they are more like 'sentence moods' or rough speech act types. For this reason, GUM annotates them at the sentence level using # s_type and the SPAAC tagset, whose values include wh for wh-questions and q for all other questions. However, a weakness of this is that, as @nschneid pointed out, these are really clause-level properties, and you could coordinate multiple types ("should we go and why?"). This leads to a somewhat arbitrary hierarchy of specificity in GUM and a type multiple, which is kind of useless.
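As a rough illustration of the sentence-level approach, the sketch below reads s_type comments from CoNLL-U text. The two sample sentences and their token annotations are hand-written for this example; only the comment key s_type and the values wh and q come from the description above.

```python
from typing import List, Optional

SAMPLE = """\
# text = Who left?
# s_type = wh
1\tWho\twho\tPRON\t_\tPronType=Int\t2\tnsubj\t_\t_
2\tleft\tleave\tVERB\t_\tMood=Ind|Tense=Past|VerbForm=Fin\t0\troot\t_\tSpaceAfter=No
3\t?\t?\tPUNCT\t_\t_\t2\tpunct\t_\t_

# text = Did you go?
# s_type = q
1\tDid\tdo\tAUX\t_\tMood=Ind|Tense=Past|VerbForm=Fin\t3\taux\t_\t_
2\tyou\tyou\tPRON\t_\tCase=Nom|Person=2|PronType=Prs\t3\tnsubj\t_\t_
3\tgo\tgo\tVERB\t_\tVerbForm=Inf\t0\troot\t_\tSpaceAfter=No
4\t?\t?\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""


def sentence_types(conllu_text: str) -> List[Optional[str]]:
    """Return the s_type comment value of each sentence (None if absent)."""
    result = []
    # Sentences in CoNLL-U are separated by a blank line.
    for block in conllu_text.strip().split("\n\n"):
        s_type = None
        for line in block.splitlines():
            if line.startswith("#") and "=" in line:
                key, _, value = line[1:].partition("=")
                if key.strip() == "s_type":
                    s_type = value.strip()
        result.append(s_type)
    return result


print(sentence_types(SAMPLE))
```

Because the value is attached to the whole sentence, a coordination of two clause types has nowhere to record both, which is the weakness described above.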
So maybe some kind of MISC annotation is the best way to encode it. Either way, I don't feel like UD is currently equipped for annotating 'constructions', so this kind of stuff is kind of above the level of the UD graph. Things like sentence-mood border on speech act annotation, and things like 'comparison' border on discourse functional annotation, both of which are usually implemented in frameworks outside of UD (but which we've also crammed into CoNLL-U MISC on occasion).
Let's discuss in our next meeting. I realize MISC is always an option, but my gut feeling is that it is unsatisfying because clauses (as I understand it) are part of UD's theory of syntax, so there is a gap if we say we have universal standards for word morphology and clause syntactic relations but not clause morphosyntax.
Relation subtypes are in principle another place this info could go (ccomp:q or ccomp:int or what have you), but this would get cluttered very quickly—features seem like the better solution. Or if there was a way to define features over basic UD edges....
I agree it would be nice to be able to annotate things like this in UD (and also, if we have a resource annotated for this kind of information in another framework, we could have a recommendation on how to represent it in CoNLL-U), but I'm not sure if this is really a feature of edges or more of (possibly nested) spans. Some of the things we were talking about above are regularly expressed in discourse parses, which do not have to mirror syntactic structure. I know it might seem intuitive to consider things like sentence-mood/speech act to be clause level, but it can get rather complex because we can have things like unlike-coordination with shared modifiers etc. and some of these things are really already in semantics, not morphosyntax (an as-clause implying a comparison is still just an advcl syntactically, and an exclamative can be just an NP syntactically, or an NP + a particle or adverb)
Yes these are related to things in semantics/pragmatics, but not identical: e.g. clauses with declarative syntax can serve as questions ("They already left?", "He ate what?").
Right, so would you want to tag those as interrogative or not?
No. "Interrogative" for the grammatical form with subject-auxiliary inversion etc., "question" for the meaning.
That would make things tricky cross-linguistically: "He ate what?" would be the canonical form of an interrogative sentence in Mandarin, and I wouldn't know what to tell non-English UD annotators to do then. I also don't find it as useful to only know when a sentence is a question AND has inversion, I'd rather know how to search for questions, or inversion, or both.
I would like to reopen this discussion for the case of interrogative words. I think that interrogative words should be treated like negative words. In Indo-European languages we do not have an interrogative mood nor a negative mood and Mood=Int would be inappropriate. We use pronouns and particles (and prosody and word order alternations) and the verb remains in the indicative mood. Negative pronouns are PronType=Neg. Negative particles are Polarity=Neg. Interrogative pronouns are PronType=Int. The feature for interrogative particles and constructions should be something of the form Xxx=Int.
This feature would be added on words such as En. whether or Fr. est-ce que /esk/, as well as on the main verb of an interrogative clause when there is no lexical marker. We will add this feature on our treebanks. Does anyone have a name to propose for Xxx?
PartType=Int is a reasonable annotation for question particles like [en] whether or [pl] czy. And obviously it is already used in some treebanks.
To add anything like that to the head of the clause when no lexical marker is present is problematic. The features in FEATS are annotations of words, not of clauses. If the word is not there then its annotation is not there either.
As I understand it, features that are really about clauses rather than words should go in MISC, at least for now. @sylvainkahane is that what you mean by marking interrogative "constructions"? If I were to invent a name for that it would be ClType.
Stype already appears in some treebanks. I can't tell if it is meant to be strictly morphosyntactic, or refers to the meaning (speech act), possibly signaled by punctuation. "You ate here yesterday?" would be interrogative meaning and intonation/punctuation but declarative word order/marking, so perhaps ClType=Decl|Stype=Int.
I agree with @nschneid and @dan-zeman: since this is not a feature of words, it should ideally not be in FEATS, and # Stype = or similar does much the same job at the sentence level in several corpora. In English GUM we put sentence forms based on the SPAAC scheme per sentence, which include values like wh for a WH question and q for a polar question, including for questions marked only by intonation.
I also agree with @nschneid that more properly this should be a property of clauses (and in fact this is why GUM inconveniently has a type multiple for when a sentence has two main clauses of different types, e.g. "Did you go and if so, when did you go?").
Thanks for your answers. I adopt PartType=Int for interrogative particles like whether or the French MWE est-ce que, ClType=Int for clauses with a (non-lexically) marked form like Do you agree? or French Es-tu ok?, and SentType=Int for interrogative sentences with or without a marked form, that is, You agree? as well as Do you agree?.
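One practical consequence of keeping ClType and SentType apart is that questions, marked interrogative forms, and their combinations can be searched for separately. The sketch below assumes the labels have already been collected per sentence into a plain dictionary; that storage format and the helper name find are made up for illustration, while the example sentences and labels follow the scheme just described.

```python
# Labels per example sentence, following the scheme described above.
# The dict-of-dicts layout is an assumption made for this sketch only.
EXAMPLES = {
    "You agree?": {"SentType": "Int"},
    "Do you agree?": {"ClType": "Int", "SentType": "Int"},
    "Es-tu ok?": {"ClType": "Int", "SentType": "Int"},
}


def find(examples, **wanted):
    """Return the sentences whose labels match every requested key/value.

    Passing None for a key selects sentences that lack that label.
    """
    return [
        text
        for text, labels in examples.items()
        if all(labels.get(key) == value for key, value in wanted.items())
    ]


all_questions = find(EXAMPLES, SentType="Int")
marked_form = find(EXAMPLES, ClType="Int")
unmarked_questions = find(EXAMPLES, SentType="Int", ClType=None)

print(all_questions)
print(marked_form)
print(unmarked_questions)
```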
As a non-native speaker I have a last question about the normalization of all these features. We already have PronType and PartType. Stype does not follow the same pattern and I suppose that SType or rather SentType will be better. Do you prefer ClType or a more explicit ClauseType?
I am also contemplating SubClType for subordinate clauses, e.g. ClType=Excl|SubClType=Cont for exclamative content clauses in English (nert-nlp/cgel#10). Hence the brevity of Cl rather than Clause.
One could imagine further Cl-prefixed features based on current features, e.g. ClAspect=Prog (which is expressed constructionally in English, hence we don't mark it on the verb).
Agree that lowercase "t" in Stype is anomalous.
> Agree that lowercase "t" in Stype is anomalous.
If I recall it correctly, Stype was an attempt to preserve annotation that existed in the pre-UD treebanks of Hindi and Urdu. I think it was all-lowercase there (stype=declarative), so it was just capitalized in our MISC to make it look more like the rest. Nothing similar was known elsewhere in the UD data of that time, and no attempt was made at UD-wide standardization. It could be converted to something else of course.
> It could be converted to something else of course.
No objection, I'm happy to switch any data I maintain to SType
> I am also contemplating SubClType for subordinate clauses
I think the taxonomy of clause types would be much broader than 'SType's, which only cover matrix clauses (declarative, imperative, interrogative, etc.), so that taxonomy would have to be developed in a universal way first. I also agree with Dan that these are not word properties; the current Stype annotations are CoNLL-U sentence-level comment annotations. If we did that for clauses, and wanted to avoid putting this on words, we would need to figure out a syntax for referring to spans of tokens in the sentence from sentence comments.
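To show what such a span syntax might look like, here is a purely hypothetical sketch. The comment key clause_types and the start-end:label format are invented for this example and are not part of CoNLL-U or of any treebank; token IDs are 1-based and inclusive, and the labels reuse the SPAAC-style values q and wh mentioned earlier in the thread.

```python
import re

FORMS = ["Did", "you", "go", "and", "if", "so", ",", "when", "did", "you", "go", "?"]

# Hypothetical sentence comment: each item is START-END:LABEL over token IDs.
COMMENT = "# clause_types = 1-3:q 4-11:wh"

SPAN_RE = re.compile(r"(\d+)-(\d+):(\w+)")


def parse_clause_types(comment: str):
    """Parse the invented clause_types comment into (start, end, label) triples."""
    key, _, value = comment.lstrip("# ").partition("=")
    if key.strip() != "clause_types":
        return []
    return [(int(a), int(b), label) for a, b, label in SPAN_RE.findall(value)]


spans = parse_clause_types(COMMENT)
for start, end, label in spans:
    # Token IDs are 1-based and inclusive, so slice from start - 1 up to end.
    print(label, " ".join(FORMS[start - 1:end]))
```

A flat list of spans like this would not by itself handle nested or discontinuous clauses, which is part of why the design question is left open above.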
In any case, annotating this for sizable resources would be challenging (unless done automatically), so until someone implements such an annotation project I guess we don't need to worry too much about the details?
For English, CGEL defines a taxonomy of clause types, each associated with properties of finiteness, word order, marking, and embedding (e.g. imperatives cannot be subordinate clauses).
To a large extent the main clause types overlap with the Mood feature, which allows Ind for indicative, Imp for imperative, Sub for subjunctive, and Opt for optative.

CGEL's subordinate clause types not covered by the above can largely be inferred from VerbForm (participial for Ger or Part, infinitival for Inf) or the acl:relcl deprel in the case of relative clauses.

Three major clause types are not readily expressed with current UD labels:

1. Interrogatives
There is PronType=Int to mark interrogative WH-pronouns (distinct from relative WH-pronouns), but this is not indicated at the clause level, so one has to resort to heuristics such as checking for subject-auxiliary inversion to identify interrogative clauses.
Would Mood=Int (or Mood=ClosedInt and Mood=OpenInt) be appropriate?

2. Exclamatives
Mood=Exc on the predicate? Note that we already have the option of PronType=Exc; it's not clear whether that should apply to "what" and "how" or if those should be left as PronType=Int, but in any case it seems that the clause as a whole is exclamative.

3. Comparatives
There are Case=Cmp and Degree=Cmp. Should one of those be applied to relevant instances of "than"/"as"/"like"? Or should comparative status be marked directly on the clause?

Would these categories be applicable to other languages as well?