UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Add features for interrogative, exclamative, and comparative clauses? #877

Closed nschneid closed 11 months ago

nschneid commented 2 years ago

For English, CGEL defines a taxonomy of clause types, each associated with properties of finiteness, word order, marking, and embedding (e.g. imperatives cannot be subordinate clauses).

To a large extent the main clause types overlap with the Mood feature, which allows Ind for indicative, Imp for imperative, Sub for subjunctive, and Opt for optative.

CGEL's subordinate clause types not covered by the above can largely be inferred from VerbForm (participial for Ger or Part, infinitival for Inf) or the acl:relcl deprel in the case of relative clauses.

Three major clause types are not readily expressed with current UD labels:

  1. Interrogatives

    • There are two kinds of English interrogatives: open (WH) and closed (polar and multiple choice).
    • We have PronType=Int to mark interrogative WH-pronouns (distinct from relative WH-pronouns), but this is not indicated at the clause level, so one has to resort to heuristics such as checking for subject-auxiliary inversion to identify interrogative clauses.
    • Would Mood=Int (or Mood=ClosedInt and Mood=OpenInt) be appropriate?
  2. Exclamatives

    • These start with "what" or "how", e.g., "What big eyes you have!", "What a nice day (it is)!", "How tall you are!", "I love what a nice day it is!", "I can't believe how tall you are", "I can't believe [how much money] you make". Note the lack of subject-auxiliary inversion in the main clauses. Subordinate clauses may be ambiguous between interrogative and exclamative interpretations.
    • Would it make sense to add Mood=Exc on the predicate? Note that we already have the option of PronType=Exc; it's not clear whether that should apply to "what" and "how" or if those should be left as PronType=Int, but in any case it seems that the clause as a whole is exclamative.
  3. Comparatives

    • These start with "than", "as", or "like" (though like-clauses are ambiguous; they can also be content clauses).
    • We currently have Case=Cmp and Degree=Cmp. Should one of those be applied to relevant instances of "than"/"as"/"like"? Or should comparative status be marked directly on the clause?

Would these categories be applicable to other languages as well?
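To make the "heuristics" under item 1 concrete, here is a minimal sketch of a subject-auxiliary inversion check over a basic UD tree in CoNLL-U. The function names are made up for illustration, and the rule (an aux or cop dependent that precedes the nsubj of the same head) is only an approximation: it also fires on non-interrogative inversion ("Never have I seen ...") and misses subject WH-questions ("Who left?").

```python
def parse_conllu(block):
    """Return a list of token dicts from one CoNLL-U sentence block."""
    tokens = []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip multiword ranges and empty nodes
            continue
        tokens.append({"id": int(cols[0]), "form": cols[1], "upos": cols[3],
                       "head": int(cols[6]), "deprel": cols[7]})
    return tokens


def inverted_heads(tokens):
    """IDs of clause heads that have an aux/cop dependent before their nsubj."""
    heads = set()
    for subj in tokens:
        if not subj["deprel"].startswith("nsubj"):
            continue
        for aux in tokens:
            if (aux["head"] == subj["head"]
                    and aux["deprel"].split(":")[0] in ("aux", "cop")
                    and aux["id"] < subj["id"]):
                heads.add(subj["head"])
    return heads


def rows(*tokens):
    """Build a CoNLL-U block from (id, form, upos, head, deprel) tuples."""
    return "\n".join(
        "\t".join([str(i), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])
        for i, form, upos, head, deprel in tokens
    )


QUESTION = rows((1, "Did", "AUX", 3, "aux"), (2, "you", "PRON", 3, "nsubj"),
                (3, "go", "VERB", 0, "root"), (4, "?", "PUNCT", 3, "punct"))
STATEMENT = rows((1, "You", "PRON", 2, "nsubj"), (2, "went", "VERB", 0, "root"),
                 (3, ".", "PUNCT", 2, "punct"))

print(inverted_heads(parse_conllu(QUESTION)))   # {3}
print(inverted_heads(parse_conllu(STATEMENT)))  # set()
```

Having to reconstruct clause type this way for every query is the gap a clause-level label would close.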

mehmetoguzderin commented 2 years ago

For the illiterate among us like me, I assume the unlinked CGEL reference is The Cambridge Grammar of the English Language? https://www.cambridge.org/features/linguistics/cgel/default.htm

dan-zeman commented 2 years ago

My first remark is that UD features are defined as pertaining to individual syntactic words, not to larger units such as clauses. So, for example, we have a feature for past perfect (Tense=Pqp), which is used in Portuguese, where one of the morphological forms of the verb expresses this tense. It is never used in English, which also has the past perfect tense, because there it is expressed periphrastically and none of the participating words is specifically past perfect: I had.Tense=Past|VerbForm=Fin seen.Tense=Past|VerbForm=Part it. Along the same lines, you can have an interrogative clause, but in English the mood of the verbs inside it is indicative. Clause-level features would be an interesting enhancement but they should not appear in the FEATS column; MISC would be appropriate.
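As a small illustration of the word-level point, the sketch below reads the FEATS strings from the gloss above and shows that each word of the English past perfect carries only a plain past tense value. The helper name parse_feats is an assumption for this example, not part of any UD tool.

```python
def parse_feats(feats):
    """Turn a FEATS string such as 'A=1|B=2' into a dict ('_' means no features)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# FEATS values as glossed above for "I had seen it".
past_perfect = [("had", "Tense=Past|VerbForm=Fin"),
                ("seen", "Tense=Past|VerbForm=Part")]

tense_by_word = {form: parse_feats(feats)["Tense"] for form, feats in past_perfect}
print(tense_by_word)  # {'had': 'Past', 'seen': 'Past'}
```

The pluperfect reading exists only at the level of the construction, which is exactly what FEATS is not designed to hold.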

Other comments:

Mood=Int already exists as a language-specific value because some languages have a morpheme that is added to the verb when it should be a yes-no question. It is currently used in Irish, Scottish Gaelic, Uyghur, and Yupik.

Annotating the comparative on English than would be IMHO OK – the same feature value is often used for morphological forms in one language, and for function words with the same function in other languages. If you want to do it, then it is definitely Case=Cmp, not Degree=Cmp. And as always, it would be nice if the maintainers of all the 9 English treebanks are persuaded to do the same.

nschneid commented 2 years ago

Mood=Int already exists as a language-specific value because some languages have a morpheme that is added to the verb when it should be a yes-no question. It is currently used in Irish, Scottish Gaelic, Uyghur, and Yupik.

Oh, good—should it be added to the universal guidelines page then?

Along the same lines, you can have an interrogative clause, but in English the mood of the verbs inside it is indicative.

English (EWT & GUM) uses Mood=Ind on all VERBs with VerbForm=Fin as well as non-modal AUXes with VerbForm=Fin. But I'm not exactly sure of the rationale there (it basically means "finite non-modal"?). Note that a clause with a modal verb will not have a Mood feature on any word, because then the verb is not finite. Is the solution to label the modal AUXes with Mood=Pot ("can", "might", ...) and Mood=Nec ("should", "must")? What about future "will"?

MISC would be appropriate.

Aha, now I see https://universaldependencies.org/misc#stype

dan-zeman commented 2 years ago

Mood=Int already exists as a language-specific value ...

Oh, good—should it be added to the universal guidelines page then?

I would be for adding it. It was among the candidates for extension of the feature-value space already in UD v2 but in the end it was not included. Now that it is actually attested in four UD languages (and possibly in some others that have the phenomenon but have not defined the feature value), it would make sense to me to promote it to one of the universal feature values. A few similar additions have silently happened since UD v2, although I try to be conservative and not to add everything I stumble upon.

English (EWT & GUM) uses Mood=Ind on all VERBs with VerbForm=Fin as well as non-modal AUXes with VerbForm=Fin. But I'm not exactly sure of the rationale there (it basically means "finite non-modal"?). Note that a clause with a modal verb will not have a Mood feature on any word, because then the verb is not finite. Is the solution to label the modal AUXes with Mood=Pot ("can", "might", ...) and Mood=Nec ("should", "must")? What about future "will"?

I was not the one to design the English-specific guidelines, so I cannot explain the rationale behind them. One might argue that English (almost) does not have morphological mood, but there are some traces preserved: for example, 2nd person of to be is are/were in the indicative, but be in the imperative, and the same form can also be used in the subjunctive (see here for relevant statistics from EWT). Labeling the appropriate auxiliaries with Mood=Pot and Mood=Nec sounds good to me. As for will, I don't know why the English grammar classifies it as a modal. For me, it was always simply the future auxiliary. If I wanted to assign a Mood value to every AUX in English, then I would give it Mood=Ind.

Stormur commented 2 years ago

I am with Dan on most of the issues about Mood. It is just not so clear to me why:

Annotating the comparative on English than would be IMHO OK – the same feature value is often used for morphological forms in one language, and for function words with the same function in other languages. If you want to do it, then it is definitely Case=Cmp, not Degree=Cmp. And as always, it would be nice if the maintainers of all the 9 English treebanks are persuaded to do the same.

That is, why using Case instead of Degree? I, too, have been musing for some time on how to annotate the roughly equivalent tantus/tantum &co. with a value for degree, but I have not yet been able to decide myself or formalise it well enough. But I was thinking of Degree, since it is an absolute parallel to an adjective bearing degree and "endowing" it to the phrase it belongs to, it just happens to be a DET/ADV. But maybe than has a very different function, is it not an SCONJ?

2. Exclamatives

Just my 2 cents on this: I feel the "exclamative" features are kind of ghost features, i.e. they do not truly represent anything. All these sentences like "What big eyes you have!" and their innumerable equivalents in other languages are all based on questions and question words which only pragmatically become exclamations (they are a sort of rhetorical question). So their being exclamative happens on a different annotation layer which is not in morphosyntax, where we can content ourselves with PronType=Int. It is also better for not further fragmenting the PronType space in my opinion.

dan-zeman commented 2 years ago

why using Case instead of Degree?

Degree is a feature of the quality being compared/graded (that is, of the adjective, in some languages adverb), while Case marks the standard of comparison. The English conjunction than marks the standard of comparison.
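A rough sketch of that division of labour, using "She is taller than you". The tree and the Case=Cmp on "than" are assumptions for illustration: they show the proposal discussed here, not necessarily what any released English treebank does, and all other features are omitted.

```python
# (form, upos, feats, head, deprel); heads are 1-based token positions.
SENT = [
    ("She", "PRON", "_", 3, "nsubj"),
    ("is", "AUX", "_", 3, "cop"),
    ("taller", "ADJ", "Degree=Cmp", 0, "root"),
    ("than", "ADP", "Case=Cmp", 5, "case"),   # proposed annotation
    ("you", "PRON", "_", 3, "obl"),
]


def comparison_roles(sent):
    """Map each word carrying a Cmp value to the role that value encodes."""
    roles = {}
    for form, upos, feats, head, deprel in sent:
        pairs = dict(p.split("=", 1) for p in feats.split("|")) if feats != "_" else {}
        if pairs.get("Degree") == "Cmp":
            roles[form] = "quality being graded"
        if pairs.get("Case") == "Cmp":
            # The word the marker attaches to is the standard of comparison.
            roles[form] = "marks standard of comparison: " + sent[head - 1][0]
    return roles


print(comparison_roles(SENT))
# {'taller': 'quality being graded', 'than': 'marks standard of comparison: you'}
```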

dan-zeman commented 2 years ago

I feel the "exclamative" features are kind of ghost features

Agreed.

Stormur commented 2 years ago

why using Case instead of Degree?

Degree is a feature of the quality being compared/graded (that is, of the adjective, in some languages adverb), while Case marks the standard of comparison. The English conjunction than marks the standard of comparison.

Oops, I got confused on this, probably because in the original post than is together with as, which appears in constructions like as good as... So do you think that a Degree=Equ would make sense for tantum 'so much' in Latin (which is a transparently derived adverbial form from tantus) & co.? It is used to prepare the ground for a comparison of equivalence; in Latin it's just not mandatory, probably this is because this kind of degree has never been widely considered.

dan-zeman commented 2 years ago

I knowingly avoided reacting to the Latin thing, of which I know very little; but since you insist :-)

Assuming that it works similarly to Spanish tanto … como …, Degree=Equ probably isn't wrong. But the potential benefit of using the feature is not clear to me. It isn't a function word that provides another word with the equative degree while by default that word would be Degree=Pos, or is it? Also, tantum itself probably is not an equative inflection of a lemma that also has forms for other degrees.

amir-zeldes commented 2 years ago

This is a really complex issue... In general I agree that this is interesting information to encode somewhere, but I'm hesitant to call a lot of these things Mood. Maybe it's just the stuffy Indo-European tradition, but also in Afro-Asiatic the term mood (unlike 'modality') is generally used to apply to morphological forms of verbs (subjunctive, optative, imperative, etc.)

Looking at it from the UD perspective, as @dan-zeman pointed out the Mood feature is word-level, as it has to be for inclusion in FEATS. Exclamatives and questions (which can be expressed using inversion, or just intonation) are more constructional features, and not really moods in the traditional sense, though I think they are more like 'sentence moods' or rough speech act types. For this reason, GUM annotates them at the sentence level using # s_type and the SPAAC tagset, whose values include wh for wh-questions and q for all other questions. However a weakness of this is that as @nschneid pointed out, these are really clause-level properties, and you could coordinate multiple types ("should we go and why?"). This leads to a somewhat arbitrary hierarchy of specificity in GUM and a type multiple, which is kind of useless.
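For readers who have not seen sentence-level comments, a minimal sketch of reading them from a CoNLL-U block follows. The sent_id and the token line are invented for the example; only the # s_type key and the wh value come from the description above.

```python
def sentence_comments(block):
    """Collect '# key = value' comment lines of one CoNLL-U sentence."""
    meta = {}
    for line in block.splitlines():
        if line.startswith("#") and "=" in line:
            key, value = line[1:].split("=", 1)
            meta[key.strip()] = value.strip()
    return meta


SAMPLE = "\n".join([
    "# sent_id = example-1",
    "# s_type = wh",
    "# text = Why should we go?",
    "1\tWhy\twhy\tADV\t_\tPronType=Int\t4\tadvmod\t_\t_",
])

print(sentence_comments(SAMPLE)["s_type"])  # wh
```

Because the comment holds a single value per sentence, a coordination of two clause types has nowhere to go, which is where a catch-all value like multiple comes from.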

So maybe some kind of MISC annotation is the best way to encode it. Either way, I don't feel like UD is currently equipped for annotating 'constructions', so this kind of stuff is kind of above the level of the UD graph. Things like sentence-mood border on speech act annotation, and things like 'comparison' border on discourse functional annotation, both of which are usually implemented in frameworks outside of UD (but which we've also crammed into CoNLL-U MISC on occasion).

nschneid commented 2 years ago

Let's discuss in our next meeting. I realize MISC is always an option, but my gut feeling is that it is unsatisfying because clauses (as I understand it) are part of UD's theory of syntax, so there is a gap if we say we have universal standards for word morphology and clause syntactic relations but not clause morphosyntax.

Relation subtypes are in principle another place this info could go (ccomp:q or ccomp:int or what have you), but this would get cluttered very quickly—features seem like the better solution. Or if there was a way to define features over basic UD edges....

amir-zeldes commented 2 years ago

I agree it would be nice to be able to annotate things like this in UD (and also, if we have a resource annotated for this kind of information in another framework, we could have a recommendation on how to represent it in CoNLL-U), but I'm not sure if this is really a feature of edges or more of (possibly nested) spans. Some of the things we were talking about above are regularly expressed in discourse parses, which do not have to mirror syntactic structure. I know it might seem intuitive to consider things like sentence-mood/speech act to be clause level, but it can get rather complex because we can have things like unlike-coordination with shared modifiers etc. and some of these things are really already in semantics, not morphosyntax (an as-clause implying a comparison is still just an advcl syntactically, and an exclamative can be just an NP syntactically, or an NP + a particle or adverb)

nschneid commented 2 years ago

Yes these are related to things in semantics/pragmatics, but not identical: e.g. clauses with declarative syntax can serve as questions ("They already left?", "He ate what?").

amir-zeldes commented 2 years ago

Right, so would you want to tag those as interrogative or not?

nschneid commented 2 years ago

No. "Interrogative" for the grammatical form with subject-auxiliary inversion etc., "question" for the meaning.

amir-zeldes commented 2 years ago

That would make things tricky cross-linguistically: "He ate what?" would be the canonical form of an interrogative sentence in Mandarin, and I wouldn't know what to tell non-English UD annotators to do then. I also don't find it as useful to only know when a sentence is a question AND has inversion, I'd rather know how to search for questions, or inversion, or both.

sylvainkahane commented 1 year ago

I would like to reopen this discussion for the case of interrogative words. I think that interrogative words should be treated like negative words. In Indo-European languages we do not have an interrogative mood nor a negative mood and Mood=Int would be inappropriate. We use pronouns and particles (and prosody and word order alternations) and the verb remains in the indicative mood. Negative pronouns are PronType=Neg. Negative particles are Polarity=Neg. Interrogative pronouns are PronType=Int. The feature for interrogative particles and constructions should be something of the form Xxx=Int. This feature would be added on words such as En. whether or Fr. est-ce que /esk/, as well as on the main verb of an interrogative clause when there is no lexical marker. We will add this feature on our treebanks. Has someone a name to propose for Xxx?

dan-zeman commented 1 year ago

PartType=Int is a reasonable annotation for question particles like [en] whether or [pl] czy. And obviously it is already used in some treebanks.

To add anything like that to the head of the clause when no lexical marker is present is problematic. The features in FEATS are annotations of words, not of clauses. If the word is not there then its annotation is not there either.

nschneid commented 1 year ago

As I understand it, features that are really about clauses rather than words should go in MISC, at least for now. @sylvainkahane is that what you mean by marking interrogative "constructions"? If I were to invent a name for that it would be ClType.

Stype already appears in some treebanks. I can't tell if it is meant to be strictly morphosyntactic, or refers to the meaning (speech act), possibly signaled by punctuation. "You ate here yesterday?" would be interrogative meaning and intonation/punctuation but declarative word order/marking, so perhaps ClType=Decl|Stype=Int.
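Here is a sketch of why keeping form and meaning as two separate attributes is useful for search. It assumes both are stored as MISC attributes on the clause head; ClType and the specific values are the hypothetical ones floated above, not an agreed standard.

```python
def parse_misc(misc):
    """Turn a MISC string such as 'A=1|B=2' into a dict ('_' means empty)."""
    if misc == "_":
        return {}
    return dict(pair.split("=", 1) for pair in misc.split("|"))


# (sentence text, MISC of the clause head); values are illustrative only.
EXAMPLES = [
    ("You ate here yesterday?", "ClType=Decl|Stype=Int"),
    ("Did you eat here yesterday?", "ClType=Int|Stype=Int"),
    ("You ate here yesterday.", "ClType=Decl|Stype=Decl"),
]


def select(examples, **wanted):
    """Return the texts whose MISC attributes match every requested key=value."""
    return [text for text, misc in examples
            if all(parse_misc(misc).get(k) == v for k, v in wanted.items())]


questions = select(EXAMPLES, Stype="Int")
declarative_questions = select(EXAMPLES, ClType="Decl", Stype="Int")
print(declarative_questions)  # ['You ate here yesterday?']
```

With the two dimensions kept apart, one can query for questions, for interrogative form, or for any combination of the two.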

amir-zeldes commented 1 year ago

I agree with @nschneid and @dan-zeman - since this is not a feature of words, it should ideally not be in FEATS, and # Stype = or similar does much the same job at the sentence level in several corpora. In English GUM we put sentence forms based on the SPAAC scheme per sentence, which include values like wh for a WH question and q for a polar question, including for questions marked only by intonation.

I also agree with @nschneid that more properly this should be a property of clauses (and in fact this is why GUM inconveniently has a type multiple for when a sentence has two main clauses of different types, e.g. "Did you go and if so, when did you go?").

sylvainkahane commented 1 year ago

Thanks for your answers. I adopt PartType=Int for interrogative particles like whether or the French MWE est-ce que, ClType=Int for clauses with a (non-lexically) marked form like Do you agree? or French Es-tu ok?, and SentType=Int for interrogative sentences with or without a marked form, that is, You agree? as well as Do you agree?. As a non-native speaker I have a last question about the normalization of all these features. We already have PronType and PartType. Stype does not follow the same pattern and I suppose that SType or rather SentType will be better. Do you prefer ClType or a more explicit ClauseType?

nschneid commented 1 year ago

I am also contemplating SubClType for subordinate clauses, e.g. ClType=Excl|SubClType=Cont for exclamative content clauses in English (nert-nlp/cgel#10). Hence the brevity of Cl rather than Clause.

One could imagine further Cl-prefixed features based on current features, e.g. ClAspect=Prog (which is expressed constructionally in English, hence we don't mark it on the verb).

Agree that lowercase "t" in Stype is anomalous.

dan-zeman commented 1 year ago

Agree that lowercase "t" in Stype is anomalous.

If I recall correctly, Stype was an attempt to preserve annotation that existed in the pre-UD treebanks of Hindi and Urdu. I think it was all-lowercase there (stype=declarative), so it was just capitalized in our MISC to make it look more like the rest. Nothing similar was known elsewhere in the UD data of that time and no attempt was made at UD-wide standardization. It could be converted to something else of course.
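Such a conversion would be mechanical. A minimal sketch, assuming the attribute sits in the MISC column as a key=value item; the function name and the sample MISC string are made up for the example.

```python
def rename_misc_key(misc, old, new):
    """Rename one key in a MISC string, keeping the order of attributes."""
    if misc == "_":
        return misc
    out = []
    for item in misc.split("|"):
        # partition() also copes with MISC items that have no '=' at all.
        key, sep, value = item.partition("=")
        out.append((new if key == old else key) + sep + value)
    return "|".join(out)


print(rename_misc_key("SpaceAfter=No|Stype=declarative", "Stype", "SType"))
# SpaceAfter=No|SType=declarative
```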

amir-zeldes commented 1 year ago

It could be converted to something else of course.

No objection, I'm happy to switch any data I maintain to SType

I am also contemplating SubClType for subordinate clauses

I think the taxonomy of clause types would be much broader than 'SType's, which only cover matrix clauses (declarative, imperative, interrogative, etc.), so that taxonomy would have to be developed in a universal way first. I also agree with Dan that these are not word properties. The current Stype annotations are CoNLL-U sentence-level comments. If we did that for clauses, and wanted to avoid putting this on words, we would need to figure out a syntax for referring to spans of tokens in the sentence from sentence comments.
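Purely to illustrate what such a syntax might look like (nothing below is an existing CoNLL-U or UD convention): suppose a hypothetical # cl_type comment held space-separated start-end:Value items over token IDs. For "Did you go and if so, when did you go?" the value could be "1-3:q 5-11:wh", reusing the SPAAC labels, and a parser for it is short.

```python
import re

# Hypothetical item format: <first token id>-<last token id>:<label>
SPAN_ITEM = re.compile(r"^(\d+)-(\d+):(\w+)$")


def parse_clause_spans(value):
    """Parse a hypothetical '# cl_type' value into (start, end, label) triples."""
    spans = []
    for item in value.split():
        match = SPAN_ITEM.match(item)
        if match is None:
            raise ValueError(f"bad span item: {item!r}")
        start, end, label = int(match.group(1)), int(match.group(2)), match.group(3)
        if start > end:
            raise ValueError(f"span runs backwards: {item!r}")
        spans.append((start, end, label))
    return spans


print(parse_clause_spans("1-3:q 5-11:wh"))  # [(1, 3, 'q'), (5, 11, 'wh')]
```

Contiguous ranges like these allow nesting but cannot describe discontinuous clauses, so a real proposal would need to settle that as well.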

In any case, annotating this for sizable resources would be challenging (unless done automatically), so until someone implements such an annotation project I guess we don't need to worry too much about the details?