UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Question particle and deprel #738

Open rueter opened 3 years ago

rueter commented 3 years ago

The question is one of the traditional 4 sentence types, where we have elicitation of information.

In the documentation for the POS tag PART, the examples include the question particle. In earlier work (UD_Erzya-JR), I opted to annotate it with the AUX part-of-speech tag and the dependency relation aux:q.

Since question particles in at least some languages might not necessarily depend directly upon the verb, use of PART would be in line with UD documentation. An extension of this issue, however, is what the dependency relation should be.

In the instance of Erzya, the question particle ли is a loan word, so my first reference would be to the Russian counterpart.

In UD_Russian-SynTagRus, # sent_id = 2003Uteshenie.xml5 # text = Получит ли он удовольствие от предлагаемого ему путешествия в страну стиховедения? 'Will he enjoy the proposed trip to the land of poetry?', we see that ли is marked Pos=PART with a deprel advmod.

The Estonian EWT UD, # sent_id = ewtb1_10002631 # text = Kas mul on õigus??? 'Am I right?', annotates kas as Pos=ADV with a deprel advmod.

The Hungarian UD, # sent_id = train-28 # text = Felvetődik a kérdés: vajon végtelenné válik? 'The question arises: will it become infinite/endless?', annotates vajon as Pos=ADV with a deprel advmod:que. The same annotation is used for the Hungarian hyphenated -e.

The Hebrew HTB, # sent_id = 863 # text = האם הזמן קובע מורטוריום ל"יזכור"? 'Does time set a moratorium on "remembering"?', annotates האם as Pos=ADV with a deprel mark:q.

The Japanese GSD, # sent_id = train-s442 # text = そうですか。 'Is that so', annotates the particle as Pos=PART with a deprel mark.

The Turkish IMST, # sent_id = mst-0331 # text = İşe yarar mı ki... 'Does it work...', annotates the particle as Pos=AUX with a deprel aux:q.

No, I do not have a command of all of these languages, but I assume that these questions to a great extent elicit yes/no answers.

In these few selected UD projects, the question particles are annotated as follows: POS = ADV, AUX, PART; deprel = advmod, advmod:que, aux:q, mark:q, mark.
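To make the spread of annotations easier to see, here is a minimal Python sketch that tallies the UPOS tags and deprels reported in the examples above. The rows are typed in by hand from this post, not read from the treebanks; the Japanese form か is read off the example sentence, and the names `OBSERVED` and `tally` are made up for illustration.

```python
from collections import Counter

# (treebank, form, UPOS, deprel) as reported in the examples above.
# Hand-entered from the post; the Japanese form is taken from そうですか。
OBSERVED = [
    ("UD_Russian-SynTagRus", "ли", "PART", "advmod"),
    ("Estonian EWT", "kas", "ADV", "advmod"),
    ("Hungarian", "vajon", "ADV", "advmod:que"),
    ("Hebrew HTB", "האם", "ADV", "mark:q"),
    ("Japanese GSD", "か", "PART", "mark"),
    ("Turkish IMST", "mı", "AUX", "aux:q"),
]


def tally(rows, column):
    """Count how often each value occurs in the given column of the rows."""
    return Counter(row[column] for row in rows)


upos_counts = tally(OBSERVED, 2)
deprel_counts = tally(OBSERVED, 3)
pair_counts = Counter((upos, deprel) for _, _, upos, deprel in OBSERVED)

# Print each UPOS/deprel combination with its frequency.
for (upos, deprel), n in pair_counts.most_common():
    print(f"{upos:5} {deprel:11} {n}")
```

With real data, the same tally could be run over the UPOS and DEPREL columns of the CoNLL-U files instead of a hand-entered list.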

What should we work towards? PART & mark:q ?

ftyers commented 3 years ago

The difference in Turkish is that the copula can follow the question word and is written contracted with it,

geliyor musun
gel-iyor mu-sun
come-PROG QST-SG2

I believe this is motivating the AUX analysis -- otherwise it would be PART. Although musun could equally well be split into PART + AUX.

For most languages with words like ли, mI etc. probably PART is the right part of speech (one of the very few instances where PART seems like a potentially reasonable category).

Note that Finnish has these too (-kO), but they are dealt with using the feature Clitic=Ko, although I do not like this solution. They also occur in North Sámi as -go or go, where they may appear attached or not, and are annotated with PART and discourse.

As for the relation, our options seem fairly limited and unsatisfactory:

discourse: This is used for interjections and other discourse particles and elements (which are not clearly linked to the structure of the sentence, except in an expressive way). We generally follow the guidelines of what the Penn Treebanks count as an INTJ. They define this to include: interjections (oh, uh-huh, Welcome), fillers (um, ah), and non-adverbial discourse markers (well, like, but not you know or actually).

I'm not sure we could say that the question words are "only linked to the structure of the sentence in an expressive way". However, question particles could potentially be treated like tag questions, no?

mark: A marker is the word marking a clause as subordinate to another clause. For a complement clause, this is words like [en] that or whether. For an adverbial clause, the marker is typically a subordinating conjunction like [en] while or although. The marker is a dependent of the subordinate clause head. In a relative clause, it is a normally uninflected word, which simply introduces a relative clause, such as [he] še. (In this last use, one needs to distinguish between relative clause markers, which are mark, from relative pronouns such as [en] who or that, which fill a regular verbal argument or modifier grammatical relation.)

This relation is really aimed at subordinate clauses, so for example the ли in Трудно сказать, поймут ли они установки этого задания. But the ли in Жалею ли о чем? is traditionally treated as a different part of speech and is clearly in a main clause.

advmod: An adverbial modifier of a word is a (non-clausal) adverb or adverbial phrase that serves to modify a predicate or a modifier word.

The issue here is that particles can often modify more than just verbs, they can be attached to anything to indicate that it is the focus of the question, so not just a predicate or modifier.

On balance, I think that discourse is probably the best bet, although we would need to adjust/bend the guidelines a bit regarding the clarification of "which are not clearly linked to the structure of the sentence, except in an expressive way".

sylvainkahane commented 3 years ago

Above all, what we need is a feature on the question marker indicating that it is a question marker. We have the feature Polarity=Neg for negation markers, but we do not seem to have an equivalent feature for yes/no question markers. Is that right? The only thing I can find on the feature page is PunctType=Qest for punctuation signs.

amir-zeldes commented 3 years ago

I think it's not necessarily a problem that the POS tags vary by language. POS is a morphological category, so it is also determined by morphological criteria, and not necessarily functional ones (i.e. whether a word functions as a question marker in context). As for deprel, it's a little trickier, but I can imagine some reasons, based on the syntax of the construction embedding the particle.

For the Hebrew example I am guessing the reasons for mark are that 'ha'im' is the same word used for reported conditionals, and is morphologically related to the word for 'if'. So it is basically the Hebrew equivalent of 'whether', and using it as a question marker is similar to the use of main clause 'ob' in German:

Similarly for Japanese, 'ka' can introduce a reported question, so it is standing in the typical mark position, like the quotative 'to' and similar words. For other languages, the question particle has nothing to do with the typical forms of subordination in the language, so there's less reason to think of mark etc. But I think that's OK - the guidelines for each language reflect the kind of syntax that these words have.

I agree with @sylvainkahane that a feature could be useful for cross-linguistic comparison and in many languages could be added relatively easily, though since xpos treats both the main and subordinate versions of words like German 'ob' the same, it's not 100% trivial to give them different feats without manual inspection (maybe using a heuristic looking for a subsequent question mark?).
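The question-mark heuristic mentioned here could be sketched roughly as follows in Python. This is only an illustration under simple assumptions: the sentence is already tokenized into a list of word forms, the function name `question_mark_follows` is hypothetical, and the German sentences are made-up examples.

```python
QUESTION_MARKS = ("?", "？")  # ASCII and full-width question marks


def question_mark_follows(forms, index):
    """Return True if any token after position `index` contains a question mark."""
    return any(
        mark in form
        for form in forms[index + 1:]
        for mark in QUESTION_MARKS
    )


# Made-up example sentences, already tokenized.
main_clause = ["Ob", "er", "kommt", "?"]
subordinate = ["Ich", "weiß", "nicht", ",", "ob", "er", "kommt", "."]

print(question_mark_follows(main_clause, 0))   # candidate 'Ob' at index 0
print(question_mark_follows(subordinate, 4))   # candidate 'ob' at index 4
```

The heuristic overgenerates when a subordinate 'ob' sits inside a main-clause question (e.g. "Weißt du, ob er kommt?"), so the output would still need manual checking.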

dan-zeman commented 3 years ago

This issue is a duplicate of #458 and #178 (also related is #454). Regrettably, the outcomes of those discussions have not made it to the documentation, and issue #458 is still open. Nevertheless, I believe (without re-reading the entire threads now) that the conclusion was that a question particle should be attached to the head of the clause as advmod unless there are good reasons to do something else. The mark relation is for subordination.

@sylvainkahane : The language-specific feature PartType=Int (interrogative, as with pronouns) has been used together with the PART UPOS tag.
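As a rough illustration of how tokens carrying this combination could be found in CoNLL-U data, here is a small Python sketch. The function names are hypothetical, and the sample row is hand-written for this comment (IDs, lemma, head, and the presence of PartType=Int are assumptions, not copied from any treebank).

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'A=x|B=y' into a dict ('_' means empty)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def is_interrogative_particle(conllu_line):
    """Check whether a 10-column CoNLL-U token line is PART with PartType=Int."""
    cols = conllu_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated CoNLL-U columns")
    upos = cols[3]                 # UPOS is the 4th column
    feats = parse_feats(cols[5])   # FEATS is the 6th column
    return upos == "PART" and feats.get("PartType") == "Int"


# Hand-written illustrative row; not taken from a released treebank.
ROW = "2\tли\tли\tPART\t_\tPartType=Int\t1\tadvmod\t_\t_"
print(is_interrogative_particle(ROW))
```

A check like this only finds particles that already carry the feature; treebanks that tag question particles as ADV or AUX would need a different query.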

Stormur commented 3 years ago

Coming late with some remarks.

In general, I agree with @ftyers and the analysis of such "interrogative particles" as PARTs depending as discourse, and this is actually what we are implementing in our Latin treebanks. Further, we are using (better: re-using) the feature PartType=Int, mentioned by @dan-zeman, although I find it unsatisfactory, for reasons that I explain below.

Now, more details about Latin... principally we have the clitic element -ne, which may be present in a yes-no question (and which probably comes from the identical negation, in a sense like "... isn't it?"). As many other "particles", it attaches to the second element of a clause:

meministine me in senatu dicere? 'do you remember (meministi) me saying [that] in the senate?'

The fact that it is not mandatory in such questions in Latin, and that it contributes to, but does not determine, the emphasis of the focused word (the one put at the beginning of the clause), seems to put it alongside other discourse markers (like quidem '~ indeed, truly') that have the same syntactic patterns (Wackernagel); hence the preference for the discourse deprel.

I would refrain from advmod because I don't see exactly the "modification" it brings to the predicate: roughly and naively speaking, in my opinion it does not express a particular "modality" of the predicate, it just gives the whole clause a particular "tone"; we are approaching the realm of pragmatics.

From what is shown in the original post, I think this can be argued for most of these interrogative elements, be it Finnish (very questionable treatment of -ko/kö) or Turkish (as @ftyers points out, the AUX is clearly the -sun part, not mu). For example, although I admit my ignorance here, the annotation of the Japanese ka as a mark seems to me an unjustified reinterpretation: in my experience, indirect questions very often just appear as clauses more or less juxtaposed to the main clause. On the contrary, examples such as ob er kommt are just subordinate clauses taken alone, so here we really have marks. But these are just my 2 cents on such cases.


In general, the advmod relation seems overloaded and overextended to me. Is negation (I don't go) really a modification of the predicate in the same vein as an adverb like fast (I go fast)? Could we postulate a very specific, but important, negation relation; or delegate something more to discourse; or maybe find a good label for relations that are rather pragmatic, but more "syntax-bound" than interjections, pertaining to elements which express some kind of pragmatically oriented "polarity", like negation, "interrogation", emphasis? Such a relation would probably be mostly used in conjunction with PARTs. In parallel, as mentioned by @sylvainkahane, since we already have a feature like Polarity=Neg, could we implement others like Interrogative=Yes or Emphasis=Yes (cf. #741)? In this sense, a PART would become a kind of more or less bound morpheme which "lends" a feature to another element, be it a word, a phrase or a clause/predicate. I see this concept as distinct from what I understand advmod to be. This would also be much more general than a POS-tailored feature like any PartType (it could cover, for example, an emphatic form of a noun, and I can also envision an interrogative mood of a verb...).

So this was it, please be kind with my adventurous proposals! :slightly_smiling_face:

dan-zeman commented 3 years ago

We already had a neg relation in the v1 guidelines and we removed it in v2 on the grounds that its difference from other modifiers was purely semantic :-)

On the other hand, something like Interrog=Yes or interrogative mood was briefly discussed for v2 guidelines but it was postponed because we did not see it attested in the data we had at that time.