UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Question particles #178

Closed jnivre closed 7 years ago

jnivre commented 9 years ago

Hindi (and I am sure many other languages) forms yes-no questions by adding a question particle to what would otherwise look like a declarative sentence, either sentence-initially or sentence-finally. What syntactic function should we assign to this particle? I can see three possible candidates, but none of them seems quite right:

mark - this function can definitely be used for something that marks an interrogative clause, but then typically a subordinate one; in the case of main clause questions there is no subordination involved

expl - this could in principle be used for any dependent of the main predicate that does not fill an argument role; but question particles seem rather different from expletive pronouns

discourse - another element loosely connected to the main predicate; but question particles are not really discourse particles, are they?

dan-zeman commented 9 years ago

I vote for mark. I like the analogy to subordinate clauses.

yoavg commented 9 years ago

In Hebrew we use the language-specific label aux:q for that, treating the particle / question word as an auxiliary verb (which makes sense in Hebrew syntax, not sure about the Hindi case).

jnivre commented 9 years ago

Interesting. I will ask our Hindi informant about this. Using "aux" is fine as long as there is some evidence that the thing is a verb, unlike the previous usage of "aux" for "to" in "to err is human", where there is no such evidence.

dan-zeman commented 9 years ago

Hindi क्या kyā is not a verb. In a wh-question it is the pronoun “what”. In a yes/no question it is used as a question particle.

jnivre commented 9 years ago

Thanks, Dan. So perhaps "expl" is not so far-fetched after all ...

yoavg commented 9 years ago

I do not like the expl option, as it seems to take on different meanings in different languages. Maybe we should consider adding a particle relation in future UD versions to support such non-verbal particles.

osenova commented 9 years ago

Hi All,

We also have in Bulgarian a question particle ‘li’ for yes-no questions. We had trouble finding a good relation, but for the moment we have resorted to using discourse, although the particle not only marks the whole sentence but also serves as a focalizer.

Best, Petya

jnivre commented 9 years ago

@yoavg Fair enough. It would be worth looking for other phenomena that could be annotated using the "particle" relation, so that we don't add a new relation for only one construction.

@osenova I think "discourse" is fine for now, but it would be good if you put something about this in the language-specific guidelines, at least in the documentation of the "discourse" relation, possibly also under "specific constructions".

yoavg commented 9 years ago

@jnivre why not? we added nsubj for only one construction... I think that if it is supported by enough languages, there is no reason not to adopt it. An alternative would be to rename expl to something more general, and treat the current "English" expletives as a language-specific subtype of it.

jnivre commented 9 years ago

@yoavg Point taken. But if it is only used for question particles, we might just call it "qpart" or "question" or something. In some ways, it would be similar to the neg relation.

yoavg commented 9 years ago

@jnivre these also make sense. Though I have a feeling that some other particles will crop up from some languages at some point. But I do like qpart in the sense that it is more precise.

Which leads to something I've been thinking about for a while regarding the "language specific" relations: perhaps we need to make a clearer distinction between "language specific" and "elaboration", the latter being for cases such as acl:relcl, which are actually very common in many (although not all) languages. In this sense maybe we should have particle:question, so that languages that have other particles could revert to just particle, or perhaps extend it differently. (I do realize there is a very fine line between the language-specific and elaboration cases, though, which may be hard to define).

jnivre commented 9 years ago

@yoavg Yeah, this is definitely worth thinking about. It is also related to the general point, discussed earlier, of whether the entire taxonomy should be hierarchical.

tlynn747 commented 9 years ago

Irish has the same question particles.

Mothaíonn tú sábháilte You feel safe

An mothaíonn tú sábháilte? Do you feel safe?

Nach mothaíonn tú sábháilte? Do you not feel safe?

In the Irish Dependency treebank, we use a 'vparticle' label for both of these.

In the Irish UD Treebank, we use a language-specific label 'mark:prt' for the interrogative cases, but we use the 'neg' label for the negative interrogative forms.

We're open to further discussion on these, but at the time of mapping from our own label set to UD, we felt that the various Irish particles didn't fit any of the main UD labels. In addition, 'mark' is used for an infinitive marker, which is a particle in Irish. Hence our subtype label 'mark:prt' covers most of these. Our documentation explains this in more detail.

Teresa

jnivre commented 9 years ago

This definitely seems to be something worth discussing for future releases (and revision of the guidelines). Perhaps it is time to make working groups again. But let's do v1.1 first.

ftyers commented 9 years ago

(Sorry for arriving late!)

Turkic and Uralic languages (and Avar, and probably tons of other languages) have this too.

I don't like the idea of a "particle" part of speech, so having a "particle" dependency relation would be even worse.

In Finnish the "-kO" is a clitic, so e.g. in the Turku treebank, it is probably marked morphologically (ex. 106 in the annotation guidelines gives "oliko" the "cop" label). In Estonian "kas" is not a clitic, but I can't find out how it is marked in the EDT.

In Turkish the "mI" is not a clitic (well, at least not in the orthography) and in the METU treebank it is marked with the dependency relation "Question-Particle". This is the only Turkic language with a treebank I'm aware of.

In the other Turkic languages you have the same or similar morpheme and sometimes it is written without a space and sometimes with and sometimes with a hyphen (Tatar: +мы, Tuvan: _бе, Chuvash: -и). It would make sense to have the same dependency relation for all of these.

My opinion (fwiw) at the moment is that the least bad relation would be "discourse", provided that the particle doesn't change the sentence syntactically. I wouldn't use this relation for the "ли" in Slavic languages, for example.

Sorry if this post has made things more confusing rather than less.

riyazbhat commented 9 years ago

Hi all,

Apart from kya, Hindi-Urdu has other particles as well, which are mostly emphatic in function and occur freely around the head noun. Mark may probably work for kya, as it seems to introduce or mark an interrogative clause, but emphatic particles like to, hi, bhi etc. modify a noun. Currently I have used the dep relation for all the particles in the universal version of both the Hindi and Urdu treebanks. In other Indian languages like Telugu, Tamil and Kashmiri, a yes-no question is marked using a clitic which is bound morphologically to its head.

PS. kya is marked as adv in both the Hindi and Urdu treebanks, while in Telugu and Kashmiri it's not separated from its head word and thus never participates in a dependency tree. Other emphatic particles are marked as lwg__rp.

dan-zeman commented 9 years ago

I would use the case relation for to, hi, bhi modifying a noun.

riyazbhat commented 9 years ago

@dan-zeman by using case relation we may lose the distinction between the case-clitics and these particles in the treebanks at the dependency level. Is that fine?

dan-zeman commented 9 years ago

It's true and I'm not saying it's fine because I'm against losing information in general. I'm not sure how much harm it would do... but it's probably better to use case than dep, and I think we occasionally use case for function words modifying nouns, even if they do not fall into the group traditionally described as case markers in the language.

We could also define a language-specific subclass (case:part?) to keep the distinction.

riyazbhat commented 9 years ago

I would prefer case:part, since we do have a distinction for other particles like negatives and are discussing a separate relation for question particles. It makes more sense to have a separate relation for these particles as well. However, case would also suffice, as we already have a distinction at the POS level between particles and case markers!

jnivre commented 9 years ago

What exactly does "emphatic" mean here? Are they related to demonstratives? Or do they mark information structure in some way?

riyazbhat commented 9 years ago

These particles are usually referred to as "emphatic" because they lend emphasis to the noun they modify. I guess they are relevant to the information structure, probably by marking the focus. Some of them are like the English adverbial "only" (quantificational function). In that sense, they are more like demonstratives.

jnivre commented 9 years ago

Thanks. Not an easy choice (and I guess it points to a gap in the taxonomy). I could imagine using dep:mark rather than case:mark, but I am not sure we should encourage language-specific subtypes of the negatively defined dep relation.

dan-zeman commented 9 years ago

@riyazbhat : I should have thought of this earlier. The semantics of these particles seems to overlap with something we actually had in source Czech annotation as well. I defined the language-specific relation advmod:emph to keep the distinction (see http://universaldependencies.github.io/docs/ext-dep-index.html or directly http://universaldependencies.github.io/docs/cs/dep/advmod-emph.html for examples). Perhaps we could use the same label in Hindi and Urdu?

riyazbhat commented 9 years ago

@dan-zeman Can you use "zvlášť" as a clausal adverbial, and is it always local to its modifying head? Could you please share some more examples with me? In Hindi, focus can be marked either by local or non-local adverbials or by emphatic particles. For the former, advmod:emph is perfect. The latter, however, are postpositional and are never treated as adverbials in the linguistics literature on Hindi syntax.

dan-zeman commented 9 years ago

Yes, zvlášť can be used as a clausal adverbial. It will lead to a different label in the source annotation, which translates as advmod in UD. What does it mean that it "is local to its modifying head"?

Here are some more examples of advmod:emph in Czech:

Mohli by obvinit i některého ministra. “They could prosecute also/even a minister.”

Začnou o měsíc později. lit. They-will-start even by month later. “They will start one month later.” ( expresses that the speaker or the listener did not expect the thing to happen that late.)

Ani vojáci o to nemají zájem. “Not even soldiers are interested in it.”

Hraje v sobotu. “He will play already on Saturday.”

Chceme se sejít ještě tento týden. lit. We-want to meet still this week. “We want to meet before this week ends.” (I also found occurrences of ještě that were clausal modifiers and yet were annotated advmod:emph.)

u asi 20 titulů “by around/approximately 20 items”

Dá se to dokumentovat právě na početné skupině dětí. “It can be shown just on a large group of children.”

I cannot judge whether advmod:emph can be reasonably applied to the emphatic postpositions in Hindi. I can only point out that in UD we sometimes have to suppress the traditional grammatical terminology of the language, in order to get similar phenomena in two languages under the same term (e.g., we use determiners in Slavic languages even though the category is not used in linguistic literature on these languages). But there definitely is some limit to doing so, and I know too little about Hindi to be able to tell where the limit is. As advmod:emph is a language-specific (rather than “universal”) relation, the urge to reuse it in other languages is also weaker.

riyazbhat commented 9 years ago

Thanks for the examples. The emphatic adverbials and emphatic particles differ in their distribution in Hindi and Urdu. Emphatic adverbials can be separated from their head by other constituents/phrases. Emphatic particles, however, always stay adjacent to their head, like case markers. This is what I meant by "local to its head". It's hard to infer from your examples whether Czech particles differ in their distribution. I will be using advmod:emph for emphatic adverbials. For other particles, I will wait till we reach a consensus on a particular label.

dan-zeman commented 9 years ago

I have not done detailed research but I believe that the Czech emphasizers more or less always immediately precede the noun phrase (or prepositional phrase) they emphasize. They are not necessarily adjacent to the head noun because there may be intervening prepositions and adjectives.

coltekin commented 9 years ago

On the original subject of question particles, I would also appreciate a common solution.

For Turkish, the solution suggested by @yoavg looks most appealing to me. The question particle in Turkish acts somewhat like a verb, it can carry some of the tense/aspect/mood suffixes as well.

My preference would be to mark it as AUX and indicate that it introduces a question by a feature (PronType=Int maybe, but I also think PronType is too overloaded.) and mark the relation to the main predicate as aux, or specify further with aux:q.

The second best solution is probably discourse, but the question particle is not like other loosely related/connected discourse markers. And, in fact, its function is not a lot different than modal auxiliaries.

dan-zeman commented 8 years ago

As there is UD_Hindi in the meantime, I searched it for क्या (see http://hdl.handle.net/11346/PMLTQ-9VJT) and found out that it is tagged PRON and attached as advmod. I am not sure whether this is the best solution though.

jnivre commented 8 years ago

PRON seems fine because (if I understand correctly) it is an interrogative pronoun, and "advmod" is not totally out either, since this is the function that negation would have if there wasn't a specialized neg. Of the syntactic alternatives proposed so far, I would argue against "mark" and "discourse". For "mark", the problem is that this is generally a marker of subordination, and there is no subordination here. For "discourse", the problem is that this is generally taken to signal some non-propositional meaning, and question status is at least closely related to propositional meaning. I think I would prefer "aux" in cases where the question particle behaves as an auxiliary verb, and "advmod" otherwise.

dan-zeman commented 8 years ago

There is an interrogative pronoun homonymous with this word. This is not the same as saying that the word is an interrogative pronoun (in any context), and I am not sure I want to say the latter. It is an instance of our form-function dilemma. Also, if it is PRON, wouldn't we use nmod instead of advmod?

I tried to do a survey of treebanks where we have question particles and what we know about them and/or do with them:

Hindi क्या / kyā is either an interrogative pronoun (or determiner), or an interrogative particle (Vincenc Pořízka: Hindī Language Course. SPN, Praha, 1986, pp. 330 – 332, § 125). According to Jain 1995 (Usha R. Jain: Introduction to Hindi Grammar. University of California at Berkeley, 1995, pp. 61 – 62), the question word क्या in yes/no questions is unstressed, “serves simply as a question marker and cannot be translated into English.” In information questions it is stressed and “is the equivalent of the English what.” I would prefer to take the function into account in this case, and use different POS tags in the two cases. For me, this looks similar to Spanish where que is either a subordinating conjunction, or an interrogative/relative pronoun/determiner. Most occurrences in UD_Hindi 1.2 are tagged PRON (103), some DET (25) and ADV (4).

Hebrew: interrogative particles are morphologically similar to verbs and they are analyzed as auxiliary verbs (AUX + aux:q, a language-specific subtype). This could probably work for Turkish, too.

Japanese: interrogative particle か / ka is in the universal guidelines explicitly mentioned as an example of PART. However, the Japanese documentation of PART does not mention it and the data does not contain words, so we cannot check how they are annotated. In the Japanese documentation of aux (http://universaldependencies.org/ja/dep/aux_.html) they say that it is attached as aux. They call it “question particle” but it is not apparent whether they really tag it PART. Statistics don’t seem to confirm that, there is no PART as dependent node, so maybe it is AUX?

In Bulgarian (see http://universaldependencies.org/bg/dep/discourse.html) the question particle ли is tagged PART and attached as discourse. But Bulgarian also attaches certain particles as aux. We should get straight whether anything other than AUX (e.g., a PART) is allowed to be attached as aux.

Croatian does not have documentation but according to PML-TQ (http://hdl.handle.net/11346/PMLTQ-AWTU), li is particle and attached sometimes as discourse, sometimes as mark.

Arabic هَل / hal is PART + aux. Another Arabic interrogative particle is أَ / ʾa and it is tagged PART. In the beginning of the sentence it is attached as cc. [ar] هَل تَتَعَمَّقُ اَلأَزمَةُ اَلسِّيَاسِيَّةُ وَ اَلِاقتِصَادِيَّةُ فِي لُبنَانَ ؟ / Hal tataʿammaqu al-ʾazmatu as-siyāsīyatu wa al-i-ʼqtiṣādīyatu fī lubnāna? “Does deepen crisis political and economic in Lebanon?”
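Counts like the PRON/DET/ADV figures for क्या above can also be reproduced offline from a treebank file. The following is only a minimal sketch, assuming the standard 10-column, tab-separated CoNLL-U layout; the function name `count_annotations` and the toy `SAMPLE` data (with the placeholder form "kya") are made up for illustration and are not taken from any treebank or from the PML-TQ queries linked in this thread.

```python
from collections import Counter

def count_annotations(conllu_text, form):
    """Count (UPOS, DEPREL) pairs for every token whose FORM equals `form`."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges and empty nodes
        if cols[1] == form:
            counts[(cols[3], cols[7])] += 1
    return counts

# Toy data, invented for this sketch (not real annotation).
SAMPLE = "\n".join([
    "# toy sentence 1",
    "1\tkya\t_\tPRON\t_\t_\t2\tadvmod\t_\t_",
    "2\tX\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "",
    "# toy sentence 2",
    "1\tkya\t_\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tY\t_\tNOUN\t_\t_\t0\troot\t_\t_",
    "",
    "# toy sentence 3",
    "1\tkya\t_\tPRON\t_\t_\t2\tadvmod\t_\t_",
    "2\tZ\t_\tVERB\t_\t_\t0\troot\t_\t_",
])

for (upos, deprel), n in count_annotations(SAMPLE, "kya").most_common():
    print(upos, deprel, n)
```

Running the same function over each treebank with the relevant particle would give a quick cross-language table of tag and relation choices, which is roughly what the survey above does by hand.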

jnivre commented 8 years ago

I have changed the milestone to v2, since it seems we haven't reached consensus yet. I am assigning it to myself for now but would be happy to give it to someone else.

vinbo8 commented 8 years ago

Chiming in here with another parallel - I'm working on UD for Marathi (an Indo-Aryan language), and the question particle का is similar to Hindi, though as a wh-question, it's "why", not "what", which is more adverbial than pronominal. I was contemplating using advmod when I came across this thread - it seemed okay, given that the word is essentially an interrogative adverb.

There is a semantic difference depending on whether it occurs before or after the head verb, though - before the verb, it indicates a wh-question, whilst after, it's a question particle. For instance, contrast तो का जेवतो? to kā jevto? "Why does he eat?" / तो जेवतो का? to jevto kā? "Does he eat?". The former seems more suited to advmod than the latter. How about mark?

jnivre commented 8 years ago

Thanks for chiming in. I would advise against "mark", which is a marker of subordination. I assume we are talking about main interrogative clauses here. For the time being, I think "advmod" is the best choice. However, I think we need to think more about this for v2. In particular, I think we should have a systematic strategy for annotating mood (declarative, interrogative, imperative) and perhaps also polarity (affirmative, negative). It seems that all languages have means to form questions and negate sentences, and we want to be able to make systematic comparisons between them. Also from a more practical NLP perspective, knowing whether a sentence is declarative or interrogative, affirmative or negative, can be of paramount importance.

nschneid commented 8 years ago

I think we should have a systematic strategy for annotating mood (declarative, interrogative, imperative) and perhaps also polarity (affirmative, negative).

+1. Two related talks that I saw in Berlin:

The second paper observes that sentence type (declarative/imperative/question/etc.) has a substantial effect on accuracy of NLP systems, including dependency parsers. Including this in the UD parse (perhaps as a refinement of root and ccomp) could actually improve accuracy throughout the sentence by forcing the parser to take sentence type into account.

amir-zeldes commented 8 years ago

Thanks for bringing this up @nschneid . I agree that sentence types can give very important information and that they can be useful for parsing, but I don't know if I'd go so far as to put sentence types on the sentence or clause root labels (if then as a sublabel?). The fact that something is ccomp seems more relevant than whether or not it's a question, and I'm also not sure that that's always strictly 'syntactic' (many verbs of saying can take either a complement question clause or a declarative).

Generally I think adding these types to the universal inventory would mean making a statement about universal sentence types, which I'm not sure is something everyone is ready to get into. I should also point out that some of the sentence types in the paper @nschneid mentioned are more semantic, including 'modal' sentences, while others were more syntactic, such as 'infinitivally headed fragment'. Standardizing which kinds of sentences make sense across languages could be tricky (e.g. some languages have no finiteness, or no clear distinction between subjectless 'fragments' and pro-drop, etc.).

My own preference would be to add them optionally as a sentence level annotation, i.e. as a comment with a # if needed, e.g.:

# s_type=imp
1   Be  _   VB  VB  _   2   cop _   _
2   careful _   JJ  JJ  _   0   root    _   _
3   .   _   SENT    SENT    _   2   punct   _   _
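A sentence-level comment like this is cheap for tools to pick up. Here is a minimal sketch of reading it, assuming comments take the `# key=value` form shown above (with or without spaces around the equals sign). The function name `read_sentence_comments` is hypothetical, and `s_type` is just the key proposed in this comment, not an established convention.

```python
def read_sentence_comments(block):
    """Collect `# key=value` comment lines from one sentence block."""
    meta = {}
    for line in block.splitlines():
        if not line.startswith("#"):
            continue  # token lines start with an ID, not with '#'
        key, sep, value = line[1:].partition("=")
        if sep:  # ignore comments that are not key=value pairs
            meta[key.strip()] = value.strip()
    return meta

# The first two tokens of the example above, tab-separated.
EXAMPLE = "\n".join([
    "# s_type=imp",
    "1\tBe\t_\tVB\tVB\t_\t2\tcop\t_\t_",
    "2\tcareful\t_\tJJ\tJJ\t_\t0\troot\t_\t_",
])

print(read_sentence_comments(EXAMPLE)["s_type"])  # imp
```

Because the comment sits outside the token lines, parsers that ignore comments keep working unchanged, which fits the "optional" nature of the proposal.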

In the case of imperatives there's also a connection to speaker annotation, in the sense that one could conceivably think of annotating not just the speaker, but also the addressee.

nschneid commented 8 years ago

My own preference would be to add them optionally as a sentence level annotation

Wouldn't this neglect information about embedded sentences, e.g. reported speech? "'Why are you smiling?' John demanded." vs. "'Give me that cookie,' John demanded."

amir-zeldes commented 8 years ago

Yes, I agree that's the most compelling kind of case for putting it on the label. I do like this annotation layer a lot, don't get me wrong, I'm just wondering if it's something everyone will get on board with annotating, since it's a further level of information on some level.

Would you still want to keep ccomp as the primary label in the above cases? I would definitely like to have an easy way of finding all complement clauses.

nschneid commented 8 years ago

I'd be fine with root:imper, root:q, ccomp:imper, ccomp:q, etc.—assuming two of the sentence types are imperatives and questions. We could keep plain root and ccomp for declaratives.
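To illustrate that complement clauses stay easy to find under this scheme, here is a minimal sketch, assuming the sub-labels `imper` and `q` suggested above (neither is an agreed UD subtype) and the usual `relation:subtype` colon convention. The helper names `split_label` and `sentence_type` are hypothetical.

```python
# Proposed sub-labels from this comment; purely illustrative.
SENTENCE_TYPES = {"imper": "imperative", "q": "question"}

def split_label(deprel):
    """Split 'ccomp:q' into ('ccomp', 'q'); plain labels get an empty subtype."""
    base, _, subtype = deprel.partition(":")
    return base, subtype

def sentence_type(deprel):
    """Return the sentence type encoded on a root or ccomp label, else None."""
    base, subtype = split_label(deprel)
    if base not in ("root", "ccomp"):
        return None  # not a clause label covered by the proposal
    if subtype == "":
        return "declarative"  # plain root / ccomp
    return SENTENCE_TYPES.get(subtype)  # None for unrelated subtypes

labels = ["root", "ccomp:q", "nsubj", "ccomp:imper", "ccomp"]

# All complement clauses, regardless of sentence type.
complement_clauses = [l for l in labels if split_label(l)[0] == "ccomp"]
print(complement_clauses)

# Sentence type per clause label.
print([(l, sentence_type(l)) for l in labels if sentence_type(l)])
```

Matching on the part before the colon keeps ccomp as the primary label, so a query for complement clauses does not need to know which sentence types exist.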

spyysalo commented 7 years ago

Closing as there is no recent activity and the v2 guidelines are now being published. Please consider opening a new issue with reference to the new guidelines and this discussion if there are open questions relating to this issue.