UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

`Voice` feature (for Turkish) #197

Open coltekin opened 9 years ago

coltekin commented 9 years ago

The current Voice specification in UD seems to have paid some attention to Turkish, but I still cannot fit the possible voice values into the Voice feature as specified.

Turkish verbs may inflect for four voice values, reciprocal, reflexive, causative and passive. Except for reflexive, these are already listed in current UD specification. The problem is with multiple voice values. A verb may be suffixed with multiple voice suffixes:

I cannot see a way to specify the features for cases like the last one (which are not uncommon at all). I see that multiple feature values are possible, but the specification states that Voice=Cau,Pass specifies alternatives, that is, only one of these values is correct, not both. Multiple voice values are possible in other combinations too, e.g., Rcp may combine with Pass or Caus. Furthermore, not only may Voice have two distinct values, but both the passive and the causative suffix can be repeated. So, we may have a verb with reciprocal-causative-causative-passive-passive inflections (see below for what that means).

METU-Sabancı treebank introduces a new inflectional group with every voice suffix. So, there are no overlapping feature values. I am in favor of treating voice as an inflection (not introducing new IGs) but we need a solution to express the cases above. I am aware of "layered features", but as far as I understand this does not fit into their use case either. The verb is not causative for some relation, and passive for another, it just has both "features".

My suggestion is promoting the 'voice' values to features, e.g., introducing features like Causative, Reciprocal, possibly overloading Reflexive, and maybe Passive, too. This does not solve the multiple Caus and Pass issue, but then, we can at least introduce some (language-specific) convention, e.g., marking double causative with Caus2 and triple causative with Caus3 etc.

The issue ends here. The rest is for background information and further discussion. Please feel free to skip.

There are more issues, especially around the causative, such as how to mark the subject of a causative verb. Like the subject of a passive verb, which would be marked nsubjpass, the subject of a causative verb is not the thing that actually carries out the action. I saw some discussion of this in the Japanese UD documentation. A language-general solution to this may be beneficial for other languages too. But let's keep this issue about multiple values. I will definitely bring up the causative subject issue again, unless someone points me to a clear solution.

Background: multiple causative and passive

In Turkish, the causative suffix can be attached to a verb multiple times.

Naturally, things get incomprehensible after a couple of them, but there is no clear reason to stop at an arbitrary point either.

Multiple passives, or passivizing an intransitive verb may not seem to make sense, but it happens in Turkish. These create so-called "impersonal passives". For example, buradan düşülebilir 'here.ABL fall.PASS.ABIL.AOR = one may fall from here' (see Göksel & Kerslake 2005, p.136).

Tugbapmy commented 8 years ago

It seems like a bit of a radical change to promote these voices to features for the sake of Turkish. The idea for multiple Caus and Pass voices is interesting, but the order of these voices in Turkish is not trivial. Perhaps voice features, as opposed to values separated by commas, could be multiple, showing them in the order in which the suffixes have been attached. Boya-t-ıl-t- would roughly mean "it was made to have been painted". The voice features would be Voice=Caus|Voice=Pass|Voice=Caus.
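As a rough illustration of what the repeated-feature notation would ask of tooling, here is a minimal Python sketch. It assumes only the pipe-separated `Name=Value` form used in the example above; the variable names are made up for illustration. If a tool reads the feature string into a key-value mapping, repeated `Voice` entries collapse into one, so the suffix order would have to be read as a list of pairs instead.

```python
# The ordered, repeated-feature string proposed above for boya-t-ıl-t-.
feats = "Voice=Caus|Voice=Pass|Voice=Caus"

# Reading the string as a list of (name, value) pairs keeps order and repeats.
pairs = [tuple(item.split("=", 1)) for item in feats.split("|")]

# Reading it as a mapping keeps only the last value for a repeated name.
as_dict = dict(pairs)

print(pairs)    # three Voice entries, in suffix order
print(as_dict)  # a single Voice entry
```

So the notation is workable only for tools that do not assume feature names are unique within a word.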

ftyers commented 8 years ago

This is not just relevant for Turkish, but also for all the other Turkic languages. However, I would probably consider erring on the side of creating new inflectional groups for valency changing verb derivation, where not lexicalised.

coltekin commented 8 years ago

[small keyboard/screen: apologies for potential formatting errors/typos]

Although I had also been quite convinced that the voice suffixes should introduce new IGs (see for example this comment), I do not think the current proposal is a "radical change". In fact, all grammars I am aware of define voice suffixes as "inflections", not derivations (or special suffixes in any other way).

I do not find valency change enough reason to introduce a new IG. The valency requirements of the inflected verb can always (well, almost) be inferred from the number of valency-changing inflections on it. And, this is probably all we care for syntactic analysis. (Furthermore, for Turkic languages, since any argument that can be inferred from the context may be elided, I am not sure if we get much information from valency in syntactic analysis either)

I think, two clear reasons for splitting words are, (1) parts have their own "inflections", (2) they participate in distinct syntactic relations. I think voice suffixes fail on both:

That being said, there may also be good reasons for splitting voice suffixes. Examples of the sort given by @Tugbapmy above are probably one of them. However, as a native speaker, after thinking hard about it for a while, I am still confused about the distinction between boya-t-ıl-t-tı "paint-CAUS-PASS-CAUS-PAST" and boya-t-tır-ıl-dı "paint-CAUS-CAUS-PASS-PAST" (I tried to tease them apart at the end of this comment for those who are curious, myself included). I do not think the order here matters a lot in practice. Furthermore, I could not find a single instance of the CAUS-PASS-CAUS combination in a web corpus consisting of 900M tokens. The CAUS-CAUS-PASS combination, by contrast, seems to be common (also note that the double causative is often used emphatically, with the same meaning as a single causative).

Another reason I can think of is to keep things parallel with the other languages, where similar structures are formed by relations between different syntactic units. However, I do not find this very convincing either. With the same reasoning, we could also argue that we should split the person agreement from finite verbs (they correspond to obligatory pronominal subjects in some languages).

In summary, although I am not very attached to the idea of not splitting voice suffixes, I do not see a very convincing reason for splitting them either. I think if we want to limit the number of sub-word units, voice suffixes are good candidates to be packed together with the main verb.

boya-t-tır-ıl-dı vs. boya-t-ıl-t-tı (for the curious)

Both forms are causativized twice and passivized once. So, there are three actors: one doing the painting, one causing that person to do the painting, and another one causing the second person to cause the first person to do the painting. The difference, seemingly, is the actor that the speaker does not reveal by passivization. To make it more concrete, here is a scenario (full of stereotypes): Ayşe wants the house to be painted, so she asks her husband (Ali), and Ali, instead of doing it himself, pays Ahmet to do the job.

My conclusion is that we would not lose much if we assumed 2xCAUS+PASS in both cases, disregarding the order. Unless list-valued features are allowed in UD, we cannot just pack these into a single feature, but we need multiple labels. In that case Causative=2|Passive=1 might work. But, of course, we can avoid all this (for Voice) by introducing new IGs for voice features, as is typically done in Turkish NLP since the METU-Sabancı treebank.
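To make the order-free counting concrete, here is a minimal Python sketch. It assumes hyphen-separated glosses like the ones used in this comment; the `GLOSS_TO_FEATURE` mapping and the `voice_counts` helper are hypothetical names, and `Causative`/`Passive` are only the feature names floated above, not existing UD features.

```python
from collections import Counter

# Hypothetical mapping from gloss labels to the feature names suggested above.
GLOSS_TO_FEATURE = {"CAUS": "Causative", "PASS": "Passive"}

def voice_counts(gloss):
    """Collapse the voice labels of a hyphenated gloss into order-free counts."""
    counts = Counter(
        GLOSS_TO_FEATURE[label]
        for label in gloss.split("-")
        if label in GLOSS_TO_FEATURE
    )
    # Emit the features sorted by name so that equal counts give equal strings.
    return "|".join(f"{name}={n}" for name, n in sorted(counts.items()))

print(voice_counts("paint-CAUS-CAUS-PASS-PAST"))  # Causative=2|Passive=1
print(voice_counts("paint-CAUS-PASS-CAUS-PAST"))  # Causative=2|Passive=1
```

Both suffix orders map to the same string, which is exactly the information this encoding gives up.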

dan-zeman commented 8 years ago

As I (a non-speaker of Turkish) see it, introducing new IGs (aka UD syntactic words) would help solve the problem but it would be a rather technical hack that at least @coltekin finds to poorly describe the inflectional nature of voice.

Well, here is another possible hack. Select one of the voices (perhaps the last one?) as the main one and create a new, language-specific feature for the rest. Users not specifically interested in / or knowledgeable about Turkish would just drop the language-specific feature. They would get some Voice information, it would not tell them much, probably it would just warn them that the valency is altered somehow. The other feature could be called Voice[add] (using the layered feature approach). I know this is different from the other cases for which we have used layered features but I do not see why not use them here and the current UD guidelines do not prevent us from doing so. The whole feature layering mechanism is just a technical device anyway.

Now there is the question of how to deal with a number of feature values that is in theory unlimited. I would not use the "multivalues" because elsewhere they are treated as sets, i.e. no value can occur twice and values are always ordered alphabetically. Neither do I like fixing the maximum number of values per word. From Çağrı's examples I got the impression that 3 would be enough, but if it is recursive, then it just does not seem right to set a fixed limit. That leaves us with one option, and that is an unlimited number of layers. So what about adding Voice[add1], Voice[add2] etc., with the numbers going as high as needed but probably not higher than 2, because we still have the "default layer", Voice without a specifier. The Turkish- (or Turkic-) specific documentation would have to say whether the default voice is the first or the last one, and then the additional layers would be numbered from that point away.
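A minimal Python sketch of this numbering scheme, under stated assumptions: the `layered_voice` helper and its `default` parameter are hypothetical, and the choice between "first" and "last" as the default layer is exactly the decision left to the language-specific documentation.

```python
def layered_voice(voices, default="last"):
    """Encode an ordered list of voice values as Voice plus Voice[addN] layers.

    `voices` is given in suffix order. The default layer is the first or the
    last suffix, and the extra layers are numbered moving away from it.
    """
    if not voices:
        return ""
    ordered = list(reversed(voices)) if default == "last" else list(voices)
    feats = [f"Voice={ordered[0]}"]
    feats += [f"Voice[add{i}]={v}" for i, v in enumerate(ordered[1:], start=1)]
    return "|".join(feats)

# CAUS-CAUS-PASS with the last suffix as the default layer.
print(layered_voice(["Cau", "Cau", "Pass"]))
# CAUS-PASS-CAUS with the first suffix as the default layer.
print(layered_voice(["Cau", "Pass", "Cau"], default="first"))
```

Unlike a set-valued feature, this keeps both repeated values and suffix order, at the cost of an open-ended set of feature names.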

dan-zeman commented 8 years ago

Another comment I have involves the Reflex voice. The original idea in Interset was that the Reflex feature, now only used for pronouns, could also be used to mark reflexive verbs. But reflexive verbs in Indo-European languages could be quite different from Turkish. There usually is the reflexive pronoun anyway, either as a separate word, or as a clitic incorporated into the verb. (And we will separate the incorporated clitics in UD again, so it will not be the verb that will bear the feature.)

I am starting to think that maybe it would be better to have the Reflex value for Voice as well. It would keep the voice system of Turkic languages together and it would probably look better than keeping one value as a separate feature. What do people think?

jonorthwash commented 8 years ago

The Apertium Turkic group considers so-called "reflexive voice" in most Turkic languages, including Turkish, to be derivational. It is not productive, and occurs only with a limited list of verbs. Turkic languages, like many Indo-European languages, mostly use reflexive pronouns for real reflexive meanings.

jonorthwash commented 8 years ago

As far as recursion, passive and causative are not infinitely recursable. In principle I would say it is a recursive system (you create a new verb stem, and can start over from there), but like most recursive systems in language, if you get more than a few layers deep, people don't deal with it well—they don't produce such forms themselves, and they parse them only with difficulty. That said, I think your proposal is sensible.

gulsenceb commented 8 years ago

I agree with @jonorthwash's comments (both for the reflexive voice to be considered as a derivation and the recursion). The reciprocal voice may also be treated similarly.

Voice[addn] type of additions seems also ok to me.

jonorthwash commented 8 years ago

I think there's a lot more variation in Turkic languages concerning reciprocal voice. In Kyrgyz, I would say it's reasonably productive, but I don't know Turkish well enough to speak to its use there. In Kyrgyz, it also has additional meanings, including that of cooperative voice. So forms can be ambiguous (depending on the verb's semantics) between doing something for/to each other, doing something to assist someone else, and doing something together.

Also, how reciprocal/cooperative interacts with passive, causative, etc., appears to be not fully understood in Turkic languages, and may vary by language. So to avoid having to struggle like this to arrive at a different solution for a slightly different pattern in a related language, we should be thinking as broadly as possible regarding solutions.

coltekin commented 8 years ago

On productivity of reflexive and reciprocal suffixes

I agree with the earlier comments that (for Turkish) the verbs that can be made reflexive and reciprocal are rather few, and if one chooses to do so, it can be treated as non-productive derivation, and can even be listed in the lexicon.

However, even if they are "lexical features", they are still features. I think there is a need for indicating these features, precisely for the same reason we would like to indicate the other voice features: words with those features assign different interpretations to their arguments.

Reflexive verbs and reflexive pronouns

Indeed, what we mean by a reflexive verb in Turkish is quite different than in other languages where a reflexive pronoun is required with a reflexive verb. In Turkish, a reflexive verb just indicates that the subject and the direct object are the same, without a requirement for a reflexive pronoun in the sentence.

Turkish also has a reflexive pronoun, and except these few verbs, making an action reflexive requires the use of the reflexive pronoun. In fact, one can combine a reflexive verb with a reflexive pronoun object, too. But the interpretation is different (the pronoun in that case is only used for emphasis).

Feature structure

I think we all agree that the current Voice specification does not cover Turkish (and likely other Turkic languages) well. These features are classified as single linguistic feature since they all change the interpretation of the arguments of a verb. In a nutshell: in case of passive we have a non-agent subject, causative introduces multiple agents where subject is not the agent of the main verb, reflexive forces subject to be the direct object, and reciprocal introduces multiple agents. However, their use, to some extent, is orthogonal to each other, a verb can carry multiple voice features. So, a single Voice feature with a set of alternative values is not sufficient.

I like the idea of layered features, to recognize their relation, but allow them to vary individually. The "hack" suggested by @dan-zeman would describe arbitrary combinations of voice suffixes, and it also covers the cases where the order of the suffixes matters. My worry is that these may not be useful as machine-learning features, or that it may not be easy to form useful queries. One can of course convert them to more directly useful features, but I think part of the aim of having unified annotation is to lift this burden from researchers working on a large number of languages.

My current understanding is that we can have four separate features (or layered/related features of the form Voice[cau], Voice[refl], etc.), with either boolean or numeric values (which would be safe to interpret as categorical). This would not be as flexible as Dan's suggestion, but it would be more readily usable for parsers and other potential machine learning applications. If we are fine with numeric features, this would also cover the recursive causative, and in combination with the verb's transitivity, we can also tell whether the verb carries an (ordinary) passive feature or an "impersonal (passive)". Alternatively, the passive-impersonal distinction could be marked explicitly during annotation. Of course, this would work fine only if the order of the suffixes is not significant. If we have evidence that the order matters, maybe an additional IG is not that much of a hack at all.

As a side note, at the moment, the Voice feature in UD, if used at all, seems to be almost synonymous with passive. The only language that documents the use of a value other than Act and Pass, namely Cau, is Hungarian, but the documentation is rather brief. If we want to keep compatibility with the current treebanks, we should also keep this in mind.

Features vs. affixes

I came across this issue in a few other places too, so I think it might be a good idea to mention it and ask for opinions. Although Turkish has a relatively transparent mapping between suffixes and their functions, even for Turkish one often cannot map a suffix directly to a useful feature. Relevant to the current discussion, a second causative suffix is redundant in most cases. The interpretation of two causative suffixes is the same as a single causative suffix (possibly with added emphasis).

In such cases, should the annotation reflect what is on the surface (two causative markers) or what the interpretation is (single causative)? Is there any UD guideline/standard on this, or is it left as a per-language or per-treebank decision?

dan-zeman commented 8 years ago

There is no UD guideline that the features shall reflect the surface affixes (and some universal features do not map to any affix because they are lexical). This would be a language-specific decision. Non-agglutinating languages often use one affix to express several different features at once. But you may want to discuss this with the Finnish and Hungarian groups, maybe they had similar needs and discussions.

As for the feature structure, I do not like Voice[cau] and Voice[refl] etc. with numeric values. This inverted perspective does not fit well in the broad picture of how universal features are used. As a side effect, their base name (without layer) should then be something else than Voice because the feature values would be completely different from the values of the “universal” Voice feature.

coltekin commented 8 years ago

Thanks for the clarification on features vs. suffixes. I guess there is no "pure" agglutinating vs. non-agglutinating language difference after all. Even for Turkish, which is generally considered a typical example of an agglutinating language, there are suffixes that express multiple functions/features. For Turkish, a direct mapping from the morphemes gives a good first approximation to the morphological features. In some cases one needs to be careful, since some morphemes have multiple functions and some function differently based on the morphological context, and in some other cases more annotation effort is needed to get them completely right (e.g., whether a double-causative construction expresses double or single causation).

I see the reverse logic between the layered features and this particular problem. I think this is because the mechanism is not a perfect fit for the problem. The ideal solution (in my opinion) should

  1. express multiple voice values simultaneously, allowing multiple values for the same voice.
  2. relate these features somehow as "voice" features, so that one can expect changes of valency and/or interpretation of the arguments.
  3. not be too difficult for a linguist to understand and use. For example, if we want to search a treebank for double-causative and passive verbs with a dative adjunct, it should not be too difficult to express this in a treebank query language. (e.g. assuming a TigerSearch-like query, something along the lines of [pos="VERB" & Voice:cau > 1 & Voice:pass > 0] > [pos="NOUN" & Case="Dat"] would work fine, but I cannot imagine this with the Voice[add] approach, although I admit that it is more expressive)
  4. (similar to 3) allow a machine learning method to use the features as is without lots of language-specific feature engineering going into the process.
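To illustrate points 3 and 4, here is a minimal Python sketch of the query above over numeric voice features. Everything in it is an assumption made for illustration: the token dictionaries, the `Voice[cau]`/`Voice[pass]` feature names with integer values, the reading of `>` as "is the head of", and the two-token toy input, which is invented and not taken from any treebank.

```python
def parse_feats(feats):
    """Parse a pipe-separated Name=Value string into a dict ('_' means none)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

def double_caus_passive_with_dative(tokens):
    """Return ids of verbs with Voice[cau] > 1, Voice[pass] > 0 and a dative noun dependent."""
    hits = []
    for tok in tokens:
        if tok["upos"] != "VERB":
            continue
        feats = parse_feats(tok["feats"])
        if int(feats.get("Voice[cau]", 0)) > 1 and int(feats.get("Voice[pass]", 0)) > 0:
            has_dative_noun = any(
                dep["head"] == tok["id"]
                and dep["upos"] == "NOUN"
                and parse_feats(dep["feats"]).get("Case") == "Dat"
                for dep in tokens
            )
            if has_dative_noun:
                hits.append(tok["id"])
    return hits

# Invented toy input: a dative noun (id 1) attached to a verb (id 2).
toy = [
    {"id": 1, "upos": "NOUN", "head": 2, "feats": "Case=Dat"},
    {"id": 2, "upos": "VERB", "head": 0, "feats": "Voice[cau]=2|Voice[pass]=1"},
]
print(double_caus_passive_with_dative(toy))  # [2]
```

With numeric values the condition is a plain comparison on one feature per voice; with the Voice[add] layers the same query would first have to collect and count values across all layers.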

I thought a notation like Voice[cau] fits my wishes above, and it would work fine as long as we acknowledge that this is some sort of subtype rather than a 'layered' feature. As Person refers to person agreement between subject and predicate, while Person[psor] refers to the person agreement between a noun and its possessor, Voice could refer to the passive-active distinction (after all, this is what Voice means currently), while Voice[cau] can indicate causativity. But I am happy with any working solution that allows us to go forward.

Introducing language-specific features for anything that does not fit into the current UD specification is tempting, since that would allow us to work faster. However, I do not like giving in quickly, as I think the main idea of the whole effort is to try to unify the annotations as much as possible. This also means changing the UD specification instead of trying to fit a language into the current UD scheme as much as possible and describing the rest as language-specific features. I definitely understand being conservative, since the existing treebanks may also need changes when the UD spec changes, but as long as the changes can be done without manual effort, that should not be a serious problem.

I am still not convinced about a single/particular solution. I am listing a couple of options below, with the hope that they may stimulate further discussion.

dan-zeman commented 4 years ago

@coltekin : I am not sure what the status of this issue is. It is more than 4 years old so it is quite possible that a common solution has been found and implemented in the meantime; on the other hand, your last post is quite open-ended. I am tentatively moving it to a new milestone but perhaps we can close the issue?

coltekin commented 4 years ago

I do not think we have a solution. The way we currently annotate similar "multi-valued" features (something like Voice=CauPass) keeps the information around, but makes the use of this information difficult (e.g., most systems/people will not associate CauPass with Cau or Pass).
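A minimal Python sketch of the extra step such fused values force on users. It assumes, purely for illustration, that a fused value is a concatenation of capitalized value names (as in CauPass); the `split_fused` and `has_voice` helpers are hypothetical.

```python
import re

def split_fused(value):
    """Split a fused value such as 'CauPass' into its capitalized parts."""
    return re.findall(r"[A-Z][a-z]*", value)

def has_voice(value, wanted):
    """Check whether a (possibly fused) Voice value includes the wanted voice."""
    return wanted in split_fused(value)

print(split_fused("CauPass"))        # ['Cau', 'Pass']
print(has_voice("CauPass", "Pass"))  # True
print("CauPass" == "Pass")           # False: a plain comparison misses it
```

A system that compares feature values as opaque strings would treat CauPass as unrelated to both Cau and Pass unless it applies a convention like this one.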

If there will be a major revision of morphological features (I vaguely recall plans of unifying with UniMorph), I think we should keep this problem in mind. Otherwise, I do not think we have any clear immediate solution to this problem - we can close the issue for now.

dan-zeman commented 4 years ago

A major reworking of the system of morphological features would probably have to be part of a new major version (v3) of the UD guidelines. I am going to change the milestone to "later" so that we do not have to process the issue during every release but hopefully it does not get forgotten when we discuss UD v3.