Representation of `zero morphemes' in tokenization

UniversalDependencies / docs

Universal Dependencies online documentation

http://universaldependencies.org/

Apache License 2.0


Representation of `zero morphemes' in tokenization #125

Closed coltekin closed 6 years ago

coltekin commented 9 years ago

This may not exactly be an issue, but rather a question to which I could not find an answer in the documentation. I hope this is the correct platform to ask such questions.

I am trying to work with Turkish, and meanwhile, I wanted to try to map some Turkish treebank data to UD format.

The range expressions or multi-word tokens in the new format solve a problem in the earlier formats for Turkish (although for Turkish the proper term would probably be "multi-token words").

This covers the cases where a single word can have multiple syntactic units. For example, in the sentence Mavi arabadaydı (He/she/it was in the blue car), the copula is part of the second word. So, for syntactic analysis it has to be treated separately from the noun it is attached to. This sentence can now be tokenized as,

1   Mavi    
2-3 arabadaydı
2   araba
3   ydı
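
As a minimal sketch of how the range line keeps the surface sentence recoverable, the snippet below reads lines reduced to an ID and a FORM column (as in the example above) and rebuilds the surface tokens. The helper name `surface_tokens` is hypothetical, and this is not a full CoNLL-U reader.

```python
def surface_tokens(lines):
    """Return the surface tokens from "ID FORM" lines.

    A range line such as "2-3 arabadaydı" supplies the surface form for the
    syntactic words 2 and 3, so those words are skipped when detokenizing.
    """
    tokens = []
    covered_until = 0  # highest word ID already covered by a range line
    for line in lines:
        if not line.strip():
            continue
        tok_id, form = line.split()[:2]
        if "-" in tok_id:
            # Range line: take its form and remember which IDs it covers.
            covered_until = int(tok_id.split("-")[1])
            tokens.append(form)
        elif int(tok_id) > covered_until:
            # Ordinary word that is not part of a preceding range.
            tokens.append(form)
    return tokens


example = ["1 Mavi", "2-3 arabadaydı", "2 araba", "3 ydı"]
print(" ".join(surface_tokens(example)))
```

The syntactic words (araba, ydı) stay available for the dependency analysis, while the sentence text comes only from the range line.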

The segmentation of the word is sometimes non-trivial, and what lemma/stem to assign to non-word morpheme groups (typically called inflectional groups in the literature) is also not clear, but as long as we consistently use a single convention, that should be fine.

The issue arises when an inflectional group does not have a surface form. Following up in the sentence above, in the right context, Mavi araba could mean "it is the blue car". In this case, although we need to have a copula, we do not have a surface form for it.

My question is, is there a way in the CoNLL-U format to specify empty forms? Or is there any other/better way to express this?

dan-zeman commented 9 years ago

Is it necessary to have a node for the copula then? There are languages where copula-like constructions systematically omit the copula. E.g. in Russian, you'd say "eta mašina golubaja" lit. "this car blue". This is why copulas are attached as leaves in the UD trees, so that the resulting tree is as parallel as possible to languages that have overt copula. Here, "blue" will be the predicate and "car" will be attached to it as its subject. The copula will not be shown anywhere in the tree.

spyysalo commented 9 years ago

I am not able to comment on Turkish in particular, but this may be relevant (from http://universaldependencies.github.io/docs/u/overview/syntax.html):

The standard does not postulate and annotate “empty” things that do not appear in various languages

We systematically eliminate *null* nodes from TDT annotation in conversion to UD Finnish.

coltekin commented 9 years ago

Zero-derivation is not only for the copula. This is all over the place in the METU-Sabanci treebank (and in other analyses of Turkish). The "traditional" morphological analysis of Turkish assumes that words change their POS without an overt affix (and sometimes the affix will be overt in some contexts and not in others, depending on the phonological context). And some words can do that quite a few times (with or without overt suffixes). Just another example: the word okunmuş in Okunmuş kitap (The book that was/has been read) would be split as follows (the notation is from trmorph, but it will be the same in other analyses).

oku<V>
<pass><V><evid>
<0><Adj>

So, we start with the verb stem oku (read), which gets the passive marker. Here, the passive is analyzed as derivation, but this is not a problem: it could easily be converted into a morphological feature in UD. The problem is with the second derivation, where the verb becomes an adjective and modifies the next noun (this does not have to be passive; it can also be a verb with person agreement, okuduğum - the book that I read). The reason we do not just want to analyze the whole word as a single adjective is that the verb within the adjective can also get modifiers. For example, hızlı okunmuş kitap (the book that was read quickly) is possible. In this case the adverb hızlı modifies the verb, not the adjective.

I do not know if we can find alternatives for all these cases where we have a zero-form element, but I'm afraid the result may end up being rather unnatural, and it would probably also break compatibility with the earlier data.

If that would not break things badly, it would be best if there were a way to specify zero-forms.

Another possible approach is inserting a representative string instead of these zero forms (this might also help parsers trained on the data). Provided that one always uses the range expression to construct the form of the sentence, this might be fine. However, some of these zero-suffixes do not have any canonical surface form, and for some, one may end up with inconsistencies between the cases where there is a surface form and cases where there isn't.

Coming back to the copula: Turkish normally does have an overt copula, but it is a suffix (there is also a clitic form that is used less frequently), and sometimes the suffix form may disappear from the surface in some contexts.

@spyysalo thanks for the pointer.

Personally, I am not quite fond of empty elements either. But I do not see any other neat way of dealing with this case (and I am not the one who invented the zero-derivation solution, nor the use of sub-word units in syntax). Furthermore, I think this is somewhat different from the empty elements targeted by the quote above: these empty elements do not result from applying a particular syntactic theory, nor are they caused by some sort of omission (in certain types of language); they are the result of a morpho-phonological process.

As a side note, it seems the empty elements were used in the Turkish data in CoNLL-X, where they were represented with an underscore (I could not find how a literal underscore would be represented though).

dan-zeman commented 9 years ago

I know that it is customary for morphological analyzers of some languages (Turkish in particular and agglutinating languages in general) to tag derivational morphology, including all the part-of-speech changes on the path. I know that the Turkish CoNLL data from the METU Sabanci treebank used empty nodes to preserve this kind of information (I haven't seen any other treebank doing this, though). But I am still not convinced that dropping the empty nodes will have bad consequences (other than losing the derivational information of course—but it could be preserved in the MISC column or even in the POS column). Note that derivational processes take place in non-agglutinating languages as well, so e.g. in English one might want to encode the analysis of enlargements as large<A>en<V>ment<N>s.

I apologize if I am ignoring some fine points relevant to Turkish but for me, it is perfectly acceptable that an adverb modifies a verb, and if the verb is turned into adjective, then the adverb modifies the adjective. I can provide a similar example from Czech: přečíst means "to read", we can derive the adjective přečtený "something that has been read", and if you previously said rychle přečíst "to read quickly", you will now say e.g. rychle přečtená kniha "a book that has been read quickly". The dependency tree will be amod(kniha, přečtená); advmod(přečtená, rychle);.

On a more general note, http://universaldependencies.github.io/docs/u/overview/tokenization.html says that "there is no attempt at segmenting words into morphemes". The mechanism of word ranges was invented to represent fused words that mix two originally independent words that have different parts of speech and different functions in the sentence. They were not meant to encode derivation of one part of speech from another, as I see it.

spyysalo commented 9 years ago

There are related issues in Finnish. For example, for the word historiallisesti "historically" we get (from the Omorfi morpho analyzer)

historia<N><Der_llinen><A><Der_sti><Adv>

i.e. noun to adjective to adverb through two derivations. We took the perspective that UD does not seek to capture the derivation history (and could not easily do it in the general case) and opted to mark only the resulting POS and the last step of derivation (through the proposed feature Derivation, documentation largely pending). This is not ideal, but acceptable to us given the focus of the representation.
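
To make the "final POS plus last derivation step" idea concrete, here is a small sketch that reads an analysis string in the angle-bracket notation shown above. It assumes, for illustration only, that every tag starting with `Der_` is a derivation and every other tag is a POS tag; real Omorfi output contains other tag types, and `final_pos_and_derivation` is a hypothetical helper, not part of any tool.

```python
import re

TAG = re.compile(r"<([^<>]+)>")


def final_pos_and_derivation(analysis):
    """Return (final POS tag, last derivation tag or None) for one analysis."""
    tags = TAG.findall(analysis)
    # Assumption for this sketch: "Der_*" marks derivations, the rest are POS tags.
    derivations = [t for t in tags if t.startswith("Der_")]
    pos_tags = [t for t in tags if not t.startswith("Der_")]
    return pos_tags[-1], (derivations[-1] if derivations else None)


print(final_pos_and_derivation("historia<N><Der_llinen><A><Der_sti><Adv>"))
```

The intermediate steps (noun, then adjective) are deliberately dropped, which is exactly the information loss discussed in this thread.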

coltekin commented 9 years ago

OK. First, it seems I was being opportunistic by trying to use a feature for something that it was not meant for (that's why it is called "multi-word tokens" not "multi-token words" ;). But this does not solve the problem unfortunately.

For now, I'll leave the zero-form issue aside, and try to show why we need sub-word units for syntax.

Do I understand correctly that, in the Czech example parse above, the adverb modifies the adjective, but not the verb?

I think, at least in the Turkish case, this leads to wrong interpretation. Although one can argue that this is just a notational difference, it is not clear whether the adverb is modifying the action, or the adjective.

To make it a bit more concrete, another example: the typical analysis of arabadaydı in phrase mavi arabadaydı (he/she/it was in the blue car) in Turkish would have the following syntactic units:

araba<N><loc>
<0><V><cpl:past><3s>

Here, if we discard the derivation history, we would need to say that the adjective mavi (blue) modifies a verb. I think this is simply wrong. This would also interfere with the copula analysis in ways I am not able to get my head around right now.

Here is one more variation: the phrase mavi arabadakiler (the people/things in the blue car) would normally have the analysis:

araba<N><loc>
<ki><Adj>
<0><N><pl>

We can skip some of the details (like the zero derivation from an adjective to a noun); the crucial part is that if we let the adjective "blue" modify the final noun, that would make 'things in the car' blue, not the car itself. Here the blue object is definitely the car, not the things inside.

I think these are real reasons for sub-word units. But maybe I am so used to this mode of analysis that I cannot see it otherwise.

dan-zeman commented 9 years ago

I am not sure whether mavi arabadaydı is a problem (when the copula is there, you would have some degree of freedom to decide that it is actually similar to the fused words, and make it a separate node; not sure whether the zero-copula case could then be viewed as just an adjective or not).

But the mavi arabadakiler example appears to be a problem and I do not see many options for how to treat it. Splitting the noun into arabada and kiler (and saying that they are two separate syntactic words) might turn out to be the cleanest solution here. The question is whether to do this systematically for every ki derivation, or just in cases where one part is modified separately.

Note that the token-word section of the UD guidelines does not actually require that the two (or more) syntactic words are exact substrings of the surface token. Sometimes the underlying words have different forms when they stand alone, and sometimes it is not possible to draw the precise borderline between them in the surface token. So even if it is unavoidable to represent a syntactic word that is reduced to a zero morpheme on surface, it is in theory possible to make up a string that is the best possible representation of the syntactic word, and make that a new node.

yoavg commented 9 years ago

Sorry for being late in the discussion, but I agree with @coltekin that adding an "empty" element may be the cleanest solution in some cases. We have some similar cases in the Hebrew Treebank as well. In Hebrew, these are usually pronominal suffixes, as in the word בעודו, which means "while-he-was". The pronoun then participates as the subject of a verb. However, it is not clear which letters exactly are responsible for the pronoun. One could say it is the ו letter which is responsible, and that would not be unreasonable, but it will not be a consistent solution across all forms. Adding an extra prn word (or just using the corresponding pronoun, הוא) seems like a very natural and elegant solution here.

It is important to note that these are not really "empty things that do not appear". They do appear, and are produced deterministically and consistently by the morphological analyzer. The only sense in which they are empty is that it is not clear which particular sub-word unit is responsible for them. I argue that it is cleaner and more consistent to produce a single "empty" element for all of these cases, than to choose a semi-arbitrary sub-word unit in each case.

jnivre commented 9 years ago

Sorry for not responding to this sooner. I have been snowed under with other things and have only now surfaced and can get back to the exciting UD stuff. :)

First of all, I want to say this is an extremely important issue and a very good test case for the lexicalist approach in UD. If we cannot make things work for Turkish (and other similar languages), we cannot claim to provide a universal framework. I cannot claim to fully understand all the complexities yet, even though I do have some experience of working with the IG representation of Turkish in parsing, but it seems to me that there are at least two orthogonal issues involved here:

1. Do we need to segment words into smaller units?
2. Do we need to posit empty elements?

If the answer to 1 is no, then 2 only concerns empty word forms. If the answer is yes, we can also discuss whether we need empty subword units.

So far, the UD policy is clearly to answer no to both questions, with the important clarification that "word" should be understood to mean "syntactic word", not "phonological word" or "orthographic word". Hence, the copula cases appear unproblematic to me. When it is there, it is split off as all other clitic-like elements. When it is not there, it is not there.

Then we come to derivation. The general policy here is to only record the final product of the derivation, which works fine for most languages because derivation is lexicalized, not grammatical. Thus, using "run" as a noun is a case of zero derivation in English, but there is no need to represent this in a grammatical analysis, because it is assumed to be part of the lexicon. What makes Turkish different, if I understand correctly, is that derivation is much more productive and is used in ways that correspond to grammatical constructions in other languages. In fact, some of the cases from Turkish look more like incorporation than derivation to me. And we haven't discussed at all how to deal with incorporation, but my first instinct would be to say that incorporations should be split because they contain several syntactic words.

coltekin commented 9 years ago

The important issue is 1 above, I believe (although my original question was about 2). I think, saying "no" to empty word forms is easy for Turkish dependency parsing. 2 is only an issue together with 1.

I am not sure the issue addressed by "inflectional groups" in Turkish should be called incorporation (as far as I understand it), but it is definitely somewhat different from "typical" derivation in other languages.

I will try to summarize the issue and provide a few examples. I have not been in the field for long, so I might be missing some of the obvious solutions and/or problems. Corrections are welcome.

First, some examples:

Repeating my previous example above, the word arabadakiler in the sentence mavi arabadakiler 'people/things in the blue car' is typically analyzed as

araba<N><loc>   - car-LOC `in the car'
<ki><Adj>       - `that/which is in the car' (similar to English relative clauses)
<0><N><pl>      - `the (people/things) in the car'

Again skipping the intermediate step, there are two nouns here. And they can participate in different syntactic relationships. For example, in the example noun phrase, the thing that is blue (mavi) is the car, not the people in it. There is only one car, while there are multiple people inside the car. And, if the phrase would be the subject of a sentence, the subject would be the people inside the car, not the car.

Here is another example, this time involving verbs (inflectional groups (IGs) are separated with |).

Sorun     tarafların         konuşmamasıydı
sorun<N>  taraf<N><pl><gen>  konuş<V><neg>|<vn:inf><N><p3p>|<0><V><cpl:past><3s>
`the problem was (that) the parties (were) not talking (to each other)'.

Here the last word actually contains two predicates. The first is the verb 'speak/talk' that is nominalized after negative marker meaning roughly 'the state of not speaking' and then, becoming a predicate once more with a copula. (The analyses are again from TRmorph, not from the treebank, but it should suffice).
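
For readers who want to experiment with this notation, the following sketch splits an analysis string into inflectional groups at `|` and lists the tags of each group. It only assumes the notation used in this comment (tags in angle brackets, `<0>` for a zero morpheme); `split_igs` is a hypothetical helper and not part of TRmorph.

```python
import re

TAG = re.compile(r"<([^<>]+)>")


def split_igs(analysis):
    """Split an analysis string into inflectional groups (IGs)."""
    groups = []
    for chunk in analysis.split("|"):
        stem = chunk.split("<", 1)[0]  # text before the first tag, may be empty
        tags = TAG.findall(chunk)
        # An IG introduced by <0> has no surface material of its own.
        groups.append({"stem": stem, "tags": tags, "zero": "0" in tags})
    return groups


igs = split_igs("konuş<V><neg>|<vn:inf><N><p3p>|<0><V><cpl:past><3s>")
for ig in igs:
    print(ig)
```

Here the third group is flagged as zero-derived: it carries the copula and its agreement tags but contributes no stem.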

The issue here is that both predicates within this word have their own subjects. That is, the subject of konuş- 'talk' is taraflar 'the parties', and the subject of the copula is sorun 'the problem'. If we follow the UD analysis, and do not analyze the copula as the head, the head of the noun sorun is shifted to the second IG of the last word, but still the same word has two subject relations.

I will not attempt to analyze it, but here is a version of the "long word" example that the Turkish linguists like to use for impressing others:

Istanbul-lu-laş-tır-ama-dık-lar-ımız-dan-mış-sınız
‘You are (supposedly) one of those who we could not convert to an Istanbulite’

The problem is not the length in morphemes or letters. If we collapse all morphological processes, we would have difficulties expressing the differences between the syntactic analyses of the sentence above and just Istanbul or Istanbullu (say, as a response to the question 'where are you from').

Collapsing the syntactic units within the words like above, and assigning them to a single syntactic function will either make it wrong, or it will require working around the issues in not-so-principled ways.

Once we allow splitting the words, there are some other cases where subword units make the analysis "cleaner", IMO. For example, the causative suffix in Turkish is rather productive. It would not be as problematic to mark this as a morphological feature and not introduce a new unit there. However, it also changes the verb valency. For example, an intransitive verb like uyu 'sleep', when made causative, uyu-t 'to make someone sleep', becomes transitive. So, admitting a new IG also makes it more transparent that after the affix the verb becomes transitive, although the root is not.

There are quite a few other cases too that make the IG approach attractive. If we were talking only about "usual" derivation, I think we could do without it. For example, the suffix -cı quite productively forms occupation nouns (çikolata-cı 'person who makes or sells chocolate'). It would not hurt to hide the derivation from syntax here. Still, once we start splitting words, it is also tempting to analyze cases like sıcak çikolata-cı 'hot chocolate seller/maker' with the adjective modifying the non-derived noun rather than the derived one (the chocolate is hot, not the seller).

I think we can set some language-specific recommendations on the level of splitting, but in any case I believe there is a need for subword units for Turkish. I also expect this to be an issue for (at least) other Turkic languages as well (although I do not know/speak any of them).

Technically, the mechanism used in the CoNLL-U format for multiword tokens already solves the issue. The only thing needed is approval for using the same mechanism for multitoken words.

Zero morphemes

Null elements become an issue only when we admit that there are sub-word units. But these are not like the null elements postulated in some theories of syntax. For example, the null copula that started the discussion shows up overtly in some morphophonological environments. I think we can easily develop a convention for syntactic tokens of zero morphemes. After all, the token line in CoNLL-U will allow recovering the correct surface form.
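
One possible convention can be sketched as follows: the range line carries the real surface form, and the zero morpheme gets a made-up placeholder form on its own word line. The placeholder `ZERO-COP` and the helper `multiword_token_lines` are purely hypothetical, not something the CoNLL-U format or the UD guidelines define; only the ID and FORM columns are shown.

```python
def multiword_token_lines(first_id, surface, word_forms):
    """Build a range line plus one line per syntactic word (ID and FORM only)."""
    last_id = first_id + len(word_forms) - 1
    lines = [f"{first_id}-{last_id}\t{surface}"]
    for offset, form in enumerate(word_forms):
        lines.append(f"{first_id + offset}\t{form}")
    return lines


# "Mavi araba" read as "it is the blue car": the copula has no surface form,
# so its word line uses a hypothetical placeholder string.
lines = ["1\tMavi"] + multiword_token_lines(2, "araba", ["araba", "ZERO-COP"])
print("\n".join(lines))
```

Because the sentence text is always rebuilt from the range line, the placeholder never leaks into the surface string.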

jnivre commented 9 years ago

Thanks for clarifying. I think there are compelling arguments that we need to recognize (some of) the inflectional groups in Turkish. However, it is not clear to me why you call this a case of "multitoken words" as opposed to "multiword tokens". To me it just looks like a case where a (whitespace-delimited) token contains several syntactic words, for example, a noun (araba) and a relativizer (ki), somewhat like a clitic (although I understand that this may not be the correct analysis).

Similarly, in the second example, we have something like:

problem parties not-talking-is

To get a correct syntactic analysis, we need to split off the copula:

nsubj(problem, not-talking) cop(problem, is) nsubj(not-talking, parties)

The only special thing about "not-talking" is that it is a non-finite predicate, but I see no reason here to split it further into a verb and a nominal derivation suffix. In fact, this seems parallel (except for the enclitic copula) to the following English example:

The problem is the parties not talking

So, in conclusion, we need to segment some tokens into several syntactic words. These words will sometimes correspond to single IGs, sometimes to multiple IGs where the further segmentation is not needed for an adequate dependency analysis. The question whether some of these need to be empty is then the same question as whether we need empty words in general, to which the tentative answer is (still) no.

coltekin commented 9 years ago

The reason for the term "multitoken words" was because I thought calling (a group of) IGs "word" or "syntactic word" would raise a few eyebrows. Otherwise, I am definitely fine with name/term "multiword tokens".

About the analysis, if I understand copula analysis in English UD correctly, the analysis above should be something like:

nsubj(not-talking, problem) cop(not-talking, is) nsubj(not-talking, parties)

That is, 'not-talking' is what the 'parties' do, and 'not-talking' is also heading the copular construction (problem is ... not-talking).

The first problem I see is two nsubj relations for 'not-talking'. And, the second is, 'not-talking' here is a noun phrase. I understand its predicate function in nsubj(not-talking, parties) together with cop(not-talking, is). The relation nsubj(not-talking, problem) does not have an associated copula. So, I guess it should be a verb to be a predicate. Furthermore, the verb 'talk' in nsubj(not-talking, problem) is finite, it can have tense/aspect/modality and person agreement.
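
The "two nsubj relations on one head" concern can be checked mechanically. The sketch below parses relations written in the rel(head, dependent) notation used in this thread and reports heads that govern more than one dependent with a given label; the function names are hypothetical and the check says nothing about which analysis is linguistically right.

```python
import re
from collections import Counter

REL = re.compile(r"(\w+)\(([^,()]+),\s*([^()]+)\)")


def parse_relations(text):
    """Parse 'rel(head, dep) rel(head, dep) ...' into (rel, head, dep) triples."""
    return [(rel, head.strip(), dep.strip()) for rel, head, dep in REL.findall(text)]


def heads_with_repeated(relations, label="nsubj"):
    """Return heads that have more than one dependent with the given label."""
    counts = Counter(head for rel, head, _ in relations if rel == label)
    return sorted(head for head, n in counts.items() if n > 1)


single_node = parse_relations(
    "nsubj(not-talking, problem) cop(not-talking, is) nsubj(not-talking, parties)"
)
print(heads_with_repeated(single_node))
```

Splitting the word into two syntactic units would give each nsubj its own head, which is the motivation for the sub-word analysis.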

The above argument is just for clarification. As long as it is OK to use sub-word "syntactic words", I think the Turkish NLP community will be happy. I also get the message that for UD, one should be rather conservative when splitting words. But I believe the details can be settled with further discussion of each case/construction.

jnivre commented 9 years ago

The analysis of the English sentence should be the same as in Turkish:

nsubj(problem, not-talking) cop(problem, is) nsubj(not-talking, parties)

This is assuming that "problem" is the predicate, not the subject. Nonverbal predicates are fine in UD:

coltekin commented 9 years ago

In the example Turkish sentence, 'problem' is the subject (of the copula). One can turn it around, but the interpretation changes.

I'd translate the original example Sorun tarafların konuşmamasıydı as "the problem was (that) the parties did not talk". The analysis of this should be:

nsubj(not-talking_N, problem) cop(not-talking, is) nsubj(not-talking_V, parties)

If we swap the subject and the predicate, Tarafların konuşmaması sorundu, I'd translate it as something like "the fact that parties did not talk was a problem". Now it is

nsubj(problem, not-talking) cop(problem, is) nsubj(not-talking, parties)

Besides the change in the structure, it seems to have a side effect too. In the first one, I get a definite reading for "the problem". With the second it cannot be "the problem", it is indefinite.

manning commented 9 years ago

Sorry, I haven't been finding enough time to put my linguist hat on lately. I agree that it is important to have a good account of languages like Turkish, and that eventually a more detailed morphology will be necessary. But I would just comment now for anyone looking at this that while "syntactic incorporation" accounts of complex word forms dominate much of current theoretical syntax (at least in the U.S.A.) there is considerable work on lexicalist accounts of phenomena like incorporation. I'm less familiar with what has been done on Turkic languages, but in general for a lexicalist handling of word incorporation, you might look at Steve Anderson's work, such as this paper. Also relevant is Rosen, Sarah. 1989. Two types of noun incorporation: A lexical analysis. Language 65, 294-317.

coltekin commented 9 years ago

If I understand correctly (from what I read, including the references above), incorporation is embedding a word (a free morpheme, typically a noun) into another word (typically a verb). In the case of Turkish, the morphemes that cause word class/POS change are not really free morphemes.

There are a few cases that we need/want to admit a single surface word including multiple syntactic units:

the morpheme -ki that derives adjectives and (pro)nouns from nouns. In this case both nouns (the stem and the derived noun) can have different features (number/case/possession) and participate in different syntactic relations.
the copular marker (which is null in some cases -- the original subject of this thread) that turns nominals into verbs.
a few subordinating suffixes that form subordinate clauses from verb (phrase)s.
in some cases, assuming separate inflectional groups for a few other verbal suffixes, such as causative, may also be necessary or result in neater analyses.

I might be missing some others, but I think these are the main cases where it is difficult to get reasonable analyses by relying only on the final POS tag. In all of these cases the complex words do not include more than one free morpheme. The morphemes that create multiple syntactic units are bound morphemes (except maybe the copular marker, which has a rarely used clitic alternative). In all cases, the bound morpheme belongs to a closed class of morphemes. The functions of these morphemes are comparable to function words in English.

So, I think this is closer to derivation than incorporation. But, in any case, the analysis without admitting multiple syntactic units within a single surface word is difficult, likely more difficult than the typical noun incorporation case (where probably resulting surface word can be treated as a single phrase/syntactic unit without much difficulty).

jnivre commented 9 years ago

I think the crucial issue here is deciding where to draw the line between what we segment and what we don't segment. So far, we agree that clitics need to be split from their hosts and that contractions need to be undone, because the two elements occurring together may participate in different syntactic relations and one cannot be reduced to expressing a property of the other. For (other types of) bound morphemes, we have so far made the assumption that they should not be split off their hosts, and that their contribution should as far as possible be analyzed by means of morphological features on words. But perhaps there is a restricted class of morphemes used in productive derivations that should go with clitics in this respect. If so, the evidence should be of the same kind as that supporting the separation of clitics, namely that the two elements can participate in different syntactic relations, which cannot be captured by assigning a single dependency relation to the whole unit. Therefore, it would be very valuable if you could provide one example of each of the four cases you list with the kind of dependency analysis you would like to see in each case. I apologize if this is something that you have already done before, but it would be useful to have them all together as a basis for further discussion.

ftyers commented 9 years ago

Having read through this discussion, I think there is a relevant point that has not been included:

In Turkic languages (by and large), although the third person (present/future form) of the copula is "empty", it appears in 1st/2nd person forms and in other tense/aspect forms. This is not like in Russian, where there is no copula regardless of person "Я студент." "Она студентка." etc. So for example:

(Turkish)

Öğretmen = teacher = He|She|It|They (is|are) a teacher.

Öğretmenim = I am a teacher.

Öğretmeniydi = He|She|It|They was a teacher.

Öğretmeniydim = I was a teacher.

(Kyrgyz)

Мугалим = teacher = He|She|It|They (is|are) a teacher.

Мугалиммин = I am a teacher

In Kyrgyz there is no clitic form for the past:

Мугалим эле = He|She|It|They (was|were) a teacher

Мугалим элем = I was a teacher

(Kazakh)

Мұғалім = teacher = He|She|It|They (is|are) a teacher.

Мұғаліммін = I am a teacher

In Kazakh there is no clitic form for the past:

Мұғалім еді = He|She|It|They (was|were) a teacher.

Мұғалім едім = I was a teacher

(Tuvan)

Башкы = teacher = He|She|It|They (is|are) a teacher.

Башкы мен = I am a teacher.

In Tuvan there is also no clitic form of the past. Note that the 'мен' part has the same form as the personal pronoun, so you could have e.g.

Ол башкы = That teacher = He|She|It is a teacher

Мен башкы мен = I am a teacher

but note:

*Ол башкы ол

Note: Tuvan doesn't have a directly equivalent past copula form.

dan-zeman commented 9 years ago

For the record, Russian actually has overt copula in the past and future tenses, and in the conditional. The past copula still does not distinguish person (because past participles do not mark person in Slavic languages) but it distinguishes gender and number: Я был студентом. Она была студенткой. = “I was a student. She was a student.”

coltekin commented 9 years ago

Sorry for the slow response to the earlier request to document the cases where I think sub-word units should be used.

Here are the example cases (some repeated from the conversation above):

`-ki` suffix

This suffix derives adjectives/nominals from inflected nouns. The derived nouns may also be inflected again, and another -ki may be suffixed (two of them in a row is not that uncommon; more than two is rare). The important part for syntax is that multiple nominals in the same word may have repeated/different features, and participate in different syntactic relations.

Here is an example:

Mavi arabadakiler   gazete    okuyor.
blue car-LOC-KI-PLU newspaper read-PROG
`the ones in the blue car are reading newspapers'

The word arabadakiler refers to two related but separate entities. The first one is 'car', and the second one is people in the car.

While there is a single car, there are multiple people (here only people-IG has the plural feature, but car could be plural as well). While 'car' is in locative case, 'people' is nominative (or case is not specified).

Similarly, the adjective 'blue' refers to the car. But the subject of the predicate 'read' is not the car but the people in the car.

You can see an attempt to analyze the whole sentence above here. Underscores following a word represent the non-first IGs in the morphological analysis (we could possibly segment the words, but this is for another discussion). Following earlier conventions, this analysis assumes that -ki derives an adjective, and it becomes a (pro)noun by a zero derivation. I think the "zero derivation" can be collapsed, but we need both noun IGs for a reasonable dependency analysis. Following the CoNLL version of the METU-Sabancı treebank I marked the word-internal dependencies deriv. But these could be converted to more meaningful labels (e.g., nmod or amod in this case).
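
To illustrate the point in data-structure form, here is a rough sketch of one way the sentence could be encoded once arabadakiler is split into two syntactic words. The segmentation into arabada + kiler, the head assignments, and every label other than deriv are assumptions made for illustration; this is not the linked analysis and not an agreed UD annotation.

```python
# (id, form, head id, relation); head 0 marks the root of the sentence.
# Hypothetical split analysis of "Mavi arabadakiler gazete okuyor".
WORDS = [
    (1, "Mavi", 2, "amod"),      # 'blue' modifies the car, not the people
    (2, "arabada", 3, "deriv"),  # word-internal link, labelled as in the text
    (3, "kiler", 5, "nsubj"),    # the people in the car are the readers
    (4, "gazete", 5, "obj"),     # illustrative label for the object
    (5, "okuyor", 0, "root"),
]


def dependents(words, head_form, deprel):
    """Return the forms attached to head_form with relation deprel."""
    ids = {form: wid for wid, form, _, _ in words}
    return [form for _, form, head, rel in words
            if head == ids[head_form] and rel == deprel]


print(dependents(WORDS, "arabada", "amod"))
print(dependents(WORDS, "okuyor", "nsubj"))
```

With a single node for arabadakiler, the adjective and the subject relation would have to share one word, which is the ambiguity described above.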

Subordination

In Turkish, subordinate clauses are mainly formed by bound morphemes. The head of the clause gets one of about 20 suffixes, and the resulting clause functions as a noun, adjective or adverb, modifying nominals or predicates. The nominalized (subordinate) clause can get nominal suffixes (that scope over the whole clause), and can participate in all syntactic relations a noun (or adjective or adverb) can participate in. The head of the clause (a verb or a nominal predicate) will typically have its own subject, object, or nominal/adverbial modifiers within the subordinate clause. So, parts of the word participate in different syntactic relations.

Here is a simple example:

Ali bunu     onun    yaptığını      görmedi
Ali this-ACC she-GEN do-PAST-VN-ACC see-NEG-PAST
`Ali did not see (that) she did this.'

In this example the fourth word yaptığını starts its life as a verb, takes a direct object (this), and a subject (she). After the verbal-noun suffix, it behaves as a noun, takes the accusative suffix, and acts like the direct object of the last verb.

You can see two alternative analyses here. The first is roughly how the METU-Sabancı treebank analyzes it: the whole clause is analyzed as an ordinary noun (phrase). The second is an alternative that in some ways looks more UD-like to me, but it also makes recovering some of the information (e.g., the type of the relation between the subordinate clause and the main predicate) somewhat difficult.

Copula

Forming nominal predicates through copular suffixes is very common in Turkish. In the conventional analysis, copular markers introduce a new IG. Verbal features like tense, aspect, person/number agreement are marked on the copula, while nominal features are marked on the (initial) noun IG.

Again, here is an example (suggested analysis is the first one here):

Onun tutkusu       spor   arabalardı.
His  passion-POS3S sports car-PLU-CPL-PAST-3SG
`his passion was sports cars'

Here, too, parts (IGs) of the word have different/conflicting features, and also participate in different syntactic relations. The noun (car) is plural, while the copula holds a singular agreement marker. The noun gets into a compounding relation with the previous noun, and the predicate IG has a subject.

Things get even more complicated since the nominal that becomes a predicate can initially be a verb that is nominalized by a suffix discussed in the subordination section above. I am copying the earlier example below:

Sorun     tarafların         konuşmamasıydı
sorun<N>  taraf<N><pl><gen>  konuş<V><neg>|<vn:inf><N><p3p>|<0><V><cpl:past><3s>
problem   side-PL-GEN        talk-NEG-3PL-VN-CPL-PAST-3SG
`the problem was (that) the parties (were) not talking (to each other)'.

Here the last word contains two predicates. The predicates have conflicting features (subject agreement), and they both have their own subjects (see second analysis here).

There are further complications like auxiliaries, where attaching them to a copula IG makes more sense than attaching them to the noun (even when we keep noun as the head of the clause).

One last complication is the `zero copula' that started this discussion. If we do not mark it, it is not possible to distinguish sentence fragments from full sentences. For example, the phrase 'Ali öğretmen' could be analyzed as a fragment (NP) 'Teacher Ali' or as a full sentence 'Ali is the/a teacher'.

Other verbal suffixes

There are a few more verbal suffixes which seem to make a lot more sense if they introduce IGs. Examples for all of them here would make this response even longer. Documenting all of them in separate, dedicated documentation may be a better idea (I intend to do that soon), where we can also discuss the points that seem unclear or not so UD-like. Here, I will only bring up one of them.

The causative is listed as a morphological feature in current UD documentation, and also in most grammar books. In the Turkish NLP literature so far, it is treated as a "derivation". Reasons include

The verb and its causative form can have different subjects/objects/modifiers
It alters the valence of the verb (an intransitive verb becomes transitive)
Like -ki above, it can be repeated

Here is a quick example:

Babam         arabayı  Ali  ustaya      yaptırmış
father-POS1S  car-ACC  Ali  master-DAT  fix-CAUS-EVID
`My father made master Ali fix the car'

Here the person who fixes the car and the person who makes the other fix the car are different. I think a correct analysis needs two subjects, both with heads within the same word (example here). One can collapse the IG and mark 'Ali usta' (the fixer) as some sort of modifier, but I think the correct solution is to mark a subject as a subject.

As a closing example, here are two related sentences to demonstrate double causative.

Biz Düzce'yi       hoplatırız
We  Düzce-ACC    jump-CAUS-1PL
`we cause the Düzce (a city) to jump' or `we shake the city (we will dance, city will shake)'

Biz Düzce'yi   hoplattırırız
We  Düzce-ACC  jump-CAUS-CAUS-1PL
`we cause someone to make the city to jump' or `(we will sing, people will dance, city will shake)'

Multiple causative markers are rare, and in quite a few cases a double causative means only a single causative, but examples like the one above do exist.

Besides causative, other voice suffixes and a number of suffixes that make "compound verbs" are typically considered introducing new IGs. I think we can treat reflexive and reciprocal suffixes as standard derivation (no new IGs), but there may be good cases why passive should also introduce a new IG.

Summary

I hope the above is enough to justify the need for IGs, and that we agree that sub-word units are fine in UD. For the rest, we can discuss individual cases and their solutions, probably in separate threads, as I feel this one is becoming too long and unfocused.

I am trying to do some experimental annotation (a couple of random sentences every day) and note the problems I encounter. I intend to draft some language-specific documentation, which may stimulate more discussion.

And a last note: all examples are real (from a large web corpus) but I altered some of them slightly for clarity.

jnivre commented 9 years ago

Thanks for providing this careful summary. I do think there is compelling evidence to treat some of the traditional IGs as words (in the UD sense). Exactly where to draw the line will have to be determined by working out specific guidelines for different constructions. If we all agree on this, then I think this issue can be closed (and possibly more specific ones be opened, as suggested by @coltekin).

Note though that if IGs are treated as independent (syntactic) words, then they don't necessarily have to be dependents of their "hosts" (and in the case of copulas, if I understand correctly, they shouldn't be). Moreover, they should have real dependency relation, not a dummy relation like "deriv". If it is not possible to assign a real dependency relation to them, it is probably an indication that they shouldn't be treated as independent words.

I am tentatively closing this issue. Feel free to reopen it, if you disagree.

gulsenceb commented 9 years ago

Hi all,

We wanted to reopen this issue before the Uppsala meeting to bring it to your attention.

We apologize for joining this discussion so late. As the ITU NLP Team, we have been working on a new Turkish dependency formalism (Sulubacak, U., & Eryigit, G., 2014). We reannotated the existing Turkish treebank and annotated a brand new Turkish Web2.0 Treebank according to this new formalism (Pamay, T., Sulubacak, U., Torunoglu-Selamet, D., & Eryigit, G., 2015). This summer, we finally started mapping to UD. We have almost finalized the mapping for the dependency types by strictly following the current guidelines (although we agree with the ongoing discussions that coordination structures should be specialized). We want to add our opinions to the discussion of morphology, which we believe is crucial for Turkish dependency parsing studies. As Prof. Nivre pointed out earlier in this thread, the benefit of using sub-word units (IGs) has been demonstrated in many studies (Eryiğit, G., & Oflazer, K. 2006, Eryiğit et al. 2006 & 2008). We believe that their representation in the UD formalism is very important for making the framework usable for our language.

Before starting the discussion and giving you further details with examples, we want to say that the current state-of-the-art Turkish morphological analyzer (available via http://tools.nlp.itu.edu.tr (Eryiğit 2014), which currently serves more than 70 researchers) eliminates the unnecessary IG boundaries, thus reducing the total number of such derivations.

Further Details

Firstly, to deal with the empty space problem for sub-word units (which was discussed in previous posts above), we proposed the solution below in 2013 (Sulubacak, U., & Eryigit, G., 2013). In this representation, we have shown that the form and lemma columns of the IGs from the original word do not have to be empty, as we have removed zero morphemes.

13  _   sağlam Adj Adj _   14  DERIV
14  _   _   Verb    Become  _   15  DERIV
15  _   _   Verb    Caus    _   16  DERIV
16  _   _   Verb    Pass    Pos 17  DERIV
17  sağlamlaştırılmasının _ Noun  Inf2  A3sg|P3sg|Gen 18  POSSESSOR

13  sağlam sağlam Adj Adj _   14  DERIV
14  sağlamlaş sağlam Verb    Become  _   15  DERIV
15  sağlamlaştır sağlamlaş Verb    Caus    _   16  DERIV
16  sağlamlaştırıl  sağlamlaştır Verb    Pass    Pos 17  DERIV
17  sağlamlaştırılmasının _ Noun  Inf2  A3sg|P3sg|Gen 18  POSSESSOR

Previously, as seen in the METU-Sabancı Treebank, the analysis produced a large number of IGs. The state-of-the-art analyzer does not generate needless IGs, so the resulting structures are not as complex. IGs are still needed, however, as seen below:

1   sağlamlaştırıl  sağlamlaş Verb    Caus    Pass|Pos        
2   sağlamlaştırılmasının sağlamlaştır Noun    Inf2    A3sg|P3sg|Gen

Since you already discussed the example okunmuş kitap, below we provide its analysis with our new framework which does not produce zero derivation:

oku+Verb+Pass+Pos^DB+Adj+NarrPart
kitap+Noun+A3sg+Pnon+Nom

The word okunmuş contains two IGs, the first of which has the POS tag VERB, the second ADJ. Instead of considering the suffix –muş as reported past tense we have considered it a derivational suffix, making an adjective out of a verb.
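As a rough illustration of how such an analysis string maps to IGs, the following Python sketch splits it at the `^DB` marker. The function name `split_igs` and the dictionary layout are assumptions made for this example, and the sketch does not describe how the ITU analyzer itself works.

```python
def split_igs(analysis: str):
    """Split a morphological analysis string into inflectional groups (IGs).

    Assumes '^DB' separates IGs and '+' separates tags, as in the
    strings shown in this comment.
    """
    first, *rest = analysis.split("^DB")
    lemma, *tags = first.split("+")
    igs = [{"lemma": lemma, "pos": tags[0], "tags": tags[1:]}]
    for chunk in rest:
        # Later IGs start with '+', so drop the empty leading piece.
        parts = [p for p in chunk.split("+") if p]
        igs.append({"lemma": None, "pos": parts[0], "tags": parts[1:]})
    return igs

okunmus = split_igs("oku+Verb+Pass+Pos^DB+Adj+NarrPart")
kitap = split_igs("kitap+Noun+A3sg+Pnon+Nom")
```

Here `okunmus` holds two IGs (Verb, then Adj) and `kitap` holds one, matching the description above.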

As a side note: We have not included the accuracy of DERIV relations in any of our evaluations, but we may of course give valid names to these dependency relations if necessary.

We believe it would be beneficial for these to be considered at the Uppsala meeting. We would appreciate hearing your thoughts on these subjects.

-ITU NLP Team-

Memduh Gökırmak Tuğba Pamay Umut Sulubacak Gülşen Eryiğit

References

Sulubacak, U., & Eryigit, G. (2015). A Redefined Turkish Dependency Grammar and Its Implementations: A New Turkish Web Treebank & the Revised Turkish Treebank. (Under review)
Pamay, T., Sulubacak, U., Torunoglu-Selamet, D., & Eryigit, G. (2015, June). The Annotation Process of the ITU Web Treebank. In The 9th Linguistic Annotation Workshop held in conjunction with NAACL 2015 (p. 95).
Gülşen Eryiğit and Kemal Oflazer. Statistical dependency parsing of Turkish. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 89-96, Trento, 3-7 April 2006.
Gülşen Eryiğit, Joakim Nivre, and Kemal Oflazer. The incremental use of morphological information and lexicalization in data-driven dependency parsing. Computer Processing of Oriental Languages, Beyond the Orient: The Research Challenges Ahead, Springer LNAI 4285, pages 498-507, 2006.
Gülşen Eryiğit, Joakim Nivre, and Kemal Oflazer. Dependency Parsing of Turkish. Computational Linguistics, 34 no. 3, 2008.
Gülşen Eryiğit. ITU Turkish NLP Web Interface. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014). Gothenburg, Sweden, April 2014.
Umut Sulubacak and Gülşen Eryiğit. Representation of Morphosyntactic Units and Coordination Structures in the Turkish Dependency Treebank. In Fourth Workshop on Statistical Parsing of Morphologically Rich Languages (p. 129), 2013.

jnivre commented 9 years ago

Thanks, Gülsen. This seems to me to be a step in the right direction. It would be interesting if you could try to represent your analysis using tokens and words as required by the CoNLL-U format. See http://universaldependencies.github.io/docs/format.html.

Tugbapmy commented 9 years ago

Our suggestion would be as seen below:

1-2 sağlamlaştırılmasının
1   sağlamlaştırıl          sağlamlaş VERB    _   Negative=Pos|Number=Sing|Person=2|Tense=Imp|Voice=Cau|Voice=Pass
2   sağlamlaştırılmasının sağlamlaştırıl  NOUN    _   Case=Gen|Number=Sing|Person=3|Poss=Yes|VerbForm=Inf

Suffixes in Turkish are usually meaningless when used alone. If the morphological representation was made directly in the CoNLL-U format, we would need to show the suffixes on their own, causing these suffixes to be seen as if they by themselves were IGs of that word. To prevent this, we propose writing each IG with the latest derivation of the word, as seen above in our example of sağlamlaştırılmasının.

Dividing the word into its constituent root and suffixes would make it virtually impossible for the dependency parser to recognize the word. Segments such as -tırıl and -masının are just suffixes that carry no value by themselves, as seen below:

1-2 sağlamlaştırılmasının
1   tırıl         sağlamlaş VERB    _   Negative=Pos|Number=Sing|Person=2|Tense=Imp|Voice=Cau|Voice=Pass
2   masının   sağlamlaştırıl  NOUN    _   Case=Gen|Number=Sing|Person=3|Poss=Yes|VerbForm=Inf
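The convention proposed in this comment (each IG written with the surface string derived so far) can be sketched in Python as a running concatenation. The helper name `cumulative_forms` is hypothetical, and the two-segment split is taken from the two-IG example above rather than from any tool.

```python
from itertools import accumulate

def cumulative_forms(segments):
    """Return, for each IG, the surface string built up to and including it."""
    # accumulate() concatenates strings left to right with '+'.
    return list(accumulate(segments))

segments = ["sağlamlaştırıl", "masının"]  # one segment per IG
forms = cumulative_forms(segments)
```

With this input, the last entry of `forms` is the full surface word, so no line in the proposed representation shows a bare suffix.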

ftyers commented 9 years ago

(For those following by email, sorry about the post confusion. It's late and I'm half way from Kızıl to Stockholm) :)

So, my suggestion would be:

1-3 sağlamlaştırılmasının 
1   _        sağlamlaş VERB    _   Negative=Pos|Number=Sing|Person=2|Tense=Imp|Voice=Cau
2   _        ıl        _       _   Voice=Pass
3   _        ma        NOUN    _   Case=Gen|Number=Sing|Person=3|Poss=Yes|VerbForm=Inf

This gets around the problem of having a surface form which doesn't really exist. It also gets around the problem of the lemma repeating information.

dan-zeman commented 9 years ago

@Tugbapmy : At any rate, a dependency parser (or a preprocessor) will have to solve the transition between the token level and the word level. This is neither specific to Turkish, nor to this particular proposal. Once this step is solved somehow, it may actually profit from learning dependency relations between suffixes (because it may see the same suffix with another word, right?)

@ftyers : I do not like leaving the form field empty. I would prefer finding a reasonable word-like or morpheme-like string representation of the unit corresponding to the node, if at all possible. (Even that would be a shift from the current position that "there is no attempt at segmenting words into morphemes" (http://universaldependencies.github.io/docs/u/overview/tokenization.html).)

jnivre commented 9 years ago

I have not had time to follow everything in this thread, but leaving either the word form or the lemma empty does not seem compatible with the UD guidelines. I think the goal should be to find a representation of both that makes sense not only for Turkish specifically but in a multilingual perspective.
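For anyone experimenting with the proposals in this thread, a small Python check for empty form or lemma fields might look like the sketch below. It assumes whitespace-separated columns as pasted in these comments (real CoNLL-U files are tab-separated), treats `_` as "empty", and uses a hypothetical helper name, `empty_fields`.

```python
def empty_fields(lines):
    """List (word id, column name) pairs whose FORM or LEMMA is '_'."""
    problems = []
    for line in lines:
        cols = line.split()
        # Skip range lines such as "1-3 <surface token>" and short lines.
        if len(cols) < 3 or "-" in cols[0]:
            continue
        word_id, form, lemma = cols[0], cols[1], cols[2]
        if form == "_":
            problems.append((word_id, "FORM"))
        if lemma == "_":
            problems.append((word_id, "LEMMA"))
    return problems

sample = [
    "1-3 sağlamlaştırılmasının",
    "1   _        sağlamlaş VERB    _   Negative=Pos|Number=Sing|Person=2|Tense=Imp|Voice=Cau",
    "2   _        ıl        _       _   Voice=Pass",
    "3   _        ma        NOUN    _   Case=Gen|Number=Sing|Person=3|Poss=Yes|VerbForm=Inf",
]
found = empty_fields(sample)
```

On the proposal with underscores in the form column, every word line is reported for its FORM field.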

jonorthwash commented 9 years ago

It would be possible to do something like this:

1-3 sağlamlaştırılmasının 
1   sağlamlaştır   sağlamlaş VERB    _   Negative=Pos|Number=Sing|Person=2|Tense=Imp|Voice=Cau
2   ıl             ıl        _       _   Voice=Pass
3   masının        ma        NOUN    _   Case=Gen|Number=Sing|Person=3|Poss=Yes|VerbForm=Inf

Including the segmental content that is described, while not representing anything that a lay Turkish speaker would actually recognise, does have the advantage of allowing one to keep track of what's actually being described. That said, there may be times when the content would need to be blank, as there could be no segmental content corresponding to the new unit.

jnivre commented 9 years ago

This looks good to me. Note, though, that there is no requirement that the token (1-3) should be identical to the concatenation of the words (1 + 2 + 3). While this might work for Turkish, it would not work for many other languages. Even in French, for example, we have things like "du" = "de" + "le". Empty strings, on the other hand, have so far not been allowed in UD.
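The difference between the two cases can be shown with a short Python sketch. The function name `is_concatenation` is an assumption for illustration; the check only compares strings after Unicode normalization and says nothing about whether an analysis is valid.

```python
import unicodedata

def is_concatenation(token_form, word_forms):
    """Check whether the surface token equals its word forms joined together."""
    def norm(s):
        # Normalize so composed and decomposed characters compare equal.
        return unicodedata.normalize("NFC", s)
    return norm("".join(word_forms)) == norm(token_form)

turkish = is_concatenation("sağlamlaştırılmasının", ["sağlamlaştır", "ıl", "masının"])
french = is_concatenation("du", ["de", "le"])
```

`turkish` is `True` and `french` is `False`, so a validator that required concatenation would reject the French contraction.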

gulsenceb commented 9 years ago

The decision of what makes an IG and what is the best representation for these is still an open research topic for Turkic languages as may be seen from discussions in many issues. I believe that the Universal Dependency Scheme should be flexible in order to allow further research in this field. According to my experience both on manual annotation and statistical parsing stages, adding a new sub-word unit (IG) for each new voice in a word makes the syntactic representation and parsing unnecessarily complex (and reduces the success of parsers and human annotators on the task). So, I support the suggestions for allowing multiple voices in a single line (#197) and keep the number of sub-word units as small as possible. In this case, @jonorthwash’s suggestion could be rewritten as the following.

@jonorthwash’s suggestion

1-3 sağlamlaştırılmasının 
1   sağlamlaştır   sağlamlaş VERB    _   Negative=Pos|Number=Sing|Person=2|Tense=Imp|Voice=Cau
2   ıl             ıl        _       _   Voice=Pass
3   masının        ma        NOUN    _   Case=Gen|Number=Sing|Person=3|Poss=Yes|VerbForm=Inf

updated

1-2 sağlamlaştırılmasının
1   sağlamlaştırıl  sağlamlaş VERB    _   Negative=Pos|Number=Sing|Person=2|Tense=Imp|Voice=Cau|Voice=Pass
2   masının         ma        NOUN    _   Case=Gen|Number=Sing|Person=3|Poss=Yes|VerbForm=Inf

Instead of two Voice values (Voice=Cau|Voice=Pass) in the feature column, we may also use a + sign like this: Voice=Cau+Pass. Any other separator is welcome as well, though the comma is already used for ambiguity.
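Both notations carry the same information, as the following Python sketch suggests. It reads a feature string into a mapping from feature name to a list of values. The function name `parse_feats` and the choice of lists are assumptions for this example, not part of the CoNLL-U specification.

```python
def parse_feats(feats):
    """Parse 'Key=Val|Key=Val' into {key: [values]}, keeping repeated keys.

    Accepts both repeated keys (Voice=Cau|Voice=Pass) and '+'-joined
    values (Voice=Cau+Pass).
    """
    parsed = {}
    for pair in feats.split("|"):
        key, value = pair.split("=", 1)
        values = parsed.setdefault(key, [])
        for v in value.split("+"):
            if v not in values:
                values.append(v)
    return parsed

repeated = parse_feats("Negative=Pos|Number=Sing|Person=2|Tense=Imp|Voice=Cau|Voice=Pass")
joined = parse_feats("Negative=Pos|Number=Sing|Person=2|Tense=Imp|Voice=Cau+Pass")
```

Both calls yield the same mapping, with `Voice` holding `["Cau", "Pass"]` in suffix order.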

For the lexicalization of these, I suggest providing the lemma of the actual word in the lemma field of the first IG and the surface of the whole word in the surface field of the last IG as was the tradition since the beginning (see below). But the updated example above is also acceptable to us.

1-2 sağlamlaştırılmasının
1   sağlamlaştırıl          sağlamlaş VERB    _   Negative=Pos|Number=Sing|Person=2|Tense=Imp|Voice=Cau+Pass
2   sağlamlaştırılmasının sağlamlaştırıl  NOUN    _   Case=Gen|Number=Sing|Person=3|Poss=Yes|VerbForm=Inf

jnivre commented 9 years ago

Thanks, Gülsen. We had a very good discussion about this in Uppsala on Sunday, and I am sure that we will be able to work something out. We also have the mechanism of layered features to handle cases where the same word encodes the same feature more than once.
