UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Turkish "attributive" derivation marker causing problems in NPs with an adjective. #787

Open TalhaBedir opened 3 years ago

TalhaBedir commented 3 years ago

Turkish has a derivational suffix -lI that is generally referred to as the "attributive suffix" in the literature. It is highly productive:

(1) a. Sos-lu makarna
       sauce-ATTR pasta
       'pasta with sauce'

    b. Çekmece-li dolap
       drawer-ATTR wardrobe
       'wardrobe with drawers'

But it is also used in relatively fixed forms:

(2) a. Önem-li konu
       importance-ATTR subject
       'important subject'

    b. Emek-li öğretmen
       labor-ATTR teacher
       'retired teacher'

Its negative counterpart -sIz, which roughly means "without", can negate the examples in (1) but generally not those in (2).

Currently, since these are derivational, we include the full forms in lemmas, without any split. In that case soslu "with sauce" and çekmeceli "with drawers" in (1a,b) are tagged ADJ in UPOS and attached with the amod relation:

(3) soslu makarna
  amod(makarna, soslu)

However, when an adjective modifies soslu "with sauce", or any other denominal adjective of this sort, we have a situation like this:

(4) acı soslu makarna
  hot sauce.ATTR pasta
  'pasta with hot sauce'
  amod(makarna, soslu)
  amod(soslu, acı)

This situation, for me, is troubling for two reasons:

  1. In Turkish, as in most languages, adjectives are modified by adverbs, not by other adjectives.
  2. acı "hot" in example (4) actually modifies sos "sauce" rather than soslu "with sauce", which means that -lu is really a suffix on the whole NP acı sos "hot sauce" rather than on sos "sauce" alone.

Therefore I do not think the current annotation, that is (4), is doing this structure justice.
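
The first concern can be made concrete with a small check. What follows is only a minimal sketch and not part of any UD tooling: the `Token` tuple and the `adj_modifying_adj` function are hypothetical names, and the tree is a hand-written encoding of (4), assuming acı and soslu are tagged ADJ and makarna is tagged NOUN.

```python
from typing import NamedTuple

class Token(NamedTuple):
    id: int       # 1-based position in the sentence
    form: str
    upos: str
    head: int     # id of the head token, 0 for the root
    deprel: str

# Hand-written encoding of example (4); not taken from a treebank.
sentence = [
    Token(1, "acı", "ADJ", 2, "amod"),
    Token(2, "soslu", "ADJ", 3, "amod"),
    Token(3, "makarna", "NOUN", 0, "root"),
]

def adj_modifying_adj(tokens):
    """Return (dependent, head) form pairs where an ADJ is attached as amod to an ADJ."""
    by_id = {t.id: t for t in tokens}
    return [
        (t.form, by_id[t.head].form)
        for t in tokens
        if t.deprel == "amod"
        and t.upos == "ADJ"
        and t.head in by_id
        and by_id[t.head].upos == "ADJ"
    ]

print(adj_modifying_adj(sentence))
```

Run on (4), the check reports the edge from acı to soslu, which is exactly the adjective-modifying-adjective configuration described in point 1.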

dan-zeman commented 3 years ago

I find it quite adequate to treat the -lI form as an adjective derived from a noun, despite the shortcomings it has.

I think it would be possible to use advmod(soslu, acı) instead of amod, while keeping the ADJ tag for acı.

The suffix does in fact modify the whole phrase acı sos but that seems to be the nature of agglutinating languages like Turkish, and isolating the suffix as a “syntactic word” would not help much (while complicating the processing) because we do not have relations for an “adjectivizing construction”. If I am not mistaken, the case suffixes behave similarly (also affecting the whole nominal while attaching only to the head noun), so one could theorize about a new morphological case in Turkish, but I think it would be better to keep treating this process as derivational.

rueter commented 3 years ago

Example (4) looks very interesting from a Komi-Zyrian perspective. My question is whether the "attributive suffix" can be added after plural markers as well?

(4) acı soslu makarna
  hot sauce.ATTR pasta
  'pasta with hot sauce'
  amod(makarna, soslu)
  amod(soslu, acı)

In other words, would it be possible to say 'pasta with hot sauces' by saying the hypothetical acı soslarlu makarna? In Komi-Zyrian this is possible:

(5) гырысь позянлунъяса страна
  gïrïś poźanlunjasa strana
  great possibility.Plur.ATTR country
  'A country with great possibilities'
  nmod(strana, poźanlunjasa)
  amod(poźanlunjasa, gïrïś)

We have chosen to call this an NP head marker (case marker) almost entirely limited in range to the adnominal phrase. If there is no regular number variation, then I would stay away from our move.

TalhaBedir commented 3 years ago

I actually just realized that neither -lI nor -sIz permits a plural suffix inside it, which is very interesting since the example you have provided otherwise gives results extremely similar to the Turkish ones.

I did not do any research on this at all, but I guess it might be due to different properties of affixes or due to the fact that Turkish nouns are number-neutral in their bare forms:

(6) Kütüphane-den kitap al-dı-m
    library-ABL   book  take-PAST-1s
  'I have taken a book/books from the library.'

It could be one book or 100 books; it doesn't matter. Any number reading is available here. Therefore, it might be that the root has to stay in this bare form in order to undergo ATTR derivation without crashing.

ftyers commented 3 years ago

> I find it quite adequate to treat the -lI form as an adjective derived from a noun, despite the shortcomings it has.
>
> I think it would be possible to use advmod(soslu, acı) instead of amod, while keeping the ADJ tag for acı.
>
> The suffix does in fact modify the whole phrase acı sos but that seems to be the nature of agglutinating languages like Turkish, and isolating the suffix as a “syntactic word” would not help much (while complicating the processing) because we do not have relations for an “adjectivizing construction”. If I am not mistaken, the case suffixes behave similarly (also affecting the whole nominal while attaching only to the head noun), so one could theorize about a new morphological case in Turkish, but I think it would be better to keep treating this process as derivational.

What would then be done with something like,

(7) dört odalı ev
     four room-with house

Would you have

nummod(odalı-ADJ, dört-NUM)
amod(ev-NOUN, odalı-ADJ)

Does the validator allow ADJ to have NUM dependents?

How about:

(8) Rüyada çok odalı bir evim varmış.
     dream-LOC much/very room-with one house-my exists-PAST.EVID
"In my dream I had a house with many rooms."

çok here means "many/much", but it also means "very" (çok büyük - very big). At the moment in the treebank, the first reading is given with ADJ and det, the second with ADV and advmod, although this isn't very consistent, as with mst-0617 Buzlu ve çok sodalı. and mst-0771 Diğer çok kaliteli pilotların, subayların olayda ölmüş olması çok önemli bir konuydu.

This is also similar to the nominal -ed construction in English, e.g.

But not as much like e.g. "tree lin-ed street" or "grass cover-ed hill".

dan-zeman commented 3 years ago

> Does the validator allow ADJ to have NUM dependents?

I believe it does. It should because such configuration can also occur as a result of noun ellipsis and promotion of the adjective to the head. The Turkish examples in this thread are different because there is no ellipsis but I think they deserve the same treatment, as we do not have a special set of relations for modifiers of adjectives.

If the above is accepted, then it seems straightforward to also accept çok tagged DET and attached as det to the adjective odalı in (8). But it would also deserve to be described and exemplified in the Turkish-specific documentation, as it is an interesting and peculiar construction, and without explanation the annotation may be surprising to users.
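
Taking the simpler example (7), the annotation under discussion can be written out in CoNLL-U form. This is a minimal sketch, not validator output: `to_conllu` is a hypothetical helper, the rows encode the nummod/amod analysis suggested above, the lemma odalı follows the full-form lemma practice described at the top of the thread, and ev is marked as root only because the phrase is shown in isolation.

```python
# Rows are (form, lemma, upos, head, deprel); head is a 1-based index, 0 = root.
rows = [
    ("dört", "dört", "NUM", 2, "nummod"),
    ("odalı", "odalı", "ADJ", 3, "amod"),
    ("ev", "ev", "NOUN", 0, "root"),
]

def to_conllu(rows):
    """Serialize rows as 10-column CoNLL-U lines, leaving unused columns as '_'."""
    lines = []
    for i, (form, lemma, upos, head, deprel) in enumerate(rows, start=1):
        # Column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(i), form, lemma, upos, "_", "_", str(head), deprel, "_", "_"]
        lines.append("\t".join(cols))
    return "\n".join(lines)

conllu = to_conllu(rows)
print(conllu)
```

The same layout would carry the çok/DET/det variant of (8) by changing the first row; the remaining words of (8) are left out here because their annotation is not discussed in this thread.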

coltekin commented 3 years ago

Joining a bit late, but a few additional remarks:

The problems noted above become more difficult because some of these "derived" forms are lexicalized. evsiz 'homeless' is likely lexicalized: in its normal use you cannot modify ev 'house', since the word normally refers to a person. However, it is also possible (but not very likely) to have a sentence like Müstakil evsiz yapamam 'I cannot do without a standalone house'. Here, I'd be happy to treat these suffixes as case suffixes (although I do not know any linguist who calls these case markers); after all, Müstakil ev-de yaşıyor 'S/he lives in a standalone house' is not very different. However, there are cases where analysis gets tricky with these suffixes. Modifying one of the examples above,

1. üç        çekmece-li  dolap
   three     drawer-ATTR wardrobe
   'wardrobe with three drawers'
2. üç        çekmece-li-yi          ben aldım
   three     drawer-ATTR-ACC        I   took-PAST-1SG
   'I took the one  with three drawers'

In both cases the numeral modifies the NOUN inside the adjective. Since there is no ambiguity in (1), we may be happy with nmod(çekmeceli/ADJ, üç/NUM) - not really standard or elegant but we can assume that there is some internal structure of the word and the numeral modifies a sub-part. However, this becomes ambiguous, since any ADJ in Turkish can be used as a noun indicating an object with the property specified by the adjective (this is somewhat similar to the case of head promotion). This is what is happening in (2). Here, without segmenting the word, there is no way (I can think of) that tells whether there are three drawers, or three wardrobes. You can check a few additional (real-world) examples here.

In the "annotation guidelines" I am aware of (GB, BOUN, IMST, and even TR-DE SAGT), -lI and -sIz are segmented "if they are not lexicalized". The result is not very consistent. What is 'lexicalized' is generally a difficult decision for the annotators, and reasonable segmentation is also not easy for an automatic method. I agree that we need a better solution for these, but I do not expect to arrive at a good one soon. A good solution should also make sure that we cover the same issues in other Turkic languages, and possibly others like Komi (as noted above) which probably have similar cases. In the short term, I think it would be best to be as compatible with the current treebanks as possible.

ftyers commented 3 years ago

I agree we should maintain what we have at the moment until something actually better comes up. I also agree that the lexicalised/non-lexicalised boundary is extremely difficult to draw, and in reality if we have to draw it without reference to something concrete "in the sentence" (e.g. with modifiers -- split, without -- don't split) then it will tend towards arbitrariness.

As an aside, this example also works with the -ed in English "I took the three-drawered one", but it is a lot more productive in Turkish than in English.

Stormur commented 3 years ago

Acknowledging that this is a process in Turkish to turn nominal phrases into attributes (i.e. to make something that we call NOUN function as an ADJ), I would propose to simply annotate them as NOUNs and nmods, while marking this special attributive form as a morpholexical feature (something like Form=Attributive, or similar). For completely lexicalised and crystallised terms like the mentioned evsiz, an analysis as ADJs would be justified, while still keeping the Form=Attributive mark to point to its transparent origin.

This would solve all problems of awkward "internal" dependencies: they are completely natural inside an nmod. Besides, this would be a natural parallel with other (or even the same) languages using different strategies, for example nominal dependents introduced by prepositions. It just happens that Turkish uses a suffix (and I would not agree on tokenising it separately, since it is clearly fused into the word, as e.g. vowel harmony shows).

By the way, even if I am proposing a morphological treatment, I too am against treating the -lI derivation as a case: probably I cannot explain myself well enough, but I think this is one morphological tool that the language has to "change the word class" of a phrase, and not to express that phrase's role in the sentence (which would be a case).
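
As a rough illustration of the NOUN/nmod proposal applied to example (4), here is a minimal sketch. The function `to_attributive_noun` and the dictionary layout are hypothetical, the attributive tokens are listed by hand rather than detected, and `Form=Attributive` is only the tentative feature name suggested above, not an existing UD feature.

```python
def to_attributive_noun(tokens, attributive_ids):
    """Re-tag the listed tokens as NOUN/nmod with the tentative Form=Attributive feature."""
    out = []
    for tok in tokens:
        tok = dict(tok)  # copy so the input annotation is not mutated
        if tok["id"] in attributive_ids:
            tok["upos"] = "NOUN"
            if tok["deprel"] == "amod":
                tok["deprel"] = "nmod"
            tok["feats"] = "Form=Attributive"
        out.append(tok)
    return out

# Current annotation of (4): acı soslu makarna 'pasta with hot sauce'.
current = [
    {"id": 1, "form": "acı", "upos": "ADJ", "head": 2, "deprel": "amod", "feats": "_"},
    {"id": 2, "form": "soslu", "upos": "ADJ", "head": 3, "deprel": "amod", "feats": "_"},
    {"id": 3, "form": "makarna", "upos": "NOUN", "head": 0, "deprel": "root", "feats": "_"},
]

proposed = to_attributive_noun(current, attributive_ids={2})
for tok in proposed:
    print(tok["form"], tok["upos"], tok["head"], tok["deprel"], tok["feats"])
```

After the change, acı is an ADJ attached as amod to a NOUN, so the dependency inside the phrase no longer looks unusual; a lexicalised form such as evsiz would simply be left out of `attributive_ids` for re-tagging and handled separately.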