UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

when to annotate `compound` versus `obj` #1013

Open jonorthwash opened 4 months ago

jonorthwash commented 4 months ago

The documentation on compound states that compound isn't needed just because the meaning is lexicalised or idiomatic, giving examples like make a decision.

However, the documentation on compound:lvc gives the example çile çektiler - literally "they endured suffering". This seems fairly compositional and non-idiomatic (unless you translate the verb more generally, e.g. "pulled suffering"). Why wouldn't an example like this be annotated with obj?

More generally, how can we tell when to use compound, and especially compound:lvc? The other two Turkish examples for compound:lvc seem reasonably like light verb constructions, since the verb et- really conveys very little lexical information in those examples. Is the criterion then about semantic content of the verb as compared to e.g. an object?

We're considering English examples like make money, make a decision, give permission, place an order (which feel like a continuum from more LVC-like to less LVC-like) as well as Kyrgyz examples like буюртма бер ("place an order", literally "give an order") as compared to уруксат бер ("give permission"), which differ in that the noun in the former one does not inflect and cannot have dependents, whereas the noun in the latter example can inflect and take dependents (*буюртмамды бердим, уруксатымды бердим). We are also contrasting these with idiomatic expressions consisting of a subject and a verb (e.g., башым айланды "I got dizzy", literally "my head spun").

The question is how to know whether/when to annotate such constructions literally (obj, nsubj) or as compounds (compound, compound:lvc).

dan-zeman commented 4 months ago

I am afraid that the borderline is not very well developed in the UD guidelines. My impression is (disclaimer: at the moment I don't have time to re-read the guidelines, so the impression may be wrong) that we essentially said that if people want to annotate LVCs as something special, they can go with compound:lvc, which should not be used in English, but there are languages where LVCs are much more prominent in the grammar and this might be a solution.

Obviously, this would not be / is not a good guideline when to use it and when to stick with the normal obj solution.

I completely agree with the part that says that semantic compositionality vs. idiomaticity should not decide about UD syntactic relations. Which probably means that many such "compounds" are used wrongly and should be obj even in languages where LVCs are frequent (e.g. many Indo-Iranian languages). But as soon as morphology or syntax starts to behave differently from standard object-verb constructions, we can consider a different analysis, such as compound:lvc. So I guess that in your Kyrgyz examples, буюртма would be attached to бер as compound:lvc, while уруксат would be obj.

coltekin commented 4 months ago

I think the concept of light verb construction inherently relies on the verb being semantically light, and the weights of the verbs in these constructions are definitely graded. So, in general, it is really difficult to identify an intransitive :lvc without appealing to semantics. If the verb construction is transitive, then we can tell it with certainty. The Turkish example tercih et- 'to prefer' in compound:lvc is a good example: it takes an accusative object. Not marking it as compound would require having two objects (and, clearly, this is not a double-object construction). On the other hand, analyzing tercih et- as compound:lvc while not treating telefon et- 'to phone' as a compound at all (it does not take an accusative object; its argument is dative, but there is no need to go into the argument/adjunct discussion) would likely be confusing for the annotators. :lvc is an optional, language-specific extension. So, it should be fine to appeal to semantics, as with other common examples (e.g., :tmod, :arg, :agt for obl). I think the main issue is determining their compoundhood with certainty.
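The transitivity argument above can be sketched as a simple check. This is only an illustration, not a UD validator rule: the sentence "Ali çayı tercih etti" ('Ali preferred the tea') is a constructed example rather than treebank data, and the tuple layout and the function name `heads_with_multiple_objects` are assumptions made for this sketch.

```python
from collections import defaultdict

# Each token is (id, form, head, deprel). Constructed example, not treebank data.
# Analysis A: the noun "tercih" is treated as an ordinary object of "etti".
analysis_obj = [
    (1, "Ali", 4, "nsubj"),
    (2, "çayı", 4, "obj"),      # accusative-marked object
    (3, "tercih", 4, "obj"),    # the noun of the light verb construction
    (4, "etti", 0, "root"),
]

# Analysis B: "tercih" is attached as compound:lvc instead.
analysis_lvc = [
    (1, "Ali", 4, "nsubj"),
    (2, "çayı", 4, "obj"),
    (3, "tercih", 4, "compound:lvc"),
    (4, "etti", 0, "root"),
]

def heads_with_multiple_objects(tokens):
    """Return ids of heads that govern more than one obj (subtypes included)."""
    counts = defaultdict(int)
    for _id, _form, head, deprel in tokens:
        if deprel.split(":")[0] == "obj":
            counts[head] += 1
    return sorted(head for head, n in counts.items() if n > 1)

print(heads_with_multiple_objects(analysis_obj))
print(heads_with_multiple_objects(analysis_lvc))
```

Under analysis A the verb ends up with two obj dependents, which is the situation the comment describes as undesirable outside true double-object constructions; under analysis B it does not. The check says nothing about intransitive cases such as telefon et-, which is exactly where the comment says semantics has to be consulted.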

Returning to the compound definition, I find it quite confusing. I think the guidelines want to restrict the choice based on morphosyntactic criteria, but allow some "language specificity", and it is not clear - at least to me - what exactly makes the positive examples in the documentation positive. From the examples, I get that noun-noun combinations without case marking should be compounds - at least for English. However, I also do not see 3 million dollar loan in the example having any level of non-compositional structure, which is given as the reason for the MWE relations including compounds. I think there is a need for a general rule/guideline here for determining what makes a compound more universally.

Perhaps more important for the current thread, the documentation is mainly focused on noun compounds, not saying much about verbs (except referring to language-specific documentation). It is clear that there are some cross-linguistically similar cases. So, it would be nice to put these together as much as possible, and include in the universal documentation.

jonorthwash commented 4 months ago

But as soon as morphology or syntax starts to behave differently from standard object-verb constructions, we can consider a different analysis, such as compound:lvc. So I guess that in your Kyrgyz examples, буюртма would be attached to бер as compound:lvc, while уруксат would be obj.

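I wonder then about why this isn't the case in English, e.g.:

place an order: placed my order, placed ten orders
give an order: ?gave my order, ?gave ten orders (I've found examples of the latter, but they feel a little weird to me)

In a world where those '?'s indicate that such expressions are disallowed, then by your suggestion, wouldn't give an order have order as a compound:lvc dependent of give?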

nschneid commented 4 months ago

IMO the strongest case for compound:lvc in English would be "have no idea" and similar idioms, because they can occur with a second NP whose function is difficult to classify if "idea" is an object: "I had no idea the time". UniversalDependencies/UD_English-EWT#444 That is, they are LVCs that trigger bizarre syntax. There are only a few expressions like this, though.

Otherwise I'm not sure there is much to be gained by applying compound:lvc in English as part of the syntax. There are all sorts of (semi)idiomatic multiword expressions that are beyond the scope of UD (but the PARSEME project annotates them as a separate layer). Those expressions may resist alternations that are characteristic of fully productive syntax, like modification and passivization (*the bucket was kicked is not a variant of the idiom kick the bucket)—but those alternation tests are used to describe the prototype of the syntactic function, and need not hold true of every single instance.

Taking a global perspective, I see your point that the definition of compound is a bit squishy, but this is probably necessary as different languages have different productive constructions that verge on creating complex lexical items, and are somewhere in between the rest of syntax and morphology. It is related to the issue of tokenization, where we are certainly not close to having a clear-cut universal standard.

Savary et al. 2023 discuss UD and MWEs, and suggest that the :lvc subtype could be removed from UD and relocated to an MWE annotation layer.

jonorthwash commented 4 months ago

There are quite a few types of N+V "compound" in Kyrgyz (and here I use the term to mean that the resulting meaning is not quite what one would expect if it were exactly semantically compositional, although this call is maybe hard to make in some instances, and it certainly isn't the deciding factor in UD).

Here's a brief typology I came up with some years ago, with some examples:

N with possessive morphology

N is 3rd person subject

Most of these are conjugated in 3rd person; English subject often corresponds to possessor of noun, overt with genitive case.

N is definite object

seem to be either causatives of above or just compositional?

N is in other cases

N without possessive morphology

"N" is mostly limited in use to compound

"N" is ideophone

N is indefinite object

N is in other cases

Would we want Kyrgyz-specific guidelines for dealing with these, that involve compound relationships for some and not others?

jonorthwash commented 4 months ago

those alternation tests are used to describe the prototype of the syntactic function, and need not hold true of every single instance.

Got it, makes sense.

Taking a global perspective, I see your point that the definition of compound is a bit squishy, but this is probably necessary as different languages have different productive constructions that verge on creating complex lexical items, and are somewhere in between the rest of syntax and morphology. It is related to the issue of tokenization, where we are certainly not close to having a clear-cut universal standard.

It would be nice to know what the intention is / what the guidelines are, even if there's no exact definition currently.

nschneid commented 4 months ago

Potentially there could be Kyrgyz-specific criteria, or perhaps they could just be annotated as objects if there is not anything that makes them strikingly different morphosyntactically from regular (nonidiomatic) objects.

compound:lvc is a subtype that some languages have chosen to adopt, but there is nothing in the UD guidelines (that I know of) that actively recommends it for LVCs in new languages. The subtyping mechanism is mainly designed to give flexibility to treebank developers. A few subtypes are "semi-mandatory" (see here) but :lvc is not one of them. So, unless there is a pressing need to do something syntactically special with LVCs, I would probably not be in a hurry to use it.

sylvainkahane commented 4 months ago

I wonder then about why this isn't the case in English, e.g.:

place an order: placed my order, placed ten orders
give an order: ?gave my order, ?gave ten orders (I've found examples of the latter, but they feel a little weird to me)

In a world where those '?'s indicate that such expressions are disallowed, then by your suggestion, wouldn't give an order have order as a compound:lvc dependent of give?

Both are LVCs but with different senses of ORDER. I think that the contrast comes from the properties of these two different senses (ORDER2 is less countable). From the syntactic point of view, both constructions are obj without any discussion.

I really agree with @dan-zeman and @nschneid that compound:lvc must be used when the LVC shows particular properties of cohesion, which suggest that it is no longer an object construction. The presence of another object, in a language where double object constructions are not regular constructions, is one of the possible justifications for introducing compound:lvc. More generally, compound should be used for particular constructions that show a high level of syntactic cohesion and cannot easily be described with other relations.

In the French treebanks, we have introduced the relation obj:lvc for one reason: we consider that it is a regular obj relation but we wanted to indicate that in this case the complement of the (predicative) noun must be considered as the complement of a verb construction and not as a noun complement. Example:

Bill a besoin de sous. 'Bill needs money.', lit. Bill has need of money. obj:lvc(a,besoin) obl:arg(besoin,sous), rather than nmod(besoin,sous)

It is a case where MWE annotation and syntactic annotation intersect, at least with the UD annotation scheme, which distinguishes obl and nmod.

nschneid commented 4 months ago

@sylvainkahane Interesting—I see the problem but must confess I find it counterintuitive to have obl:arg(besoin/NOUN, sous). From a basic UD perspective sous has to be either adnominal or adverbial, and this feels like it is trying to have it both ways (adnominal attachment but adverbial deprel). I think a noun is supposed to have obl dependents only if it stands for a clause, e.g. in a copular clause or when it is promoted due to ellipsis.

jonorthwash commented 3 months ago

Bill a besoin de sous. 'Bill needs money.', lit. Bill has need of money. obj:lvc(a,besoin) obl:arg(besoin,sous), rather than nmod(besoin,sous)

In my current thinking, I'm very sympathetic to something like this approach, except:

I think a noun is supposed to have obl dependents only if it stands for a clause, e.g. in a copular clause or when it is promoted due to ellipsis.

So I'd go with obl:arg(a,sous) instead of making the head of the object besoin. Then the verb is a besoin, and the oblique object, marked with de, is de sous.
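To make the difference between the proposals for "Bill a besoin de sous" concrete, here is a minimal sketch that encodes each one as (deprel, head, dependent) triples and looks for obl dependents attached to a NOUN head, which is the configuration objected to above. Several things are assumptions of this sketch: the UPOS tags, the variable and function names, the omission of the case marker de, the use of plain obj in the nmod variant, and the choice to keep obj:lvc(a,besoin) in the third variant (the comment above only specifies obl:arg(a,sous)).

```python
# Assumed UPOS tags for the content words of "Bill a besoin de sous".
UPOS = {"Bill": "PROPN", "a": "VERB", "besoin": "NOUN", "sous": "NOUN"}

# Proposal from the French treebanks: obj:lvc plus obl:arg under the noun.
noun_headed_obl = [
    ("nsubj", "a", "Bill"),
    ("obj:lvc", "a", "besoin"),
    ("obl:arg", "besoin", "sous"),
]

# The alternative named in that proposal: an ordinary nmod under the noun.
noun_headed_nmod = [
    ("nsubj", "a", "Bill"),
    ("obj", "a", "besoin"),
    ("nmod", "besoin", "sous"),
]

# Variant suggested in this comment: attach the oblique to the verb instead.
verb_headed_obl = [
    ("nsubj", "a", "Bill"),
    ("obj:lvc", "a", "besoin"),   # assumption: obj:lvc is kept
    ("obl:arg", "a", "sous"),
]

def obl_under_noun(relations, upos=UPOS):
    """List (head, dependent) pairs where an obl(:subtype) hangs off a NOUN."""
    return [
        (head, dep)
        for deprel, head, dep in relations
        if deprel.split(":")[0] == "obl" and upos[head] == "NOUN"
    ]

for name, rels in [
    ("noun-headed obl:arg", noun_headed_obl),
    ("noun-headed nmod", noun_headed_nmod),
    ("verb-headed obl:arg", verb_headed_obl),
]:
    print(name, obl_under_noun(rels))
```

Only the first encoding produces a noun with an obl dependent. The check deliberately ignores the exceptions mentioned in the thread (copular clauses and promotion under ellipsis), so it is a rough diagnostic rather than a validity test.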

I believe this is how I'd handle Yiddish ליב האָבן (lib hobn) "like", e.g. זײ האָבן ליב דעם הונט (zey hobn lib dem hunt), "they like the dog", with an accusative object in addition to ליב (lib). Or maybe this is okay to call compound:lvc? And זײ האָבן עס ניט ליב (zey hobn es nit lib) "they don't like it" would just have intervening obj and advmod, something like this?

1   זײ  _   PRON    _   _   2   nsubj   _   _
2   האָבן   _   VERB    _   _   0   root    _   _
3   עס  _   PRON    _   _   2   obj _   _
4   ניט _   ADV _   _   2   advmod  _   _
5   ליב _   PART    _   _   2   compound:lvc    _   _
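For anyone who wants to inspect such a fragment programmatically, the following is a minimal sketch of reading the ten CoNLL-U columns and listing the dependents of the verb. It assumes tab-separated columns and plain integer IDs (no multiword-token ranges or empty nodes), and it uses the transliterations given above (zey, hobn, es, nit, lib) in the FORM column instead of the Yiddish script; the function names are made up for this illustration.

```python
# The fragment above, with transliterated forms and explicit tab separators.
CONLLU = "\n".join([
    "1\tzey\t_\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\thobn\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tes\t_\tPRON\t_\t_\t2\tobj\t_\t_",
    "4\tnit\t_\tADV\t_\t_\t2\tadvmod\t_\t_",
    "5\tlib\t_\tPART\t_\t_\t2\tcompound:lvc\t_\t_",
])

def parse_conllu(text):
    """Parse basic CoNLL-U token lines into dicts (integer IDs only)."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    return tokens

def dependents(tokens, head_id):
    """Return (form, deprel) pairs for all tokens attached to head_id."""
    return [(t["form"], t["deprel"]) for t in tokens if t["head"] == head_id]

tokens = parse_conllu(CONLLU)
print(dependents(tokens, 2))
```

The listing makes it easy to see that the verb governs both an obj and a compound:lvc dependent in this analysis, with the adverb intervening between them in linear order.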

nschneid commented 3 months ago

I'm guessing the Yiddish idiom is analogous to English "have no idea" (see above), where the clause has a separate object in a productive slot? Some sort of compound is plausible there, yes.

dan-zeman commented 3 months ago

I believe this is how I'd handle Yiddish ליב האָבן (lib hobn) "like", e.g. זײ האָבן ליב דעם הונט (zey hobn lib dem hunt), "they like the dog", with an accusative object in addition to ליב (lib). Or maybe this is okay to call compound:lvc? And זײ האָבן עס ניט ליב (zey hobn es nit lib) "they don't like it" would just have intervening obj and advmod, something like this?

I would probably use xcomp from hobn to lib.

nschneid commented 3 months ago

Why xcomp? Is lib a secondary predicate?

dan-zeman commented 3 months ago

Why xcomp? Is lib a secondary predicate?

Yes, that would be the interpretation of the xcomp.

Stormur commented 3 months ago

Why xcomp? Is lib a secondary predicate?

Yes, that would be the interpretation of the xcomp.

Are secondary predicates not more advcl (we even use advcl:pred for them)? I take xcomp to mean a (I would say "required", if this would not entail an adjunct/complement distinction...) complement of a verb, something in its valency.

Stormur commented 3 months ago

I am starting with a general consideration:

The question is how to know whether/when to annotate such constructions literally (obj, nsubj) or as compounds (compound, compound:lvc).

My solution, maybe too radical for many, would be to ditch compound altogether. There is not a single instance it is applied to that is not controversial or does not appear to be incredibly language-specific. A parallel MWE annotation layer or relation subtypes is the way to go, in my opinion. So for me, in all cases discussed here, we have just objs.


Bill a besoin de sous. 'Bill needs money.', lit. Bill has need of money. obj:lvc(a,besoin) obl:arg(besoin,sous), rather than nmod(besoin,sous)

It is a case where MWE annotation and syntactic annotation intersect, at least with UD annotation scheme which distinguish obl and nmod.

I do not really like this approach, because it mixes things. If besoin is annotated as an obj, and more in general as a NOUN with a dependent, I see no reason not to use nmod if it is in some way similar to what happens in Italian with bisogno di soldi = besoin de sous:

I think that the solution is to acknowledge that have-verbs are auxiliaries exactly like be-verbs, and I suppose this is the insight behind such a mixed annotation. Then besoin would rightly be the head of the clause, but in any case this would not change the status of its nmod.

I think a noun is supposed to have obl dependents only if it stands for a clause, e.g. in a copular clause or when it is promoted due to ellipsis.

This still will not happen if a whole noun phrase appears in a copula. There might be obl dependents, but the same way as with any other predicates. And obl is not thinkable with promotion: in that case orphan is the only choice not to create a non-existent construction.

IMO the strongest case for compound:lvc in English would be "have no idea" and similar idioms, because they can occur with a second NP whose function is difficult to classify if "idea" is an object: "I had no idea the time". UniversalDependencies/UD_English-EWT#444 That is, they are LVCs that trigger bizarre syntax. There are only a few expressions like this, though.

These are interesting cases. Though, is it not a little cherry-picking to base a "non-canonical" annotation for all occurrences of a given construction on outliers? Is it not possible to think of solutions for these cases specifically? For example:

I would favour the second solution, as it is the most straightforward and in line with other syntactic observations (also focusing, topicalisation, etc., so maybe even dislocated might apply).

I envision also a third way, as discussed for the previous Yiddish example and maybe applicable to Turkish tercih et etc., i.e. secondary predication. It seems to me the correct way to represent what is happening there, especially when make-verbs are involved, for example "I make something (obj) my favourite (xcomp/advcl:pred)". Then again, it could also be possible to consider some of these verbs auxiliaries tout court.

nschneid commented 3 months ago

I think that the solution is to acknowledge that have-verbs are auxiliaries exactly like be-verbs

TBC are you proposing a second copula (attaching as cop)? Right now the validator only allows one copula per language.

Stormur commented 3 months ago

I think that the solution is to acknowledge that have-verbs are auxiliaries exactly like be-verbs

TBC are you proposing a second copula (attaching as cop)? Right now the validator only allows one copula per language.

Yes, I am thinking of that. The difference would be that one copula is intransitive, the other transitive. Of course, this raw idea needs to be elaborated further, but I think it is promising.

Anyway, it would not change the nmod/obl issue discussed before.

coltekin commented 3 months ago

I could not follow the discussion for a while, apologies if some of these were discussed earlier. First, I think there is a difference between the compound(:lvc) and (multiple) objects, or secondary predication. I think (noun-verb) compounds should not be considered like standard/syntactic object-verb constructions. They do behave differently. I think LVCs are compounds but we cannot just rely on semantic lightness of the verb to determine this.

Fortunately, there seem to be some tests. Some of them can probably be applied to most, if not all, languages. For Kyrgyz, I am sure we would at least find some guidelines that are valid for most (all?) Turkic languages. The tests I could collect are:

This is not an exhaustive list. Many (but not all) are also applicable to constructions with case marked and possessive nouns listed above. Probably we can find out/come up with more tests as well. Not all of these work on all cases, and there will definitely be leaks, but I don't think we are helpless for determining noun-verb compounds.

If these constructions are rare in the language, and do not result in transitive verbs, the choice of treating them as verb-object constructions is maybe understandable. However, particularly for Turkish/Turkic I think this would make the analyses quite incoherent.

Stormur commented 3 months ago

I am not so sure about the validity of these tests. To me they seem very often to depend on the semantics of these nouns. So for example it might be that English treats shower in a way, as an uncountable entity, while in Italian you can well say

fare 'to do/make' is also a rather weak verb in Italian, but I would not end up saying that "take a shower" in English is a compound, while in Italian it is not. They really look the same to me, but then English treats some nouns differently.

The really important observation is about transitivity, as in your example

But then I do not understand what a compound VERB+NOUN should be, since compound has been defined as a kind of nominal modifier, and this seems to me an undue extension of the tag. The two (compatible) ways of dealing with this are:

The treatment of etmek as an auxiliary seems to be shared by some lexicographic sources (e.g. Wiktionary, the first I could find).

Such an annotation would be more in line with other constructions. It would avoid the too wide range of compound and the consequent specific treatments for it that one has to include in any query, disrupting them.

jonorthwash commented 3 months ago

The treatment of etmek as an auxiliary seems to be shared by some lexicographic sources (e.g. Wiktionary, the first I could find).

What does "auxiliary" mean though? In Turkic languages, it usually refers to a verb that occurs as part of a single predicate with a non-finite verb form, such as şarkı söyleyip durdu ‘they kept singing’.

AUX is obviously broader than that, and includes e.g. the defective copula verb in Turkic languages too. But etmek feels more or less like a transitive verb, just with very "light" semantics.

Stormur commented 3 months ago

Hm, I would have said that AUX in UD is much narrower than in many other formalisms! I am just thinking of so-called modal and/or copular verbs: e.g. I think that seem would be treated as an auxiliary in some accounts, but usually it is not in UD. Similarly, if asked impromptu I would not label that durdu as AUX.

I would try to define an auxiliary/AUX as a verbal element* which helps another (verbal or non-verbal) element to form a predicate, without carrying any content (just grammatical categories).

From what I am seeing here, etmek seems to fit this description in that it is so "light" that it even "loses" its object in favour of the true lexical head. It is just a support to form a transitive predicate. If kabul means 'acceptance', kabul etmek is 'to accept (smth)'; if tercih means 'preference', tercih etmek is 'to prefer (smth)'. This looks really regular, and all semantics are carried by the NOUN, which might even be a cranberry word (right?) if it appears only in such a construction. telefon etmek would also be regular in the same sense, because it is "the transitive action towards someone through the telephone", so 'to call (smb) by telephone'.

This seems very much stronger than supposedly light verbs like Italian fare 'to do/make', which always keeps its transitive structure with a noun, no matter what (but then, it can also act like a "causative auxiliary" with other verbs). etmek seems to have gone a step farther, becoming a functional element. Also in söyleyip durdu the verb durdu still contributes to the content. Maybe it is more in the background, but I would not call it a copula (yet).

Then, interestingly, the trend that we observe is that grammatical functions are devolved to the more functional element, while the lexical one is a "less finite" form. But I would not say this is a necessary nor a sufficient condition, just a general trend (observed e.g. in articles retaining case distinctions more than nouns, and so on).


* I know that AUX has been extended also to non-verbal elements, but this might not be relevant here.

coltekin commented 3 months ago

[I am adding some more data to the main point. I'd be very happy to discuss some of the other matters above, but I am afraid it may cause too much diversion from the original issue.]

I do not think etmek in Turkish is AUX, it does not just carry verbal features, it is the predicate. Yet, the lexical meaning is only complete with another word. The meaning is loosely 'to do/to make', but it is rarely used as a main verb. It is true that we have some tendencies towards regular/compositional (meaning) added by the verb. In the example above, telefon et- may be considered so ('doing phone(ning)'), but in other cases the 'to do' sense is not around at all (e.g., baş etmek 'to overcome', literally 'head do'). To my understanding, AUX is a syntactic dependency, its morphological analogue is inflection. The verbs like etmek here, cause a semantic change, they are close to what derivational morphology does. In fact, for most native speakers, it is also one of the common typos (e.g., many people would write başetmek without a space). Further, some are already written without spaces zulmetmek 'to torture / cause suffering', bahşetmek 'deign / grant'. And, this is not only about the etmek, there are others where we see similar issues.

I do not like the idea of analyzing telefon etmek (intransitive - dative argument) and tamir et 'repair' (transitive) differently. These are very similar (lexical) constructions. Distinguishing these two forms because one is transitive and the other is not produces inelegant analyses.

Also, the lexicalized/MWE use and the productive/syntactic use may both be available in some cases. For example:

(1) Bunlar birçok can ve mal kaybına neden olmaktadır. 'These cause many damages to life and property.' [BOUN ins_1502]

(2) Bunu (bir) neden olmadan yapamayız. 'We cannot do this without (having/being) a reason/justification.'

In (1) neden olmak is 'to cause' (intransitive - dative argument), and it is an MWE; in (2) neden 'reason' is the object of the verb olmak 'to be'. The structure is much more rigid in (1) than (2), and even though the verbal compound in (1) would not take an accusative object (it would take a dative argument), neden in (1) is still not an object. In (2) neden is clearly the object. I do not think we should be annotating these the same way.

In short, I am pretty certain that these are MWEs, and should not be analyzed using usual syntactic dependencies (like obj). In fact, many Turkic treebanks already annotate them with compound:lvc. What we lack is tests that could tell these apart. I realize that the tests above are not perfect, but it may be a good starting point.

Stormur commented 3 months ago

I would like to comment on these points, and I am convinced they are quite relevant in a discussion about compound and obj.


I do not think etmek in Turkish is AUX, it does not just carry verbal features, it is the predicate. Yet, the lexical meaning is only complete with another word. The meaning is loosely 'to do/to make', but it is rarely used as a main verb.

You are rather precisely giving the definition of auxiliary (AUX). Any auxiliary is (part of) a predicate: in European languages it often happens that it is even the "grammatical locus".

AUX is a syntactic dependency, its morphological analogue is inflection.

Maybe I am pedantic, and I do not know if this was an error, but the dependency is aux, while AUX is an element realising it. I would say that an AUX is an element that always, or almost always, has this behaviour (as you note for etmek, as it is for to be in English etc.), so most of the time it will depend as aux (but then there are ellipses, too). I think that functional relations are already considered inflection in UD: what changes is whether this inflection happens through bound or free morphs.

The verbs like etmek here, cause a semantic change, they are close to what derivational morphology does.

But even "non-derivational" morphology brings about semantic changes: for example, the anchoring to a time (Tense) is semantic, cases express the function of arguments and so can be interpreted semantically...

An important aspect is regularity. A construction like etmek seems to be extremely predictable. At the same time, derived adjectives in en. -ous only transmit a vague relation to the noun base, so petalous has something to do with petals, but what exactly is left to context. Also, another thing is if there are real alternatives to etmek to form such predicates.

common typos (e.g., many people would write başetmek without a space). Further, some are already written without spaces zulmetmek 'to torture / cause suffering', bahşetmek 'deign / grant'. And, this is not only about the etmek, there are others where we see similar issues.

This really strengthens a functional reading of etmek. Is this not very similar to the evolution of -bil- and -yor-?

I do not like the idea of analyzing telefon etmek (intransitive - dative argument) and tamir et 'repair' (transitive) differently.

From the examples before, I understood that telefon etmek is transitive... can it be both or did I understand wrong?

(1) Bunlar birçok can ve mal kaybına neden olmaktadır. 'These cause many damages to life and property.' [BOUN ins_1502]

In (1) neden olmak is 'to cause' (intransitive - dative argument), and it is a MWE, in (2) neden 'reason' is the object of the verb olmak 'to be'. The structure is much more rigid in (1) than (2), and even though the verbal compound in (1) would not take an accusative object (it would take a dative argument), neden in (1) is still not an object.

From your description, I get that olmak is a copula, so the dependency here is cop(neden,olmaktadır), neden is the root, and as an intransitive predicate we cannot have an object. Literally maybe "they are the reason for many damages to life and property"?

(2) Bunu (bir) neden olmadan yapamayız. 'We cannot do this without (having/being) a reason/justification.'

in (2) neden is clearly the object. I do not think we should be annotating these the same way.

And here I understand that yapmak is a fully lexical verb (by the way, could it be substituted for etmek here?)

I agree the two sentences are different constructions (a copular and a transitive one). But if we eventually find that etmek behaves more like olmak than yapmak, then they should be annotated the same (or an equivalent) way.

In short, I am pretty certain that these are MWEs, and should not be analyzed using usual syntactic dependencies (like obj).

This is a crucial point. MWE annotation is a different level than syntax (see e.g. the work on PARSEME): it should not be allowed to percolate into it. In my opinion, doing so with relations like compound:lvc introduces interferences and blurs morphosyntactic annotation.

coltekin commented 3 months ago

I would like to comment on these points, and I am convinced they are quite relevant in a discussion about compound and obj.

I'll try.

I do not think etmek in Turkish is AUX, it does not just carry verbal features, it is the predicate. Yet, the lexical meaning is only complete with another word. The meaning is loosely 'to do/to make', but it is rarely used as a main verb.

You are rather precisely giving the definition of auxiliary (AUX). Any auxiliary is (part of) a predicate: in European languages it often happens that it is even the "grammatical locus".

I disagree, the UD documentation says

An aux (auxiliary) of a clause is a function word associated with a verbal predicate that expresses categories such as tense, mood, aspect, voice or evidentiality.

The issue with et- is that et- normally combines with nouns. VERB + et is very rare (I've only seen real examples in code-switching corpora with foreign verbs). If the noun (before being "associated" with et-) had some predicative function, I'd be more willing to agree to aux, but the group of words (e.g., telefon et-) becomes a predicate only when both are together.

Furthermore, syntactically et- does not have the effect of complementing the predicate with additional TAME.

AUX is a syntactic dependency, its morphological analogue is inflection.

Maybe I am pedantic, and I do not know if this was an error, but the dependency is aux, while AUX is an element realising it. I would say that an AUX is an element that always, or almost always, has this behaviour (as you note for etmek, as it is for to be in English etc.), so most of the time it will depend as aux (but then there are ellipses, too). I think that functional relations are already considered inflection in UD: what changes is whether this inflection happens through bound or free morphs.

Yes, you were pedantic ;-) It was meant to be aux (maybe my subconscious was trying to rule out cop, too). It is also not like be in English, and even though there is some "default" (semantic) function that is rather predictable/compositional, it is not always so.

The verbs like etmek here, cause a semantic change, they are close to what derivational morphology does.

But even "non-derivational" morphology brings about semantic changes: for example, the anchoring to a time (Tense) is semantic, cases express the function of arguments and so can be interpreted semantically...

An important aspect is regularity. A construction like etmek seems to be extremely predictable. At the same time, derived adjectives in en. -ous only transmit a vague relation to the noun base, so petalous has something to do with petals, but what exactly is left to context. Also, another thing is if there are real alternatives to etmek to form such predicates.

I agree. And, I would be willing to "invent" a function (more than considering it a stand-alone predicate with one or more objects) in syntax to assign to et-, if it was very predictable. But it is not, and this structure/construction is not specific to et-, there are other non-verb constructions with similar behavior. et- (ele-, kıl- in other Turkic languages) turns out to be the most productive with respect to what they can combine with. However, for a syntactic/inflectional construction, we'd expect it to be less selective. You cannot just combine et- with any noun: *kitap et-, *masa et-, *bilgisayar et-, at least at this point in time.

common typos (e.g., many people would write başetmek without a space). Further, some are already written without spaces zulmetmek 'to torture / cause suffering', bahşetmek 'deign / grant'. And, this is not only about the etmek, there are others where we see similar issues.

This really strengthens a functional reading of etmek. Is this not very similar to the evolution of -bil- and -yor-?

This is a good point, but bil- and -yor attach to predicates (verbs), and they add TAME features. So, they fit the bill for aux perfectly in their free form. And it is easy to assign an inflection-like feature when they are bound. I cannot say the same for et-.

I do not like the idea of analyzing telefon etmek (intransitive - dative argument) and tamir et 'repair' (transitive) differently.

From the examples before, I understood that telefon etmek is transitive... can it be both or did I understand wrong?

For UD, telefon et- is intransitive. We do not telefon et- 'someone' but 'to someone' (it has a dative argument, which in UD is obl). It is only intransitive in that sense.

(1) Bunlar birçok can ve mal kaybına neden olmaktadır. 'These cause many damages to life and property.' [BOUN ins_1502] In (1) neden olmak is 'to cause' (intransitive - dative argument), and it is a MWE, in (2) neden 'reason' is the object of the verb olmak 'to be'. The structure is much more rigid in (1) than (2), and even though the verbal compound in (1) would not take an accusative object (it would take a dative argument), neden in (1) is still not an object.

From your description, I get that olmak is a copula, so the dependency here is cop(neden,olmaktadır), neden is the root, and as an intransitive predicate we cannot have an object. Literally maybe "they are the reason for many damages to life and property"?

I do not think ol- is a copula. In fact, I do not think it is ever a copula in modern Turkish (it is in some other Turkic languages). It has an auxiliary function, but in this function it always attaches to predicates. Otherwise, it is, as far as I can tell, the fully lexical verb 'to become'. The literal translation may have been correct at some point in time. It may also be the reason for the current usage. However, for a Turkish speaker, the copular construction corresponding to they are the reason for many damages to life and property would be Bunlar birçok can ve mal kaybına nedendirler. The natural copula is an affix, with a limited/marked usage of a form i-. If you use ol- instead of i- (or the suffix version), it would mean "they become the reason" normally. Semantics make it difficult to construct it here, but there is also a reading for neden ol- as "to become the reason". For example, I could easily say that başarının neden-i oldu 'he/she became the reason for success' (unambiguously with the help of the accusative marker), and başarıya neden oldu is ambiguous between 'he/she became the reason for success' and 'he/she caused the success', but no sense of 'he/she/it was the reason'.

(2) Bunu (bir) neden olmadan yapamayız. 'We cannot do this without (having/being) a reason/justification.' in (2) neden is clearly the object. I do not think we should be annotating these the same way.

And here I understand that yapmak is a fully lexical verb (by the way, could it be substituted for etmek here?)

Yes, yap- is a fully lexical verb (but it may also participate in similar constructions). And, if I understand the question correctly, we cannot replace it with et-: *Bunu (bir) neden olmadan edemeyiz, at least not in standard Turkish (it may be acceptable in some dialects).

I agree the two sentences are different constructions (a copular and a transitive one). But if we eventually find that etmek behaves more like olmak than yapmak, then they should be annotated the same (or an equivalent) way.

If neden ol-du was a proper copular construction ('was the/a reason'), we'd expect 'doktor ol-du' to also mean 'he/she was the/a doctor', but the second one is simply 'he became a doctor'.

One more argument against ol- as copula: it can be passivized. neden ol-un-du is perfectly fine.

In short, I am pretty certain that these are MWEs, and should not be analyzed using usual syntactic dependencies (like obj).

This is a crucial point. MWE annotation is a different level than syntax (see e.g. the work on PARSEME): it should not be let percolate into it. In my opinion, doing so with relations like compound:lvc introduces interferences and blurs morphosyntactic annotation.

As I understand, compound is one of the relations to annotate some types of MWE in UD. The documentation does not list noun-verb compounds, but if verb-verb and verb-particle compounds are also fine, I think these also fit the usage. Just a final note: many of these would normally be in the dictionaries. For example, both telefon etmek and neden olmak are in the "official" dictionary, presumably because of this lexicalized/non-compositional status. While many of the noun-noun compounds annotated as compound in the UD English treebanks would have a more compositional meaning and may not be listed in dictionaries.
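To make the contrast under discussion concrete, here is a minimal sketch of the two competing annotations for a telefon et- clause. The sentence Ona telefon etti ('he/she phoned him/her') is a constructed example, not a treebank sentence. The rows are a simplified five-column stand-in for CoNLL-U (ID, FORM, UPOS, HEAD, DEPREL), and attaching the noun to the verb in both analyses is an assumption made for illustration only. The helper names `parse` and `diff` are hypothetical.

```python
# Two candidate analyses of the constructed example "Ona telefon etti".
# Columns (simplified, NOT full CoNLL-U): ID, FORM, UPOS, HEAD, DEPREL.
# Assumption: the light verb is the head and the noun depends on it.
ANALYSES = {
    "compound:lvc": """\
1\tOna\tPRON\t3\tobl
2\ttelefon\tNOUN\t3\tcompound:lvc
3\tetti\tVERB\t0\troot
""",
    "obj": """\
1\tOna\tPRON\t3\tobl
2\ttelefon\tNOUN\t3\tobj
3\tetti\tVERB\t0\troot
""",
}


def parse(block):
    """Turn the simplified tab-separated rows into a list of token dicts."""
    rows = []
    for line in block.strip().splitlines():
        tok_id, form, upos, head, deprel = line.split("\t")
        rows.append({"id": int(tok_id), "form": form, "upos": upos,
                     "head": int(head), "deprel": deprel})
    return rows


def diff(a, b):
    """List (form, deprel_a, deprel_b) for tokens whose attachment differs."""
    return [(x["form"], x["deprel"], y["deprel"])
            for x, y in zip(a, b)
            if (x["head"], x["deprel"]) != (y["head"], y["deprel"])]


lvc = parse(ANALYSES["compound:lvc"])
plain = parse(ANALYSES["obj"])
differences = diff(lvc, plain)
print(differences)
```

Under these assumptions the two trees differ only in the label on telefon; the dative argument is obl in both, which is why the choice between the labels cannot be read off the argument structure alone.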

Stormur commented 3 months ago

Thanks for the comments and the discussion. They help a lot in making the situation clearer! And I hope not to sound too grumpy in written form, it really is not the case :slightly_smiling_face:

I do not think etmek in Turkish is AUX, it does not just carry verbal features, it is the predicate. Yet, the lexical meaning is only complete with another word. The meaning is loosely 'to do/to make', but it is rarely used as a main verb.

You are rather precisely giving the definition of auxiliary (AUX). Any auxiliary is (part of) a predicate: in European languages it often happens that it is even the "grammatical locus".

I disagree, the UD documentation says

An aux (auxiliary) of a clause is a function word associated with a verbal predicate that expresses categories such as tense, mood, aspect, voice or evidentiality.

The issue with et- is that et- normally combines with nouns. VERB + et is very rare (I've only seen real examples in code-switching corpora with foreign verbs). If the noun (before having "associated" with et-) had some predicative function, I'd be more willing to agree for aux, but the group of words (e.g., telefon et-) becomes a predicate only when both are together.

Furthermore, syntactically et- does not have the effect of complementing the predicate with additional TAME.

Here we are probably confusing the part of speech AUX and the related deprel aux. I was referring to the former: an AUX like eng. to be can combine with anything, also NOUNs, but then it has relation cop, and I was actually thinking of that (maybe I got confused later on). etmek might be specialised to combine with NOUNs. I also meant that any copula complements TAME categories to its non-verbal part, e.g. NOUNs do not show Tense, Mood, etc.

The verbs like etmek here, cause a semantic change, they are close to what derivational morphology does.

But even "non-derivational" morphology brings about semantic changes: for example, the anchoring to a time (Tense) is semantic, cases express the function of arguments and so can be interpreted semantically... An important aspect is regularity. A construction like etmek seems to be extremely predictable. At the same time, derived adjectives in en. -ous only transmit a vague relation to the noun base, so petalous has something to do with petals, but what exactly is left to context. Also, another thing is if there are real alternatives to etmek to form such predicates.

I agree. And, I would be willing to "invent" a function (more than considering it a stand-alone predicate with one or more objects) in syntax to assign to et-, if it was very predictable. But it is not, and this structure/construction is not specific to et-, there are other non-verb constructions with similar behavior. et- (ele-, kıl- in other Turkic languages) turns out to be the most productive with respect to what they can combine with. However, for a syntactic/inflectional construction, we'd expect it to be less selective. You cannot just combine et- with any noun: *kitap et-, *masa et-, *bilgisayar et-, at least at this point in time.

It would be interesting to investigate the distribution of these other verbs. Anyway, selectivity becomes relevant only if etmek does appear in other contexts: but if nearly the totality of its appearances are in similar "copular constructions", then this would strengthen a treatment as AUX.

I am not convinced about inventing new functions... how applicable could they be? Maybe a specific one for "light verbs"?

(1) Bunlar birçok can ve mal kaybına neden olmaktadır. 'These cause many damages to life and property.' [BOUN ins_1502] In (1) neden olmak is 'to cause' (intransitive - dative argument), and it is a MWE, in (2) neden 'reason' is the object of the verb olmak 'to be'. The structure is much more rigid in (1) than (2), and even though the verbal compound in (1) would not take an accusative object (it would take a dative argument), neden in (1) is still not an object.

From your description, I get that olmak is a copula, so the dependency here is cop(neden,olmaktadır), neden is the root, and as an intransitive predicate we cannot have an object. Literally maybe "they are the reason for many damages to life and property"?

I do not think ol- is a copula. In fact, I do not think it is ever a copula in modern Turkish (it is in some other Turkic languages). It has an auxiliary function, but in this function it always attaches to predicates. Otherwise, it is, as far as I can tell, the fully lexical verb 'to become'. The literal translation may have been correct at some point in time. It may also be the reason for the current usage. However, for a Turkish speaker, the copular construction corresponding to they are the reason for many damages to life and property would be Bunlar birçok can ve mal kaybına nedendirler. The natural copula is an affix, with a limited/marked usage of a form i-. If you use ol- instead of i- (or the suffix version), it would mean "they become the reason" normally. Semantics make it difficult to construct it here, but there is also a reading for neden ol- as "to become the reason". For example, I could easily say that başarının neden-i oldu 'he/she became the reason for success' (unambiguously with the help of the accusative marker), and başarıya neden oldu is ambiguous between 'he/she became the reason for success' and 'he/she caused the success', but no sense of 'he/she/it was the reason'.

Then ol- reminds me of the ambiguous status of fieri, also ca 'become', in Latin.

OK, so ol- does not look like a copula, I was misled by the translation. I am curious about that transitive construction with an accusative marker, though. Maybe I am confused because here we are using an intransitive, copular construction with become to express it in English.

As I understand, compound is one of the relations to annotate some types of MWE in UD. The documentation does not list noun-verb compounds, but if verb-verb and verb-particle compounds are also fine, I think these also fit the usage. Just a final note: many of these would normally be in the dictionaries. For example, both telefon etmek and neden olmak are in the "official" dictionary, presumably because of this lexicalized/non-compositional status. While many of the noun-noun compounds annotated as compound in the UD English treebanks would have a more compositional meaning and may not be listed in dictionaries.

In a sense it is, and this is a problem when MWE are such more from a semantic than morphosyntactic perspective. I admit I also have problems with the extension to verbal particles, because this is

I do not think that dictionaries should be a decisive criterion, they follow different logics, in fact mixing morphosyntax with lexical levels.