UD's foundations: functionalism vs distributionalism

UniversalDependencies / docs

Universal Dependencies online documentation

http://universaldependencies.org/


UD's foundations: functionalism vs distributionalism #1063

Open sylvainkahane opened 3 weeks ago

sylvainkahane commented 3 weeks ago

In the long discussion #1059, @jnivre argued that syntactic relations must be defined on functionalist grounds and that, for instance, all genitive constructions must be nmod, whether they involve a noun or a pronoun.

On the other hand, if we look at the English treebanks (the only set of treebanks that all of us can easily explore, and consequently our unavoidable reference), we see that there are three different syntactic constructions where a noun depends on a noun, which are clearly distinguished by the use of three different syntactic relations:

nmod for "N of N"
nmod:poss for "N's N"
compound for "N N"

This is what I would call a distributionalist approach to syntactic relations: syntactic constructions that can be clearly distinguished in the language by distributional/syntactic properties are distinguished.

I think that both approaches, functionalist and distributionalist, are useful. But the UD tagset must be clarified and should not mix both approaches at the same level. For instance, in the case of the three constructions in English where a noun depends on a noun, we should have the relation nmod and an (optional) subrelation indicating the particular constructions. For instance:

nmod:adp for "N of N"
nmod:poss for "N's N" (or maybe nmod:det because "N's" are in the same syntactic position as determiners in English)
nmod:compound for "N N"
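
As a rough illustration of this proposal, the relabeling could be sketched as a lookup table. This is only a sketch: the table `PROPOSED` and the helper `relabel` are hypothetical names, not part of any UD tool, and it assumes that every plain nmod in English is the "N of N" construction and every compound is "N N", which is a simplification.

```python
# Hypothetical mapping from the labels currently used in the English
# treebanks to the proposed "nmod + construction subtype" labels.
PROPOSED = {
    "nmod": "nmod:adp",           # "N of N"
    "nmod:poss": "nmod:poss",     # "N's N" (nmod:det is the alternative mentioned above)
    "compound": "nmod:compound",  # "N N"
}


def relabel(deprel: str) -> str:
    """Return the proposed label, leaving all other relations unchanged."""
    return PROPOSED.get(deprel, deprel)


print([relabel(d) for d in ["nmod", "nmod:poss", "compound", "nsubj"]])
```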

I don't think that compound is really justified on functionalist grounds.

sylvainkahane commented 3 weeks ago

I can give another example. English has (at least) four syntactic constructions where a clause modifies a noun:

participial clauses: the book given by Zoe
relative clauses: the book (that) Zoe gave
complement clauses (?): the fact that Zoe gave the book
infinitive clauses: a book to give to Bill

It would be justified, from the distributionalist point of view, to distinguish these four constructions. The English treebanks only distinguish the relative clause, with the relcl subrelation. I think that relcl, in this case, corresponds to a particular construction of English. It is interesting to try to give a universal definition of this construction, because other languages have a similar construction (for instance "A relative clause is an instance of acl, characterized by finiteness and usually omission of the modified noun in the embedded clause."). But our guidelines must be very clear that relcl should only be used in languages that have such a construction (finiteness and omission). Note that the participial clause is also a very particular construction which deserves a subrelation (non-finiteness and omission), and which could be distinguished from the infinitive clause (also non-finiteness and omission). The same goes for the third construction (complement construction?) (finiteness and no omission). It could also be interesting to distinguish constructions involving a (relative) pronoun from constructions involving a pure relativizer (even if it is difficult and often controversial, even for very well studied languages such as English or French).

Anyway, considering only the relative clause could be justified for English, but universal guidelines should consider all the possibilities (including a fifth construction: non-finiteness and no omission) and propose a more complete terminology. This would avoid many of the inconsistencies we find today in the annotation of adjectival clauses (acl).
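
The two criteria used in this comment, finiteness and omission of the modified noun, can be laid out as a small grid. The following is a minimal sketch whose property values are taken from the description above; the names `CONSTRUCTIONS` and `group_by_properties` are hypothetical and used only for illustration.

```python
# (finite, omission) values for the four English constructions,
# as characterized in the comment above.
CONSTRUCTIONS = {
    "relative clause": (True, True),
    "participial clause": (False, True),
    "infinitive clause": (False, True),
    "complement clause": (True, False),
}


def group_by_properties(constructions):
    """Group construction names by their (finite, omission) combination."""
    groups = {(f, o): [] for f in (True, False) for o in (True, False)}
    for name, props in constructions.items():
        groups[props].append(name)
    return groups


groups = group_by_properties(CONSTRUCTIONS)
for (finite, omission), names in groups.items():
    print(f"finite={finite}, omission={omission}: {names}")
```

The cell for non-finiteness without omission stays empty for these four English constructions, which is the remaining possibility that universal guidelines would also need to name.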

nschneid commented 3 weeks ago

Thanks for a nice synopsis of these two ways of thinking about dependency relations.

I think that both approaches, functionalist and distributionalist, are useful. But the UD tagset must be clarified and should not mix both approaches at the same level.

My understanding is that the main/universal relation is meant to follow the functionalist approach, whereas subtypes (if present) are more language-specific and often follow a distributionalist approach. So the main relation and the subtype represent two different levels.

On some of the specifics:

The use of plain compound in English has always been for what might be considered a nominal modification construction. It is tricky because this construction is traditionally called a "compound", since it serves a similar function to compounds in other languages (prototypically, it creates complex concepts like "kitchen table"), though the morphosyntactic details in e.g. German are different and more clearly lean toward a complex-lexeme interpretation.
Plain nmod should be understood as shorthand for what might be called nmod:adp. It is just so frequent that it seems cumbersome to bother annotators with the subtype.
As you say, :relcl as applied in English follows distributional rather than functional criteria. It might more precisely be termed the English Unbounded Dependency Relative Clause Construction. I recall another thread where a more universal definition of relative clause was discussed. A typologically-based definition of relative clause seems worth pursuing (would it warrant bringing back relcl as a main relation in UDv3?).

amir-zeldes commented 3 weeks ago

nmod:compound for "N N"

@sylvainkahane I agree with the point about nmod:poss vs. nmod, but that is covered by nmod:poss being a subtype, so as discussed above, we can think of nmod in English as nmod:adp. The same logic applies to acl:relcl - as a subtype, nothing about that violates universality - it's clear that in some languages there is no clear distinction between relative clauses and other types of adnominal adjunct clauses, but you can find all adnominal clauses by ignoring the subtypes.
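
To make the point about ignoring subtypes concrete, here is a minimal sketch in Python. It assumes relation labels follow the usual `universal:subtype` shape; the helper name `universal_relation` and the sample label list are hypothetical.

```python
def universal_relation(deprel: str) -> str:
    """Return the part of a relation label before the first colon."""
    return deprel.split(":", 1)[0]


# Sample labels (illustrative only, not drawn from a treebank).
labels = ["acl", "acl:relcl", "nmod:poss", "advcl", "nmod"]

# All adnominal clauses, regardless of language-specific subtype.
adnominal_clauses = [d for d in labels if universal_relation(d) == "acl"]
print(adnominal_clauses)
```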

However compound is different - the guidelines state that compound is used for "combinations of lexemes that morphosyntactically behave as single words", which is something quite different from modified nouns. These have some properties that are purely form based (for example lack of separate definiteness marking in English, a very marginal status for pluralizability of the modifier, compound stress) and some properties which are lexical (idiosyncratic meaning) or semantic (lack of referenceability for modifiers, e.g. no pronominalization). These properties clearly set compounds apart from nmods, and they are a typologically widespread phenomenon, though admittedly the details of what constitutes a compound in a given language do vary somewhat.

would it warrant bringing back relcl as a main relation in UDv3?

My sense is that there are too many languages that do not clearly distinguish relative clauses, so while I support strongly recommending the subtype for languages with the distinction, I think generalizations about adnominal clauses would be harder if it were made a major type.

jnivre commented 3 weeks ago

I think this discussion is very important, especially looking forward to a potential version 3 of the guidelines.

Unfortunately, I don't think it is as simple as universal relations always being based on functional criteria. Functional criteria work for relations that are (part of) constructions in Croft's sense, such as nsubj and obj, but some universal relations in UD, such as cop and case, refer to strategies, which are defined by function as well as form (although the form has to be cross-linguistically identifiable, which makes distributional criteria hard to apply). And then there are relations which are not functional at all, like flat, fixed and goeswith. I don't think it will be possible to come up with a taxonomy where all relations are based on the same type of criteria, but one of the things I would hope for in v3 is a slightly more systematic approach. And the main purpose of the "UD constructicon" project that I am trying to get going is to develop a better understanding of what such an approach could look like.

An independent problem with the current way of representing syntactic relations in UD is that the subtyping mechanism is extremely crude and has to do double duty for cross-linguistically prominent subtypes, like "acl:relcl" and "nsubj:pass", as well as more truly language-specific phenomena. Since subtypes are furthermore both atomic and non-recursive, the expressivity is severely limited, which means that many interesting subtypes cannot be represented at all because some other subtype has been given priority. Another desideratum for v3 is therefore to have a more expressive mechanism for subclassifying syntactic relations, in the same way that we can subclassify morphological categories using features.
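
To illustrate the contrast, here is a minimal sketch of the two styles of subclassification. The first parser reflects the current shape of a label (one universal relation, at most one subtype). The second uses a purely hypothetical feature-style notation (`acl|Finite=Yes|Gap=Yes`); neither the notation nor the feature names `Finite` and `Gap` exist in UD, they are assumptions made up for this example.

```python
def parse_v2(label: str):
    """Split a current-style label into (universal, subtype or None)."""
    universal, _, subtype = label.partition(":")
    return universal, (subtype or None)


def parse_featured(label: str):
    """Split a hypothetical feature-style label into (universal, features)."""
    universal, *pairs = label.split("|")
    return universal, dict(p.split("=", 1) for p in pairs)


print(parse_v2("nsubj:pass"))                     # one slot for one subtype
print(parse_featured("acl|Finite=Yes|Gap=Yes"))   # several independent properties
```

With a single subtype slot, two properties compete for the same position; with features, they can be stated side by side, as is already done for morphology.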

sylvainkahane commented 3 weeks ago

@jnivre I agree with you that the functional definition of syntactic relations can only concern the relations between content words in UD, due to the particular status given to function words in UD. If flat itself is rather underspecified, a relation such as flat:name is functionally defined (and it accounts for more than 90% of the occurrences of flat in the English treebanks). And fixed really cannot be considered distributionally defined.

@amir-zeldes I think that the current definition of compound is ineffective. It is totally unclear what "morphosyntactically behav[ing] as single words" means. Most phrases behave as single words, as soon as they can be replaced by a pronoun. Any VP behaves as a single word and can be replaced by a single verb. If you consider that N N compounds are words, compound is no longer a syntactic relation and this must be clarified in the guidelines. If you consider that compound is a syntactic relation between two nouns in English, clearly the first noun depends on the second one and I don't understand why it is not a particular case of "a noun modifying a noun", which is what nmod means. Yes, this syntactic construction has specific properties, what I would call a strong cohesiveness, which makes the dependent not very independent (no pun intended). There is no clear difference between the Lebesgue theorem and Lebesgue's theorem and from the functionalist point of view I think we must consider that there are two strategies for the same construction in Croft's sense.

nschneid commented 3 weeks ago

For UDv3 I think we can consider the possibility that compound is so overloaded in grammar that it is not good to use as a technical term if we want to give it a universal interpretation. (That argument also applies to iobj.)

amir-zeldes commented 3 weeks ago

It is totally unclear what "morphosyntactically behav[ing] as single words" means

@sylvainkahane I totally agree that this is not a sufficient definition, though this is true of many deprels, especially at the universal level. There is a lot of literature on what compounds are and aren't typologically, but I think it really only makes sense as an annotation guideline to consider it on a language-by-language basis. In Semitic languages, there is a tradition of regarding construct states as compounds, even though they are much more flexible than, say, English compounds. By contrast, compounds in German are less flexible than English ones - yet the term compound is still traditionally applied. At the end of the day, as Croft also pointed out as early as Radical Construction Grammar, and earlier, there are no 1:1 correspondences across languages. But I think as a project that serves the linguistic community, UD can still be helpful in labeling some things as compound across languages, matching what we generally expect from the linguistic literature on those languages, at least in some kind of Basic Linguistic Theory.

There is no clear difference between the Lebesgue theorem and Lebesgue's theorem

Actually there are some differences, and they relate to whether or not the modifier has the properties of a normal noun in the language. For example, the modifier has no restriction on number in the genitive construction - "the teachers' book" vs. "the teacher's books" or any other combination - either the head or modifier can be pluralized, or both or neither, like in other environments. This is not true for English compounds, suggesting the modifier is not quite a complete noun in itself.

And if we say phrases are like words in that they can be pronominalized (at least for nouns/NPs), then that is another criterion by which compound modifiers are not normal noun modifiers or phrases - they cannot generally be referred back to by a pronoun: "the book club read it" cannot mean that the club read the book after which it was named "the book club" - we have to introduce some other book earlier in the discourse as the antecedent.

sylvainkahane commented 2 weeks ago

Of course, compounds in English have specific properties, both distributional and functional. The fact that the dependent noun cannot be inflected and is not referential is very important. That is not the question. The question is: what are the criteria we want to use in the definition of universal syntactic relations? For the particular case of nmod, do we want to add that the dependent must be a true NP and should be referential? I am not against that but it is not a restriction we have in the definition of nmod today. Note also that there is a continuum between wordness and phraseness and compounds are closer to words than other phrases. But it is exactly the kind of contrast UD wants to smooth over, putting the functional words aside, using terms such as case for the role of adpositions, etc. These differences between words and phrases are typical distributional properties and not functional ones.

Let’s give another example. In French we have a similar contrast between two constructions, depending on whether the dependent noun has a determiner or not: le livreur de la pizza ’the pizza’s boy’ vs le livreur de pizza ’the pizza boy’. In the second case, ‘pizza’ is not referential, it can less easily be modified, and it is almost impossible to add an adjective before de pizza. But there is a difference between French and English, because both constructions use the same preposition de and they merge when we have a proper noun: le théorème de Lebesgue ’The Lebesgue theorem, Lebesgue’s theorem’. Do we think that we must distinguish the two constructions and use nmod only when the dependent is a referential NP? If I come back to English, I don’t say I want to suppress the distinction between the possessive construction and the compound construction (of course not, I am a distributionalist), but I want to understand how UD wants to define the syntactic relations, whether English compounds are or are not a particular case of nmod, and whether or not we should replace compound by nmod:compound or nmod:whatever_you_want in this particular case.

@jnivre What is your opinion? (but maybe you are biased because you are also a native speaker of a Germanic language).

jnivre commented 2 weeks ago

I am probably biased but maybe in a different way than English speakers, because the distinction between compounding and modification is more clear-cut in Swedish, not only because of orthography (compounds are written without internal spaces, at least in normative orthography) but also because of prosody. As you may remember, Swedish is a tone language, and compounds have one of our two word tones, which phrases including modifiers never have. However, to complicate things, the first part of a compound can be referential, as in "Palmemordet" (the Palme murder), which is the normal way of referring to the (still unsolved) murder of our prime minister Olof Palme in 1986, and this compound is basically synonymous with the phrase "mordet på Palme" (lit. the-murder on Palme), and even the Saxon genitive "Palmes mord" (Palme's murder) is marginally possible (although unnatural in most contexts).

One way of describing this state of affairs is then that Swedish can use three different morphosyntactic strategies (which Croft would call juxtaposition, flag, and linker, respectively) for one and the same functionally defined construction, nominal modification. And from this point of view, it makes perfect sense to use the nmod relation in all three cases, possibly with different subtypes to distinguish the strategies. However, there are a number of complications that we need to take into account.

First of all, the current UD taxonomy of syntactic relations was not defined from the beginning with the goal of separating functionally defined universal constructions from morphosyntactic strategies, even though parts of the taxonomy are perhaps compatible with such a view. It would therefore be hard to implement this idea for the entire taxonomy, which is why I personally see this discussion as mostly relevant for version 3 of the guidelines, which could involve a revision of this taxonomy.

Secondly, the main motivation for having the compound relation in the first place (in my view) is the adoption of the lexical integrity principle in UD, which implies that word-internal relations should not be modeled with the same concepts as word-external relations. Therefore, if it is really true that compounding creates words, rather than phrases, then the nmod relation is not applicable because it is (by definition) a relation of phrasal modification. In some sense, then, the lexical integrity principle overrules the wish to capture constructional similarity. This is how I interpret the (admittedly vague) formulation "morphosyntactically behave like single words". So, applying a relation like nmod to cases of compounding would mean giving up the lexical integrity principle, at least for Swedish, where it is pretty clear that compounds do behave like single words, and this would mean changing a major design principle of UD, which again seems hard to do under v2 of the guidelines.

Thirdly, even though compound modification can be referential in Swedish, it doesn't have to be, which means that the strategy of juxtaposing two lexical stems to form a single word can be associated with multiple (functionally defined) constructions. Maybe some or most of these are similar enough to be grouped under nmod, and I don't want to claim that referentiality is a necessary condition of the nmod relation (although possibly a sufficient one). And for French, unless there is evidence that "livreur de pizza" behaves like a single word, it seems completely fine to me to use nmod for both the referential and the non-referential case.

Stormur commented 2 weeks ago

First of all, the current UD taxonomy of syntactic relations was not defined from the beginning with the goal of separating functionally defined universal constructions from morphosyntactic strategies, even though parts of the taxonomy are perhaps compatible with such a view. It would therefore be hard to implement this idea for the entire taxonomy, which is why I personally see this discussion as mostly relevant for version 3 of the guidelines, which could involve a revision of this taxonomy.

This is interesting to read, because from a "synchronic" point of view this kind of separation seems (at least to me) to be one of the main goals. Personally, it is an impression that grew stronger and stronger as I annotated data myself, as a kind of necessity.

One could also argue that it is not possible to do otherwise if the goal is to achieve comparability. If we, for example, say that the vague notion of compound cannot but follow language-specific logics in each language, well, I do not see what it will then be useful for in a universal context (and here I would just more or less repeat all issues laid out by @sylvainkahane ): at this point, annotation collapses into individual, incommensurable formalisms. And by the way, Croft does not say that constructions in different languages cannot be compared (it is actually what he does all the time), so I do not think that Radical Construction Grammar justifies a similar approach.

There are many layers of annotation in UD, and we do have means (linear position; morphological features; presence of functional elements...) to distinguish all the cases discussed here. This makes for interesting annotations in my opinion, not blurring and conflating these layers.

Just my 2 cents on some more specific issues raised by @jnivre : prosody and lexical integrity.

Prosody (so elusive in written data) might be useful to understand some phenomena, but since it acts at a super-word level, it cannot be a sufficient condition for wordhood, not as it is for phrasehood, at least. So many elements which are uncontroversially distinguished as separate syntactic words fuse together prosodically that I do not feel we can use it as a criterion. Let's think of all clitics, which, as discussed with regard to the paper on words by Haspelmath, are almost by definition prosodically not independent.
Lexical integrity: I think that the argument used to retain the use of compound can actually be reversed. In each of the compounds that we discussed, it is hard not to recognise two or more referents: they are clearly identified, these processes are all productive, and we are not really interested in their compositionality or not as this is more a social, conventional factor. So applying lexical integrity to me actually entails keeping these different components separate instead of "obliterating" them into a single entity of dubious internal homogeneity (thinking again of productivity). This should be reinforced by the consideration that UD works at the level of syntactic words, and that most annotators already agree that spelling conventions do not coincide with wordhood. We do see cases in which a word is etymologically a compound, but its components have fused beyond lexical integrity: I would put forth the Latin nuncupo 'call by a name', where nomen 'name' and capio 'to seize' are hardly recognisable anymore and their combination does not represent a synchronic process: this is clearly just one word now, as opposed to Palmemord etc.

nschneid commented 2 weeks ago

This should be reinforced by the consideration that UD works at the level of syntactic words, and that most annotators already agree that spelling conventions do not coincide with wordhood.

I think it is confusing to use one definition of "syntactic word" for purposes of determining the tokenization/units that get dependency relations, and another, more nebulous definition of "word" for purposes of grouping some of those units together via compound. I agree with others in this thread that we should look for a clearer set of criteria for compound.

Inevitably, orthographic conventions will dictate the tokenization to some extent. For some terms, N+N spelling preferences can differ within a language community ("tabletop" or "table-top" or "table top"?). Semantically, it is tempting to say that these are all similar and captured by some broad notion of wordhood, such that even if tokenization differs the compound relation would signal broad wordhood. But if we consider freeness of modification ("[4-legged table] top" perhaps, but not "4leggedtabletop") or coordination ("[kitchen or [dining room]] table"), it seems to me that N+N combinations can extend beyond the realm of even the broadest notion of wordhood, so calling them all compound under our current definition is misleading.

In addition to the extremely frequent N+N combinations, the term "compound" in English can also apply to complex attributive modifiers written with spaces or hyphens, like "4-legged" or "fire-breathing", as discussed in §4 of our Mischievous Nominals paper. Some of these are productive: considering "fire-breathing" and "church-going" as two examples of one pattern, one could argue there is a morphological process at work rather than a syntactic one, with V+obj or V+obl combinations being repackaged (with the V second) as effectively adjectives. Here, though "fire" and "church" are nouns and dependents of "breathing" and "going" respectively (because the participles better reflect the distribution of the phrase), it is hard to say that "fire" and "church" attach as nmod or obl. Perhaps we could reserve compound for such quasi-morphological cases where no other deprel is plausible, and the more common N+N case could be renamed to nmod:compound.

jnivre commented 2 weeks ago

It seems that I did not quite manage to get my points across so let me try to express myself more clearly.

Two cornerstones of the UD annotation framework are (a) lexicalism and (b) dependency. Lexicalism means drawing a strict boundary between word-internal structure, handled in the morphological annotation layer, and word-external structure, handled in the syntactic layer. Dependency means analysing syntax in terms of functional relations between words, rather than constituent structure. Neither of these assumptions is perfectly upheld in the current version of UD, and there is a lot to say about dependency as well, but I will focus on lexicalism for now.

A consequence of lexicalism is that, if language A uses morphology to encode a phenomenon, while language B uses syntax, then the annotations will look radically different even if the function encoded is (essentially) the same. Thus, if language A uses instrumental case and language B uses a preposition to encode that a nominal is an oblique agent phrase in a passive construction, then this will be captured in the annotation by the presence of a feature Case=Ins on the noun in language A and by the presence of a relation labeled case from the noun to the preposition in language B. And the fact that there is some kind of functional equivalence between the two is not captured explicitly anywhere, which some people find problematic. (As an aside, this is the major motivation for the representation used in the upcoming UniDive shared task on morphosyntactic parsing, which tries to abstract over word boundaries by representing both of these encodings by a Case feature.)
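
The contrast can be sketched with toy data. Everything below is hypothetical: the placeholder word forms, the simplified token dictionaries (a stand-in for CoNLL-U rows) and the helper `oblique_marking` are assumptions made for illustration, not real treebank content.

```python
# Language A: the oblique agent is marked morphologically (Case=Ins).
LANG_A = [
    {"id": 1, "form": "NOUN_A", "head": 2, "deprel": "obl", "feats": {"Case": "Ins"}},
    {"id": 2, "form": "VERB_A", "head": 0, "deprel": "root", "feats": {"Voice": "Pass"}},
]

# Language B: the oblique agent is marked by a preposition (case relation).
LANG_B = [
    {"id": 1, "form": "VERB_B", "head": 0, "deprel": "root", "feats": {"Voice": "Pass"}},
    {"id": 2, "form": "PREP_B", "head": 3, "deprel": "case", "feats": {}},
    {"id": 3, "form": "NOUN_B", "head": 1, "deprel": "obl", "feats": {}},
]


def oblique_marking(tokens):
    """List how each obl nominal is marked: by a feature or by a relation."""
    found = []
    for tok in tokens:
        if tok["deprel"] != "obl":
            continue
        if "Case" in tok["feats"]:
            found.append(("feature", "Case=" + tok["feats"]["Case"]))
        for dep in tokens:
            if dep["head"] == tok["id"] and dep["deprel"] == "case":
                found.append(("relation", "case -> " + dep["form"]))
    return found


print(oblique_marking(LANG_A))
print(oblique_marking(LANG_B))
```

The same function shows up in two different annotation layers, and nothing in either annotation states that the two are equivalent.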

Now, in a perfect world, this would be the only case where annotations look radically different even if the function is essentially the same. Unfortunately, we also have cases where the "words" used as annotation units in a treebank are not true morphosyntactic words. Therefore, we have at least three relations that are not true syntactic relations, but rather exist for the purpose of fixing segmentation mismatches, namely fixed, goeswith and compound. Of these, I assume that fixed and goeswith are relatively uncontroversial (even though the criteria for applying fixed are hard to define exactly). In both cases, we are dealing with "words with spaces", and we don't attempt to analyse the internal syntactic structure simply because there is none (synchronically). And the distinction between them is primarily based on whether the spaces are conventional or accidental.

For compound I realise that this may be more debatable, partly because the term "compound" is vague in itself and has been applied to different types of expressions in different languages, but in the context of UD I have always thought of compound as a tool for fixing words with spaces, where the cause of the mismatch is a morphological process rather than grammaticalisation (fixed) or typographical errors (goeswith). And it is only when we have pasted the elements together into a syntactic word that we can combine it with other words using real syntactic relations. Therefore, the fact that current UD doesn't capture the fact that "orange juice" in English and "jus d'orange" in French are in some sense functionally equivalent is in my view analogous to the example of morphological vs. syntactic case markers discussed above.

Now, if people don't think that "orange juice" in English is one syntactic word, then I think we should stop using compound and probably use nmod itself. But to simultaneously maintain that compounding is a (morphological) word formation process and use nmod for the internal relation would violate one of the fundamental assumptions of the UD scheme. That's all I wanted to say.

Finally, to address one of @nschneid's comments, I don't think there are two different definitions of "syntactic word", but I think we have not made explicit enough in the guidelines that, because of the inevitable segmentation mismatches due to standard tokenisers, some of our "syntactic" relations are really tools for stitching together syntactic words. Incidentally, this is also why I think it is wrong -- under the current UD guidelines -- to segment compounds written without spaces in Swedish and German, because they are syntactic words (and their constituent parts are not).

Despite my best intentions, I may have ended up rambling, so do let me know if anything is still unclear. :)

nschneid commented 2 weeks ago

Thanks, that helps! I guess what I am trying to say in regard to English compounds is that

It is hard to argue that "kitchen or dining room + table" is one syntactic word.
It is hard to argue that "weapons + program" is one syntactic word, because "weapons" is morphologically plural and "program" is not.
It is hard to argue that "egg + carton" is a different construction from the above, or it is at least difficult to imagine tests that are straightforward to apply. (Yes, like many N+N combinations the first N resists pluralization, but that is not a hard and fast rule for what we call compounds.)

So, if we were to go with "syntactic word that happens to contain multiple tokens" as the criterion for compound, as opposed to "exhibits what would be traditionally called a compound construction", I think it would be better to regard the construction in the above examples as a kind of nmod.

jnivre commented 2 weeks ago

Thanks, @nschneid. The coordination case occurs in Swedish too, but the standard orthography indicates that it is a case of ellipsis: "köks- eller matsalsbord" = "köksbord eller matsalsbord". In addition to the hyphen, which indicates the missing second part, it is worth noting that "köks" ("kitchen"+s) has the special "s" morph, which only occurs in compound formation.

Is an elliptic analysis conceivable in English too? This would treat "kitchen or dining room table" as elliptic for "kitchen table or dining room table". Through promotion, the element "kitchen" would then take the place of the missing compound head as the first conjunct:

conj(kitchen, table) cc(table, or) compound(room, dining) compound(table, room)

nschneid commented 2 weeks ago

I would not analyze the English expression that way. It is much simpler to treat the coordination as regular coordination of a nested dependent.

Here is a treebank example—"a short paper and pencil screening test":

In the ellipsis analysis, I guess this would be interpreted as "a short paper- and pencil-screening test", i.e. 'a short paper screening test and a short pencil screening test'. But that doesn't fit the meaning at all.

(You could argue "paper and pencil" is a multiword expression, but I don't think we make exceptions for MWEs that are built on general syntactic constructions.)

jnivre commented 2 weeks ago

Good point. I agree that an ellipsis analysis would be plain wrong in this case. Similar examples are marginally possible in Swedish too, but I think most people would use hyphenation to indicate the exceptional status of the coordinated element: "papper-och-penna-test". It seems clear that compounding in English is less constrained than in Swedish and more similar to phrasal modification, but it is not clear whether this is grounds for abandoning the compound relation or just for restricting it to the less flexible cases. Regardless, as long as we're in UD v2, it would be useful to clarify the guidelines for compound.

amir-zeldes commented 2 weeks ago

Hi again - this is obviously a very complex discussion, but I'd just like to point out that if we think that "Palmemordet" is a compound, or perhaps even "nuncupo", then the relation compound will be needed. Because as discussed above, UD tokens basically rely on whitespace in languages that separate words, there will inevitably be cases of things spelled apart which will instantiate the same relations we consider to be compounds within words spelled together.

Whether English noun-noun, or other types of compounds as identified in traditional (non-UD) linguistics should use the compound relation or rather nmod is a different question, and probably one that should be discussed within the English guidelines, not as a universal. In general though, I think it's important to keep some level of alignment with traditional notions, because each step we take away from that makes the data more idiosyncratic and less usable for linguists who are not specialists in dependency parsing, which I think is an important target audience for the resources we work hard to build.