UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0
"must see" #753

Closed ethanachi closed 3 years ago

ethanachi commented 3 years ago

An interesting border case happens here:

http://match.grew.fr/?corpus=UD_English-EWT@2.7&custom=5feef22b3ff08&eud=yes

where "must see" is pretty clearly a compound with an adjectival function, but the validator assumes that AUX (and specifically AUX) can never be in a compound...

My feeling is that the compound analysis is correct and the validator is making false assumptions, but curious as to others' opinions.

nschneid commented 3 years ago

Yeah I have noticed similar issues where a phrase with compositional structure is converted to an attributive modifier, like "make-or-break decision" and "easy-to-use tool"—should we recognize the internal structure reflecting the derivational origin of the expression, or just use compound across the board?

nschneid commented 3 years ago

Related: UniversalDependencies/docs#648, UniversalDependencies/docs#478, UniversalDependencies/docs#525

dan-zeman commented 3 years ago

Hmm, is it still AUX if it enters a compound? Isn't it VERB or something then?

nschneid commented 3 years ago

I think it's a compound in the sense that "church-going" is a compound in "he is a church-going person". But is compound intended for a narrower use of combining two words with like parts of speech?

dan-zeman commented 3 years ago

I don't think that the guidelines explicitly ban auxiliaries from the compound relation (although I did not check carefully now). I would expect auxiliaries to be more likely to participate in fixed than in compound (because they are function words) but must see is obviously not what UD annotates as fixed. The reason why I made the validator report AUX-compound as an error was (if I recall it correctly) that some treebanks annotated light verbs as AUX, which is wrong.

Are there other compounds that contain auxiliaries? Or should we say that must see is an English-specific exception?

dan-zeman commented 3 years ago

I transferred this issue to the docs repository because it has wider impact than just English EWT.

nschneid commented 3 years ago

Are there other compounds that contain auxiliaries? Or should we say that must see is an English-specific exception?

Off the top of my head I can think of:

To make matters worse, some of these MWEs can function as nouns:

Not sure how to analyze those in UD.

amir-zeldes commented 3 years ago

Will post this also to the other issue, but for the record I think it's a normal compound with a phrasal modifier, in which all but the local head should keep their pos and deprel, so:

must/AUX <-aux- see/VERB <-compound- attractions
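
For readers who think in CoNLL-U terms, the arrow notation above can be rendered as a three-token fragment. This is only a sketch: the helper `to_conllu` is hypothetical, lemmas and features are left as `_`, and attaching "attractions" as `root` is an assumption made so that the fragment forms a complete tree.

```python
# Hypothetical helper: render (form, upos, head, deprel) tuples as CoNLL-U rows.
def to_conllu(tokens):
    rows = []
    for i, (form, upos, head, deprel) in enumerate(tokens, start=1):
        # Column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        rows.append("\t".join(
            [str(i), form, "_", upos, "_", "_", str(head), deprel, "_", "_"]
        ))
    return "\n".join(rows)


# "must see attractions": must is aux of see, see is compound of attractions.
# The root attachment of "attractions" is an assumption for illustration.
must_see = [
    ("must", "AUX", 2, "aux"),
    ("see", "VERB", 3, "compound"),
    ("attractions", "NOUN", 0, "root"),
]

print(to_conllu(must_see))
```

Under this encoding the auxiliary keeps its tag and relation inside the modifier, and only the local head "see" carries `compound`.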

amir-zeldes commented 3 years ago
  • a must-see television show
  • Those celebrities are has-beens

I think it depends on tokenization. If they are hyphenated and left as single tokens, my preference would be to treat them as ADJ/amod and NOUN/root:

must-see/ADJ <-amod- shows
those <-det- has-beens/NOUN/root

Basically I'd try to punt this to morphology and say "this has some morphologically interesting internal structure, but UD syntax stops at the tokens" (after all, we don't analyze derivation deprels either).

nschneid commented 3 years ago

@amir-zeldes But what if they're not hyphenated?

dan-zeman commented 3 years ago

Is it a rule that they should be hyphenated in English? If it is, then goeswith might be an option.

nschneid commented 3 years ago

Some might prefer the hyphen, but it's not as obligatorily one word as "oversleep" or "anti-war"/"antiwar", where the first part is clearly a prefix so separating with a space is an error. From a COCA search, there are plenty of instances of "must haves", even in edited sources like magazines.

ethanachi commented 3 years ago

for what it's worth, this is an open class ("must-see", "must-have", "must-visit", "can-do") and is often found without spaces, so goeswith doesn't seem appropriate here.

I agree with @amir-zeldes' proposal, although "must-see" as a modifier seems looser than a normal compound, so it still seems best to analyze it as ADJ...

amir-zeldes commented 3 years ago

@amir-zeldes But what if they're not hyphenated?

If they're not hyphenated, I think there is a clear case where we retain the internal analysis, as in "must see movie", and a less clear case where I am more ambivalent, in which a categorical conversion results in the original governing POS being altered, as in the "has beens" case.

For "must see movie", I think the argument should hold that a. it is a nominal compound where the modifier happens to be a phrase, and b. it's a verbal structure "wrapped" in the modifier NP (the conversion). Because of this, I think the relationship between "see" and "must" is maintained, and I would do:

must/AUX <-aux- see/VERB <-compound- movie

For "has beens" it's a little more complicated, because a single word is doing "double duty" (a little like wh pronouns in free relatives). On the one hand, the morphological stem "been" preserves its verbal nature and takes a normal auxiliary "has". On the other hand, it has been converted into a noun and can now take an s-plural and fulfill an argument role. Our two options are then:

  1. has <-aux- beens/NOUN/root
  2. has <-compound- beens/NOUN/root

You can tag 'has' however you like, but I think morphosyntactically it's still AUX inside the compound, so I'd tag it as such (this is consistent with considering 'see' above to be a VERB). Between these two options, losing the internal function or the external function, I would prefer option 2., for two reasons:

  1. We have good precedents for preferring external or "higher level" functions. Choosing compound is analogous to free relatives prioritizing the matrix clause function.
  2. It's odd to have a NOUN with an aux dependent, and at least for me, less odd to have a morphological auxiliary serve as a compound modifier. In a meta-linguistic context, the latter should be possible anyway (we can say 'the "be conjugation" in Latvian')
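
The two options can be written out as small token lists to make the trade-off concrete. The sketch below is illustrative only: the dict layout and the helper `nouns_with_aux` are hypothetical, and the check simply reports any NOUN that governs an `aux` dependent, which is the oddity mentioned in reason 2.

```python
# Hypothetical check: list forms of NOUN heads that have an aux dependent.
def nouns_with_aux(tokens):
    by_id = {tok["id"]: tok for tok in tokens}
    hits = []
    for tok in tokens:
        head = by_id.get(tok["head"])  # None for the root (head == 0)
        if tok["deprel"] == "aux" and head is not None and head["upos"] == "NOUN":
            hits.append(head["form"])
    return hits


# Option 1: keep the internal function (aux).
option_1 = [
    {"id": 1, "form": "has", "upos": "AUX", "head": 2, "deprel": "aux"},
    {"id": 2, "form": "beens", "upos": "NOUN", "head": 0, "deprel": "root"},
]

# Option 2: prefer the external function (compound); 'has' stays AUX.
option_2 = [
    {"id": 1, "form": "has", "upos": "AUX", "head": 2, "deprel": "compound"},
    {"id": 2, "form": "beens", "upos": "NOUN", "head": 0, "deprel": "root"},
]

print(nouns_with_aux(option_1))
print(nouns_with_aux(option_2))
```

Option 1 yields a NOUN with an `aux` child, while option 2 avoids that at the cost of an AUX-tagged compound dependent.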

nschneid commented 3 years ago

Option 2 seems a reasonable compromise (where the head needs to be a noun for external reasons), because the AUX tag preserves something about the internal derivation even though compound reflects the derived status. Based on the POS tags we can see it is a "special" kind of compound, and further analysis of the internal/external syntax such as with SUD could be added later. As long as the validator is willing to be flexible about POS tags of compound dependents.

(A small point: regarding

less odd to have a morphological auxiliary serve as a compound modifier. In a meta-linguistic context, the latter should be possible anyway (we can say 'the "be conjugation" in Latvian')

I'm not sure a metalinguistic use of "be" should remain AUX—if we're talking about the linguistic entity shouldn't it be NOUN?)

dan-zeman commented 3 years ago

I'm not sure a metalinguistic use of "be" should remain AUX—if we're talking about the linguistic entity shouldn't it be NOUN?

No. If a word is mentioned rather than used, it retains its dictionary POS category (see here). We could debate in this case whether it is AUX or VERB, but it is not NOUN.

In general, it looks like the validator cannot exclude AUX from the set of possible compound dependents. I will look into removing the restriction.
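
To make the restriction concrete, here is a rough sketch of what such a check and its relaxation might look like. This is not the validator's actual code: the function name, the `ban_aux` switch, and the token layout are all hypothetical.

```python
# Hypothetical check: report AUX tokens attached with the compound relation.
def check_compound_dependents(tokens, ban_aux):
    problems = []
    for tok in tokens:
        base_deprel = tok["deprel"].split(":")[0]  # ignore subtypes such as compound:prt
        if ban_aux and tok["upos"] == "AUX" and base_deprel == "compound":
            problems.append(
                f"token {tok['id']} ({tok['form']}): AUX attached as compound"
            )
    return problems


# "has beens" under option 2 from the discussion above.
has_beens = [
    {"id": 1, "form": "has", "upos": "AUX", "deprel": "compound"},
    {"id": 2, "form": "beens", "upos": "NOUN", "deprel": "root"},
]

strict = check_compound_dependents(has_beens, ban_aux=True)
relaxed = check_compound_dependents(has_beens, ban_aux=False)
print(strict)
print(relaxed)
```

With the ban switched on, "has" is reported; with it off, the same tree passes, which is the behaviour the thread is asking for.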

Stormur commented 3 years ago

Just chiming in: can't a productive construction like must-VERB and all others of the same kind be analyzed just as an acl and its internal structure be left transparent as it is, i.e. with the VERB as its head? And this would be the same for make or break, while easy to use would just be an ADJ plus its argument? The same for church going, where the word order is probably given by using a participial form instead of the "bare infinitive". In general, English appears to have this way of using verb phrases in an attributive function; no reasons for compounds here. In my opinion, orthographic conventions like hyphens do not seem to come in the way of such an analysis; they rather seem to be an aid for a correct reading. (Sorry if I was repeating some of the previous points.)

nschneid commented 3 years ago

There are a few matches to that analysis currently, e.g. "fly-by-night program": http://match.grew.fr/?corpus=UD_English-EWT@2.7&custom=5ff47e4953be0

Is there a good test for whether attributive modifiers behave like participial clauses or like compound modifiers? Maybe idiomatic ones like "fly-by-night" resist being moved after the noun or made predicative. But maybe we're just tempted to use the term "compound" because of the lexicalization of the expression and we should treat it instead like a more compositional phrase.

dan-zeman commented 3 years ago

As @amir-zeldes showed above, most of the examples in this thread do not need the compound relation with the auxiliary as the dependent. However, if the head acquires nominal morphology (the celebrities are have beens), it cannot be tagged VERB any more. And as a NOUN, it should not have auxiliaries. (The validator would not flag them because they could legitimately occur in copular sentences such as he has been a celebrity; but this is a different case.)

nschneid commented 3 years ago

Right, I was wondering about the attachment of the idiomatic expression to its head noun outside the case where the verb is coerced to a noun and pluralized.

must/AUX <-aux- see/VERB <-compound- movie

Why compound and not acl?

dan-zeman commented 3 years ago

must/AUX <-aux- see/VERB <-compound- movie

Why compound and not acl?

My first pick would be acl here. But as a non-native speaker, I don't feel confident about the tests for compoundness in English.

amir-zeldes commented 3 years ago

I think my assumption that it's compound was mainly a knee-jerk reaction to it being a modifier on the left side of a noun in English, but I don't feel terrible about acl here, either. But then ideally all phrasal modifiers of nouns should be like that, no? Are there some cases where that doesn't work? For example, what if it's a PP like:

A by the book attitude

I think in this case acl would be wrong, no?

nschneid commented 3 years ago

Not sure why all phrasal modifiers of nouns should work like that. If in UD terms a PP is a nominal, then it's like a noun modifying a noun, the prototypical compound.

amir-zeldes commented 3 years ago

I think on some level both of these cases are "the same" - you're taking a complex phrase, which normally wouldn't pre-modify a noun, and wrapping it in something (which I initially interpreted to be an NP conversion) so it can serve as a modifier. So in phrasal terms, I thought we have:

(NP (NP (VP (must) (see))) movie)
(NP (NP (PP (by) (NP (the) (book)))) attitude)

I think coming from a Germanic linguistics perspective this is somewhat natural, since in languages like German, the commonality of these cases with regular compounds is a little more transparent. But in English you could indeed argue that these two are not the same construction, and the first one has a clausal modifier. Still, my gut feeling is that they are the same (maybe this is just a German-bias), and giving them different analyses obscures that. The basic structure of both is:

ANY-SYNTACTIC-THING-YOU-LIKE + NOUN

:)

nschneid commented 3 years ago

ANY-SYNTACTIC-THING-YOU-LIKE + NOUN

I can see an argument for an "attributive phrase" construction generalizing over amod, compound, nmod:poss, and maybe some other things. But I'm not sure UD's lexicocentric approach is suited to such generalizations.

Another complication that I realized is that prenominal participial forms tagged as VERB (PTB VBG, VBN) are analyzed as amod. If we were to attach must-see as the acl of movie, why should the relation in raiding forces or foiled plot be amod rather than acl? Would winning strategy and award-winning strategy be treated differently?

nschneid commented 3 years ago

@lorislevin thinks these are all synthetic compounds, and per Bill Croft's interpretation of UD as a representation of information packaging, "must" in "must-see" is not packaged as an auxiliary nor is "award" in "award-winning" an object. She would just use compound.

amir-zeldes commented 3 years ago

Another complication that I realized is that prenominal participial forms tagged as VERB (PTB VBG, VBN) are analyzed as amod. If we were to attach must-see as the acl of movie, why should the relation in raiding forces or foiled plot be amod rather than acl? Would winning strategy and award-winning strategy be treated differently?

Ooh, yes, that would be bad. I much prefer for these -ing cases to be amod, since many are lexicalized into adjectives, and this way even if we quibble about POS, at least the syntax always has them as amod. I'm fine with compound for the phrasal modifier cases and amod for gerund modifiers, it will also keep things more in line with how it's done for German.

Stormur commented 3 years ago

I think my assumption that it's compound was mainly a knee-jerk reaction to it being a modifier on the left side of a noun in English, but I don't feel terrible about acl here, either. But then ideally all phrasal modifiers of nouns should be like that, no? Are there some cases where that doesn't work? For example, what if it's a PP like:

A by the book attitude

I think in this case acl would be wrong, no?

It looks like a simple nmod. There will be some reasons for which its (preferred?) position is before the noun, instead of something like an attitude by the book, but the relation looks the same: it's a mod, it's a noun phrase, hence nmod.

From what I understand, a compound is a kind of flatter relation (and it is indeed listed under MWE relations alongside flat)... naively and crudely expressed, where both elements participate in defining an entity, but neither is truly subordinated in a "classical" sense. It is true that the line might be thin. But otherwise, everything in an attributive position would count as compound! Hmmm, probably compound is a relation in search of a better definition...

Stormur commented 3 years ago

ANY-SYNTACTIC-THING-YOU-LIKE + NOUN

I can see an argument for an "attributive phrase" construction generalizing over amod, compound, nmod:poss, and maybe some other things. But I'm not sure UD's lexicocentric approach is suited to such generalizations.

If I'm not mistaken, the PDT annotation style uses indeed a more generic ATR relation. But UD makes a difference between verb and noun phrases in the same function... maybe there's an asymmetry between the labels acl and amod and nmod which could be addressed?

Another complication that I realized is that prenominal participial forms tagged as VERB (PTB VBG, VBN) are analyzed as amod. If we were to attach must-see as the acl of movie, why should the relation in raiding forces or foiled plot be amod rather than acl? Would winning strategy and award-winning strategy be treated differently?

We are facing a similar "problem" in Latin too, and our strategy is that, as long as there is no other good reason not to do so, forms of verbal origin stay verbal (and you seem to imply that too, noticing that they are tagged VERB). So we would have winning strategy and award-winning strategy treated identically as acls. This actually takes into account exactly this kind of "latent verbality": it is still possible for most of such participial forms to retain arguments. Then, there are indeed some forms of verbal origin which have completely lost their "latent verbality", e.g. altus 'tall, high', originally perf. part. of alo 'to nourish', but showing no perfect aspect nor passive voice anymore.

I mean, if winning has the amod relation, it should be tagged ADJ. Having VERB but amod looks like a mismatch to me...
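
To see how widespread this tag/relation pairing is in a given treebank, one could scan its CoNLL-U file for tokens tagged VERB but attached as amod. The following is a minimal sketch: `verb_amod_tokens` is a hypothetical helper, and the sample encodes "raiding forces" with placeholder lemmas and an assumed `root` attachment for "forces".

```python
# Illustrative sample: "raiding forces" with the participle tagged VERB/VBG
# and attached as amod. Lemmas are left as "_"; root is an assumption.
SAMPLE = """\
# text = raiding forces
1\traiding\t_\tVERB\tVBG\t_\t2\tamod\t_\t_
2\tforces\t_\tNOUN\t_\t_\t0\troot\t_\t_
"""


def verb_amod_tokens(conllu_text):
    """Return the forms of tokens whose UPOS is VERB and whose deprel is amod."""
    hits = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        if cols[3] == "VERB" and cols[7].split(":")[0] == "amod":
            hits.append(cols[1])
    return hits


print(verb_amod_tokens(SAMPLE))
```

A count of such hits would show how much data is affected by a choice between keeping amod and switching to acl.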

Stormur commented 3 years ago

@lorislevin thinks these are all synthetic compounds, and per Bill Croft's interpretation of UD as a representation of information packaging, "must" in "must-see" is not packaged as an auxiliary nor is "award" in "award-winning" an object. She would just use compound.

Question: isn't an award-winning book just another formal way to express a book which won/has won/... an award? A kind of implicit vs. explicit construction?

nschneid commented 3 years ago

Question: isn't an award-winning book just another formal way to express a book which won/has won/... an award? A kind of implicit vs. explicit construction?

At the level of semantic predicate-argument structure, yes. This is the sort of thing AMR (for example) aims to represent. But is the syntactic encoding the same just because it can be paraphrased? In UD we certainly wouldn't annotate "the winning of awards" (obl) the same as "win awards" (obj). "Award-winning" on the surface looks pretty different from both of these due to the word order difference.

amir-zeldes commented 3 years ago

it's a mod, it's a noun phrase, hence nmod

I think by that criterion, all nominal compounds would be nmod, but what we have here is different IMO, both because of argument structure and because tests like pronominalization and interrogation can reveal differences. In nmod, we have a case marked modifier, whose argument structure relates an internal object (the location in a locative preposition) to an external argument (the thing being described as located in a locative preposition). For example:

  • The book by the table

In this case 'by' specifies a relational structure. If you interrogate this, you can do:

  • The book where? -> by the table

But this is strange:

  • What kind of book? -> ??by the table; ?? a by the table book

For compounds, things are different:

  • Coffee table book
  • What kind of book? A coffee table book.

The "attitude" example follows the same pattern:

  • What kind of attitude? A by the book attitude.

But not:

  • An attitude where? ??by the book

And as @nschneid points out, movement behavior is also different - normal English nmods cannot be placed between the article and noun they modify, but compound modifiers can. The analysis of phrasal modifiers as underlyingly converted NPs also conveniently accounts for pure nominalized cases like "they did a by-the-book" or "it's a must-see", where the article indicates NP status more clearly.

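For a more detailed discussion and arguments supporting an NP analysis I also recommend this cross-linguistic theoretical paper:

  • [Pafel, Jürgen (2017) Phrasal compounds and the morphology-syntax relation](https://langsci-press.org/catalog/view/156/901/799-2)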

nschneid commented 3 years ago

Indeed, compounds tend to prefer "kind" semantics, but I don't think that's a conclusive test for the construction. You can refer to a "morning meeting" which paraphrases as "a meeting in the morning", or even "our Tuesday 3pm meeting".

nschneid commented 3 years ago

This is interesting; I will have to look at it more closely. I think we should consider whether well-formedness of the modifier, or whether it forms a complete phrase (that could also be used in other environments), should be UD's criterion for analyzing its internal structure compositionally. For example:

Stormur commented 3 years ago

Question: isn't an award-winning book just another formal way to express a book which won/has won/... an award? A kind of implicit vs. explicit construction?

At the level of semantic predicate-argument structure, yes. This is the sort of thing AMR (for example) aims to represent. But is the syntactic encoding the same just because it can be paraphrased? In UD we certainly wouldn't annotate "the winning of awards" (obl) the same as "win awards" (obj). "Award-winning" on the surface looks pretty different from both of these due to the word order difference.

Indeed, compounds tend to prefer "kind" semantics, but I don't think that's a conclusive test for the construction. You can refer to a "morning meeting" which paraphrases as "a meeting in the morning", or even "our Tuesday 3pm meeting".

The basic syntactic encoding would (correctly, in my opinion) be the same, but only on a higher level, i.e. in absolute terms of relations and dependencies! As you also notice, the sequence of the involved elements, and the presence or absence of some, is on the contrary completely different, and we have to regard this as a determinant syntactic factor, too. So the two constructions would still be clearly differentiable, in fact:

These are all clausal encodings of, yes, probably the same thing, but they are discriminable: acl > or < root? ; presence or absence of relative elements and/or determiners; obj > or < acl? The syntactic tree reflects this. So, they still do not appear the same.
Indeed the winning of awards needs to be differently annotated, and it is, because it is a nominal encoding (so we would see nmod instead of acl).

it's a mod, it's a noun phrase, hence nmod

I think by that criterion, all nominal compounds would be nmod

Indeed, I am quite skeptical about the general definition of compound (I also could not find references for the cited "X0 compounding"). Moreover, the specific guidelines for English somehow contradictorily state that it is the case for constructions "that use regular syntactic relations"... why then not amod and nmod?

Let's take the example of phone book: there's clearly a head which is book, since the intended object is not a phone. So this should be represented by a subordination, in this case nmod. Further, this is the most consistent treatment from a multilingual perspective. If I take the exact same expression in Italian, I have elenco del telefono, lit. 'list of the phone', with an undisputed nmod for del (= di + il) telefono. Or in Greek: τηλεφωνικός κατάλογος tilefonikòs katàlogos, lit. 'telephonic list': again, a modifier, this time an amod. So we observe a preference either for noun or for adjectival modifiers, but the structure stays *mod(X,Y). In English, it just happens that morphology is minimal, so such modifiers often appear to be merely juxtaposed. There is surely some difference in sense if I say e.g. phone's book, but again, all similar Italian expressions like elenco per/con/su/a riguardo di il telefono, varying the connector, are always seen by UD as nmods.

It would be strange to treat cases with "absent morphology" as compounds, and with "present morphology" (always the case e.g. in Latin) instead as *mods. Probably the traditional terminology here interferes with our understanding of the annotation of such constructions.

[...] but what we have here is different IMO, both because of argument structure and because tests like pronominalization and interrogation can reveal differences. In nmod, we have a case marked modifier, whose argument structure relates an internal object (the location in a locative preposition) to an external argument (the thing being described as located in a locative preposition). For example:

* The book by the table

In this case 'by' specifies a relational structure. If you interrogate this, you can do:

* The book where? -> by the table

But this is strange:

* What kind of book? -> ??by the table; ?? a by the table book

For compounds, things are different:

* Coffee table book

* What kind of book? A coffee table book.

The "attitude" example follows the same pattern:

* What kind of attitude? A by the book attitude.

But not:

* An attitude where? ??by the book

I am not sure I am following you completely on these points. I think this all boils down to the semantics of the relations between the head noun and its modifiers and to the degree of lexicalization of some co-occurrences, which then makes some questions possible, others weird, and others unacceptable. But in the end, the basic syntax is the same. There surely are different kinds of nmods, and they are articulated in different manners, as said above: choice of a preposition; possession marker; juxtaposition; ...

And as @nschneid points out, movement behavior is also different - normal English nmods cannot be placed between the article and noun they modify, but compound modifiers can. The analysis of phrasal modifiers as underlyingly converted NPs also conveniently accounts for pure nominalized cases like "they did a by-the-book" or "it's a must-see", where the article indicates NP status more clearly.

I am quite confused here. What is defined as a "normal" (English) nmod? As an "external" observer I see that complex modifiers, be they clauses or noun phrases, just need to be rearranged in different ways according to the position they have with respect to their head. Such nominalised cases look like kinds of nominal ellipses to me.

Also, I see some parallelism here between the differences in meaning that an adjective assumes in Italian depending on its position, either before or after the noun. In general, this choice is free, but it may convey different meanings:

But in the end, they are both amods. I am sorry not to have a decent reference at hand for this phenomenon...

For a more detailed discussion and arguments supporting an NP analysis I also recommend this cross-linguistic theoretical paper:

* [Pafel, Jürgen (2017) Phrasal compounds and the morphology-syntax relation](https://langsci-press.org/catalog/view/156/901/799-2)

I will check it for sure! Thanks for the reference! :)

nschneid commented 3 years ago

I am quite confused here. What is defined as a "normal" (English) nmod?

As discussed under Two Nominals, most current uses of nmod in English bear prepositional or possessive marking, which contrasts with compound.

The compound strategy of modifying a nominal with an unmarked nominal (in the same NP, no separate determiners) is probably more natural for Germanic languages than Romance languages/Greek.

amir-zeldes commented 3 years ago

Nominal compounds are certainly a type of nominal modification, so one could have named them nmod:xyz, but please keep in mind that there are also non-nominal compounds (complex verbs for example, and in English this is used for phrasal verb particles like "pick up").

While compounds may not be a frequent or very useful category for describing Romance languages or Greek (which univerbizes compounds in X-o-Y into a single token, much like German with linking elements), many languages use the compound relation for a variety of purposes (construct state in Semitic, incorporation), so I would be very reluctant to remove it. I'll write more on this in the issue on compounds (#757) to avoid repetition, but many syntactic criteria distinguish "phone book" from "book of phones" - you can add determiners to modifiers only in nmod (a book of those phones/ * a those phone book), you can easily pluralize the modifier (a book of your phone/phones, but ??phones book), you can pronominalize nmods but not compound modifiers... The list goes on.

In any case I think the core cases of compound in English are quite clear and in my opinion both distinct from case marked nmods on syntactic grounds and useful for practical reasons.