UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Proposed correction to UD spec: remove function word constraint for mwe #120

Closed: spyysalo closed 9 years ago

spyysalo commented 9 years ago

The current UD specification for mwe reads in part

used for certain fixed grammaticized expressions with function words that behave like a single function word. [emphasis mine]

The function word constraint limits the applicability of mwe considerably, leaving "holes" in the UD treatment of compounding, namely MWEs that neither contain nor behave as function words. Further, this constraint is not currently followed in the language-specific documentation for mwe (see e.g. English, Finnish).

Based on discussion with @jnivre, we believe that this part of the specification as now written does not reflect the original intent on the use of mwe (for example, the initial draft used the phrase "multi-word idioms" instead of "[...] expressions with function words"). While it is understood that the specification is generally frozen, we would like to propose a correction, replacing the above with (for example)

used for certain fixed grammaticized expressions that behave like a single word. Examples include compound function words, multiword conjunctions, and adverbs.

All comments are very welcome (including just +1 / -1 on the suggestion).

dan-zeman commented 9 years ago

+1

It seems to me that this is a bug rather than an intentional restriction.

manning commented 9 years ago

I can accept that the current wording is imperfect, and leaves holes, and am happy to see it improved. On the other hand, I think we should also be worried about an overly broad definition which includes too much.

I think I can reconstruct where this wording came from. I could even have been the author of it.

In the large recent computational linguistics literature on multi-word expressions (MWEs), a large portion of what is discussed as MWEs are things like compound noun constructions, particle verb constructions, proper noun phrases, etc. -- things that appear in our "compounding and unanalyzed" section of the dependency table, but are not mwe. The invocation of "function words" was meant to be an attempt to differentiate the things that are mwe from the things that are compound -- which behave like open class lexical items.

So, the question is how to preserve the baby while throwing out the bath water. E.g., would it work to remove the "with function words" but to keep the "behave like a single function word"? I do see that the reference to "grammaticized" in the proposed correction points in the same direction. A sufficiently clear wording along these lines without explicit invocation of "function words" may be sufficient.

It may be worth looking through some typologies of MWEs. The ACL wiki page http://aclweb.org/aclwiki/index.php?title=Multiword_Expressions presents roughly the same typology as the Sag et al. typology paper that started a lot of the recent MWE craze http://lingo.stanford.edu/pubs/WP-2001-03.pdf . In terms of this typology (AFAICS), our mwe covers only Fixed expressions (section 1.1 of the ACL wiki page, section 2.1 of the Sag et al. paper), and none of the other classes. Actually, maybe it corresponds most closely to the ACL wiki "Fixed expressions" category, since the Sag et al. paper suggests including in this category things like the name "Palo Alto", which we would not include. Maybe it would be useful to reference the ACL wiki page, the term "Fixed Expressions" and some of the criteria mentioned there (e.g., no internal variation). Some may not be universal though -- does the no morphosyntactic variation property actually hold for languages with rich morphology?

At any rate, happy to see improvements to the definition.

jnivre commented 9 years ago

Hi Chris,

My first reaction was indeed to remove the first occurrence of “function word(s)” but keep the second, and I think this may be a good enough approximation. For example, it includes “in spite of”, which behaves like a function word although “spite” is not itself a function word, but it excludes the whole range of things like particle verbs, compounds, light verb constructions, etc. The reason we did not go for this solution is that we also wanted to include adverbial expressions like “by and large” and “all right”, and adverbs are not in general considered to be function words. The important thing for me is that they are completely fixed (do not allow inflection, word order permutations, or intervening modifiers). Perhaps we can write “behave like function words or short adverbials”?
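To make the "completely fixed" criterion concrete, here is a minimal sketch in Python. It is an illustration only, not part of the UD specification: the function name `find_fixed_expression` and the example sentences are hypothetical, and the check assumes that fixedness can be approximated by an exact, contiguous, case-insensitive token match (so inflected forms, reordered words, and intervening modifiers all fail to match).

```python
from typing import List, Optional


def find_fixed_expression(tokens: List[str], expression: List[str]) -> Optional[int]:
    """Return the start index of the first exact, contiguous match, or None.

    Hypothetical helper: a match requires the same word forms, in the same
    order, with nothing in between.
    """
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in expression]
    n = len(target)
    for start in range(len(lowered) - n + 1):
        if lowered[start:start + n] == target:
            return start
    return None


BY_AND_LARGE = ["by", "and", "large"]

# Exact contiguous occurrence: matches at index 0.
print(find_fixed_expression("By and large , it worked".split(), BY_AND_LARGE))
# Intervening modifier: no match.
print(find_fixed_expression("by and very large".split(), BY_AND_LARGE))
# Word order permutation: no match.
print(find_fixed_expression("large and by".split(), BY_AND_LARGE))
```

A surface match like this says nothing about whether the expression behaves like a function word or adverbial; that judgment still has to come from the annotator.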

Best, Joakim

(Sent by email in reply to Christopher Manning's comment of 11 Dec 2014.)

spyysalo commented 9 years ago

Thank you for the comment and criticism; I agree that there is a risk of broadening the definition too much. One specific response:

does the no morphosyntactic variation property actually hold for languages with rich morphology?

We can only comment on Finnish, but we'd say that it holds with limited exceptions. (We did a quick survey of the proposed MWEs from TDT (listed under mwe) with @mavela @ammiss @jmnybl)

To elaborate a bit:

  1. Some of the proposed TDT MWEs inflect for mandatory agreement, for example for niin kutsuttu "so-called": niin kutsuttu sääntö "so-called[Sing] rule" but niin kutsutut säännöt "so-called[Plur] rules".
  2. There is a class of suffixes [1] in Finnish that can attach to words of nearly any part of speech (fokuspartikkeli, lit. "focus particle", VISK § 126; in Finnish), and these can also attach to many of the proposed MWEs in TDT (e.g. alun alkaen -> alunkin alkaen or alun alkaenkaan).

However, we feel that if these limited types of morphological variation ruled out mwe, there would be no reasonable UD-conformant way to annotate most of the expressions now proposed for annotation using mwe.
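As a rough sketch of what tolerating this limited variation could look like in a matching tool, the Python below accepts a token if it equals the canonical form or the canonical form plus one focus-particle suffix. Everything here is an assumption for illustration: the names `FOCUS_SUFFIXES`, `token_matches`, and `matches_mwe` are hypothetical, the suffix list is a small subset rather than an inventory of Finnish focus particles, and agreement inflection (as in kutsuttu / kutsutut) is not handled at all.

```python
from typing import List

# Illustrative subset of Finnish focus-particle suffixes; not exhaustive.
FOCUS_SUFFIXES = ("kin", "kaan", "kään")


def token_matches(token: str, canonical: str) -> bool:
    """True if token is the canonical form, optionally followed by one suffix."""
    token, canonical = token.lower(), canonical.lower()
    if token == canonical:
        return True
    return any(token == canonical + suffix for suffix in FOCUS_SUFFIXES)


def matches_mwe(tokens: List[str], canonical: List[str]) -> bool:
    """True if tokens realize the canonical MWE, word by word and in order."""
    if len(tokens) != len(canonical):
        return False
    return all(token_matches(t, c) for t, c in zip(tokens, canonical))


ALUN_ALKAEN = ["alun", "alkaen"]

print(matches_mwe("alun alkaen".split(), ALUN_ALKAEN))      # canonical form
print(matches_mwe("alunkin alkaen".split(), ALUN_ALKAEN))   # suffix on first word
print(matches_mwe("alun alkaenkaan".split(), ALUN_ALKAEN))  # suffix on second word
print(matches_mwe("alun perin".split(), ALUN_ALKAEN))       # different expression
```

Comparing against canonical form plus suffix, rather than stripping suffixes blindly, avoids misfiring on words that merely happen to end in one of these strings.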

With this minor (and perhaps language-specific) exception, we'd be happy to endorse a definition with reference to the Sag et al. Fixed expressions category, further excluding names as in the ACL Wiki variant. (Could we perhaps say something like "those fixed expressions (Sag et al.) that are not in scope of name or compound"?)

[1] The general term "suffix" is used here to avoid the risk of confusion. "Liitepartikkeli" (lit. "attachment particle") is often translated as "clitic", but we don't think these are clitics in the sense that the word is generally used in UD documentation.

dan-zeman commented 9 years ago

The minimal improvement of the definition of mwe would be to explicitly state that there are expressions that some authors would consider MWEs, but they are not annotated using the mwe relation in UD. Linking to the two sources that Chris mentioned would also be highly useful.

spyysalo commented 9 years ago

Here's a conservative revised suggestion:

[...] used for certain fixed grammaticized expressions that behave like function words or short adverbials.

The scope of mwe annotation corresponds roughly to the fixed expressions category of Sag et al., but excludes any relations in scope of name or compound. Additionally, limited morphosyntactic variation may be allowed in exceptional cases.

This only removes the first "[with] function words" (as suggested by @manning), expanding the second to "behave like function words or short adverbials" (@jnivre), and adds reference to the Sag et al. classification (@manning @dan-zeman).
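For readers who want to see what such an annotation looks like in data, the following Python sketch collects mwe groups from a hand-written example. The rows are a simplification of four CoNLL-U columns (ID, FORM, HEAD, DEPREL), and the analysis of "He left in spite of the rain" is an assumed illustration in which the first word of the expression heads the others via mwe; the function name `mwe_groups` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (ID, FORM, HEAD, DEPREL): simplified from the CoNLL-U columns of the same
# names. The sentence and its analysis are an assumed example.
ROWS: List[Tuple[int, str, int, str]] = [
    (1, "He", 2, "nsubj"),
    (2, "left", 0, "root"),
    (3, "in", 7, "case"),
    (4, "spite", 3, "mwe"),
    (5, "of", 3, "mwe"),
    (6, "the", 7, "det"),
    (7, "rain", 2, "nmod"),
]


def mwe_groups(rows: List[Tuple[int, str, int, str]]) -> List[List[str]]:
    """Return each mwe group as a list of word forms in sentence order."""
    forms: Dict[int, str] = {tid: form for tid, form, _, _ in rows}
    members: Dict[int, List[int]] = defaultdict(list)
    for tid, _, head, deprel in rows:
        if deprel == "mwe":
            members[head].append(tid)
    return [
        [forms[i] for i in sorted([head] + deps)]
        for head, deps in sorted(members.items())
    ]


print(mwe_groups(ROWS))  # [['in', 'spite', 'of']]
```

In this analysis the whole group attaches to "rain" through the case relation on "in", which is the sense in which the expression behaves like a single function word.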

spyysalo commented 9 years ago

If there are no further comments, I'd like to propose to tentatively update the documentation as suggested above and then close this issue. (I'll wait a few more days to confirm.)

spyysalo commented 9 years ago

Updated as suggested and closing now; feel free to reopen if any issue remains.