UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Infixes #384

Closed ermanh closed 6 years ago

ermanh commented 7 years ago

The discussion of mesoclitics in Portuguese in #315 raises a question similar to one we had about how infixes should be treated in UD. From what I can gather from the Portuguese discussion, the solution for now would simply be to separate them as follows?

1-2 xxxINFIXxxx
1   xxxxxx
2   INFIX

It seems the only unavoidable(?) drawback is that we would always have to display the original word/sentence alongside the tree because the tree itself would have the order jumbled.

In Cantonese we have the word 鬼 gwai2 'ghost, devil' which can be infixed into a bisyllabic, non-compositional word with an emphatic function (analogous to English 'fucking' in 'abso-fucking-lutely'). The interrogative 乜嘢 mat1je5 'what' can also be infixed, with a "What do you mean?" meaning:

論盡          論鬼盡               論乜嘢盡
leon6zeon6    leon6-gwai2-zeon6    leon6-mat1je5-zeon6
'clumsy'      'downright clumsy'   'What do you mean clumsy?'

To make things even more interesting, we can have nested infixes, with 鬼 gwai2 'ghost, devil' infixed inside 乜嘢 mat1je5 'what':

論乜鬼嘢盡
leon6-mat1-gwai2-je5-zeon6
'What the heck do you mean clumsy? / Clumsy how?!'

We haven't encountered this last case in our own data yet, but theoretically, should it ideally be treated flat as in the following, rather than recursively (which I imagine would be too messy)?

1-3 論乜鬼嘢盡   leon6-mat1-gwai2-je5-zeon6
1   論盡      leon6zeon6
2   乜嘢      mat1je5
3   鬼       gwai2

Thanks!

jnivre commented 7 years ago

This is not different in principle from many other cases where the original form of the token is not obtainable by a simple concatenation of the corresponding words, like French "du" = "de le". However, the crucial question is whether the segments surrounding the infixed segments have to be regarded as one word, or whether they can be analysed with, say, the "compound" relation. In the simple case, we could then get a structure like the following:

xxx1 INFIX xxx2
compound(xxx1, xxx2)
advmod(xxx1, INFIX)

This is assuming that the infixed element is a dependent of the host. If this is not the case, we will get a non-projective tree but we will still avoid the complex word segmentation. So, is there evidence that the link between the surrounding elements is stronger than for, say, noun-noun compounds in English? If not, then I think a compound analysis would be preferable.
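
To make that concrete for the Cantonese example, a rough CoNLL-U-style sketch of 論鬼盡 under the compound analysis might be (assuming 論盡 is split into two syntactic words with the first part as head; the segmentation and the multiword token line are only illustrative):

1-3 論鬼盡   leon6-gwai2-zeon6
1   論      leon6
2   鬼      gwai2
3   盡      zeon6

with compound(論, 盡) and advmod(論, 鬼), so the infix is a projective dependent of its host and the surface order is preserved without any reordering.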

martinpopel commented 7 years ago

Another question is whether, if xxx1 and xxx2 appear without the INFIX, the form will still be analyzed as two syntactic words (compound(xxx1, xxx2)) or just as one word. E.g. in English, we probably don't want to analyze abso-lutely as two syntactic words just because you can infix it in some slang. :-)

jnivre commented 7 years ago

Good point. It seems like a good principle would be to posit multiple words only when required by the (exceptional) circumstances.

ermanh commented 7 years ago

The infix can occur inside both actual compounds and non-compounds. In the non-compound case for Cantonese, they're adjectives that are considered one word (some may be compounds historically, but some just happen to be bisyllabic and cannot be broken apart at all; 論盡 leon6zeon6 'clumsy' above is one such case).

So it sounds like the proposal is: if there is no infix, treat it as one word (assuming it is not a compound to begin with), but when the word occurs with an infix, split it apart and connect the two parts with compound?

Should there perhaps be a standardized sub-label? Just compound by itself (or using this label category by definition as an MWE type dependency) seems misleading if the word is non-compositional to begin with. It seems something like goeswith might have been great if it weren't dedicated to indicating tokenization errors.

Also, should the head of the not-really-compound that's split apart by an infix be the first part by default?

dan-zeman commented 7 years ago

How large or how open is the set of possible infixes? If it is a closed set, then we could also say that this is just an operation of morphology. Maybe unusual and Cantonese-specific, but still word-internal. If that is the solution, then the lemma will be 論盡 for both 論盡 and 論鬼盡, and a language-specific morphological feature will be defined to annotate the infixed case. The feature could be something simple, e.g. Infix=Gwai2. I think Finnish can serve as an example of a language having features of this kind, right, @fginter?

And in the interrogative case, we could combine Infix=Gwai2 with PronType=Int.
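
To sketch what that could look like in the relevant CoNLL-U columns (FORM, LEMMA, UPOS, FEATS; the UPOS and the feature name are only assumptions, since no such feature exists yet):

論鬼盡    論盡    ADJ    Infix=Gwai2

i.e. a single syntactic word whose lemma strips the infix and whose feature records it.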

fginter commented 7 years ago

Not a native speaker, but I don't think Finnish has infixes of this kind. @jmnybl can correct me.

dan-zeman commented 7 years ago

I should have used more accurate wording. I did not necessarily mean Finnish has infixes. What I meant was that the Finnish UD data contains language-specific features whose values are actual Finnish morphemes, e.g. Clitic=Han (Ka, Kaan...) or Derivation=Lainen (Llinen, Minen...).
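
For a rough illustration of the pattern (a hypothetical row, not copied from the Finnish treebank), a form carrying the -han clitic could be annotated along the lines of:

onhan    olla    VERB    Clitic=Han|...

with the other morphological features omitted here.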

fginter commented 7 years ago

Ah, sorry, I misunderstood your question. Yes, we have those features for the various derivations and clitics. The tag is pretty much the derivation suffix / clitic itself. Easy to read and remember.

ermanh commented 7 years ago

Thanks for all the responses! Currently we're not doing features or lemmas, but perhaps we might reconsider (@kimgerdes ?). The inventory of infixes is indeed quite small (all with the same emphatic function as 鬼; they're basically all swear words and unlikely to show up in corpora).

Otherwise, the best alternative (?) seems to be to follow the Portuguese mesoclitic example of

1-2 xxxINFIXxxx
1   xxxxxx
2   INFIX

with the caveat that the tree itself would unfortunately not reflect the word form/order correctly.
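
Concretely, for 論鬼盡 that would give a segmentation like the following (POS, lemmas, and dependencies omitted):

1-2 論鬼盡   leon6-gwai2-zeon6
1   論盡    leon6zeon6
2   鬼      gwai2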

jnivre commented 7 years ago

A dependency tree is in principle unordered and the numerical order of indices does not encode word order

jnivre commented 7 years ago

Sorry. Hit the wrong button. What I meant to say was that it is not assumed in general that concatenating the words in numerical order gives the original word order. It is precisely for discrepancies like these (among other things) that we have the multiword tokens.
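
For example, the French case mentioned above is handled exactly this way: the multiword token line keeps the surface form while the syntactic words are listed underneath (most columns omitted here):

1-2 du
1   de
2   le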