Closed ermanh closed 6 years ago
This is not different in principle from many other cases where the original form of the token is not obtainable by a simple concatenation of the corresponding words, like French "du" = "de le". However, the crucial question is whether the segments surrounding the infixed segments have to be regarded as one word, or whether they can be analysed with, say, the "compound" relation. In the simple case, we could then get a structure like the following:
xxx1 INFIX xxx2
compound(xxx1, xxx2)
advmod(xxx1, INFIX)
This is assuming that the infixed element is a dependent of the host. If this is not the case, we will get a non-projective tree but we will still avoid the complex word segmentation. So, is there evidence that the link between the surrounding elements is stronger than for, say, noun-noun compounds in English? If not, then I think a compound analysis would be preferable.
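To make the projectivity point concrete, here is a minimal sketch in Python. The helper name `is_projective` and the toy four-word sentence (with a hypothetical verb V as root) are assumptions for illustration only, not taken from any treebank. It checks whether any two arcs cross, counting the artificial root at position 0.

```python
def is_projective(heads):
    """heads maps each 1-based word index to its head index (0 = artificial root)."""
    # Represent every arc, including the root arc, as an ordered (left, right) span.
    arcs = [(min(dep, head), max(dep, head)) for dep, head in heads.items()]
    for a, b in arcs:
        for c, d in arcs:
            # Two arcs cross when exactly one endpoint of the second
            # lies strictly inside the first.
            if a < c < b < d:
                return False
    return True


# Words: 1 = xxx1, 2 = INFIX, 3 = xxx2, 4 = V (hypothetical root verb).
# Case A: the infix depends on the host: compound(xxx1, xxx2), advmod(xxx1, INFIX).
infix_on_host = {1: 4, 2: 1, 3: 1, 4: 0}
# Case B: the infix depends on something outside the host (here V).
infix_outside = {1: 4, 2: 4, 3: 1, 4: 0}

print(is_projective(infix_on_host))
print(is_projective(infix_outside))
```

In case B the arc from xxx1 to xxx2 spans the infix, whose head lies outside that span, which is exactly the non-projective configuration mentioned above.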
Another question is whether xxx1 and xxx2, when they appear without the INFIX, will still be analyzed as two words (compound(xxx1, xxx2)) or just as one word.
E.g. in English, we probably don't want to analyze abso-lutely as two syntactic words just because you can infix it in some slang varieties. :-)
Good point. It seems like a good principle would be to posit multiple words only when required by the (exceptional) circumstances.
The infix can come between both actual compounds and non-compounds. In the non-compound case for Cantonese, they're adjectives that are considered one word (some may be compounds historically, but some just happen to be bisyllabic and cannot be broken apart at all -- the example of 論盡 leon6zeon6 'clumsy' I gave above is one example).
So what it sounds like is: if there is no infix, just treat it as one word (assuming it is not a compound to begin with), but when the word occurs with the infix, split it apart and connect the two parts with compound?
Should there perhaps be a standardized sub-label? Just compound by itself (or using this label category by definition as an MWE type dependency) seems misleading if the word is non-compositional to begin with. It seems something like goeswith might have been great if it weren't dedicated to indicating tokenization errors.
Also, should the head of the not-really-compound that's split apart by an infix be the first part by default?
How large or how open is the set of possible infixes? If it is a closed set then we could also say that this is just an operation of morphology. Maybe unusual and Cantonese-specific, but still word-internal. If that is the solution then the lemma will be 論盡 for both 論盡 and 論鬼盡, and a language-specific morphological feature will be defined to annotate the infixed case. The feature could be something simple, e.g. Infix=Gwai2. I think that Finnish can serve as an example language having features of this kind, right, @fginter?
And in the interrogative case, we could combine Infix=Gwai2 with PronType=Int.
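As a rough illustration of how such a combination would be written out, here is a small Python sketch. The helper `format_feats` is hypothetical, and `Infix` is only the language-specific feature proposed in this thread, not an existing UD feature. The one thing the sketch relies on is the CoNLL-U convention that feature-value pairs are joined with `|` and sorted alphabetically by feature name, ignoring case.

```python
def format_feats(feats):
    """Render a dict of features as a CoNLL-U FEATS string ("_" when empty)."""
    if not feats:
        return "_"
    # CoNLL-U sorts features alphabetically by name, case-insensitively.
    pairs = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in pairs)


# Plain infixed adjective: the lemma stays 論盡, the infix is recorded as a feature.
print(format_feats({"Infix": "Gwai2"}))
# Infixed interrogative: the proposed feature combined with PronType=Int.
print(format_feats({"PronType": "Int", "Infix": "Gwai2"}))
```

Under this analysis the infixed form stays a single word, so nothing changes in the tree itself; only the FEATS column distinguishes 論鬼盡 from 論盡.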
Not a native speaker, but I don't think Finnish has infixes of this kind. @jmnybl can correct me.
I should have used more accurate wording. I did not necessarily mean Finnish has infixes. What I meant was that the Finnish UD data contains language-specific features whose values are actual Finnish morphemes, e.g. Clitic=Han (Ka, Kaan...) or Derivation=Lainen (Llinen, Minen...).
Ah, sorry, I misunderstood your question. Yes, we have those features for the various derivations and clitics. The tag is pretty much the derivation suffix / clitic itself. Easy to read and remember.
Thanks for all the responses! Currently we're not doing features or lemmas but perhaps we might reconsider (@kimgerdes ?). The inventory of infixes is indeed quite small (all with the same emphatic function of 鬼; they're basically all swear words and unlikely to show up in corpora).
Otherwise, the best alternative (?) seems to be to follow the Portuguese mesoclitic example of
1-2 xxxINFIXxxx
1 xxxxxx
2 INFIX
with the caveat that the tree itself would unfortunately not reflect the word form/order correctly.
A dependency tree is in principle unordered and the numerical order of indices does not encode word order
Sorry. Hit the wrong button. What I meant to say was that it is not assumed in general that concatenating the words in numerical order gives the original word order. It is precisely for discrepancies like these (among other things) that we have the multiword tokens.
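A small Python sketch of that point, using the simplified two-column layout (ID, FORM) from the example above rather than full ten-column CoNLL-U. The function name `split_rows` is hypothetical. The surface text is read from the multiword token line, while the syntactic words are the numbered lines it covers.

```python
def split_rows(rows):
    """Separate surface tokens from syntactic words in (ID, FORM) rows."""
    surface, words = [], []
    covered_until = 0  # highest word index covered by a multiword token so far
    for tok_id, form in rows:
        if "-" in tok_id:
            # Range line such as "1-2": it carries the original surface form.
            covered_until = int(tok_id.split("-")[1])
            surface.append(form)
        else:
            words.append(form)
            # A word outside any range is also its own surface token.
            if int(tok_id) > covered_until:
                surface.append(form)
    return surface, words


rows = [("1-2", "論鬼盡"), ("1", "論盡"), ("2", "鬼")]
surface, words = split_rows(rows)

print("".join(surface))  # text as written
print("".join(words))    # words concatenated in index order
```

The two printed strings differ, which is the discrepancy the multiword token line is there to absorb: the original form is recoverable from the range line even though the word indices do not follow surface order.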
The discussion on mesoclitics in Portuguese in #315 raises a similar question we had about how infixes should be treated in UD. From what I can gather from the Portuguese discussion, the solution for now would be simply to separate them like the following?
It seems the only unavoidable(?) drawback is that we would always have to display the original word/sentence alongside the tree because the tree itself would have the order jumbled.
In Cantonese we have the word 鬼 gwai2 'ghost, devil' which can be infixed into a bisyllabic, non-compositional word with an emphatic function (analogous to English 'fucking' in 'abso-fucking-lutely'). The interrogative 乜嘢 mat1je5 'what' can also be infixed, with a "What do you mean?" meaning:
To make things even more interesting, we can have nested infixes, with 鬼 gwai2 'ghost, devil' infixed inside 乜嘢 mat1je5 'what':
We haven't encountered this last case in our own data yet, but in theory, should it ideally be treated flat as in the following (rather than recursively, which I imagine would be too messy)?
Thanks!