Open spyysalo opened 7 years ago
Currently, http://universaldependencies.org/u/overview/tokenization.html reads in part
The UD annotation is based on a lexicalist view of syntax, which means that dependency relations hold between words [...] there is no attempt at segmenting words into morphemes. [...] the basic units of annotation are syntactic words (not phonological or orthographic words)
Joakim's Coling keynote included the following related heuristic:
What is a word?
- Single part-of-speech tag
- Real syntactic relation
Greg Pringle has written at length on the topic in the context of UD Japanese at http://www.cjvlang.com/Spicks/udjapanese.html . One specifically relevant part:
Linguists later fine-tuned the definition of words with further distributional criteria:
'Positional mobility (syntagmatic mobility)': the word is free to be used at different places in the sentence. For example, John will go can be transformed into Will John go?, indicating that will, John, and go are three separate words.
'Internal stability (internal immutability)': In contrast with the positional mobility that the word enjoys, morphemes within a word are fixed in order. For example, played is a stable unit that does not permit rearrangement as ed-play.
'Uninterruptability': it is not possible to insert anything between the morphemes of a word. For instance, it is not possible to insert anything between play and -ed (e.g., play-be-ed).
This is a great initiative. However, Pringle's characterisation does not quite work for syntactic words, because it seems to exclude clitics. This paper is very relevant as well: http://coltekin.net/cagri/papers/coltekin2016turcling.pdf
Thanks! @coltekin proposes the following criteria
The term inflectional group (IG) in Turkish natural language processing literature refers to a sub-word unit. [...] the unit has been a de facto standard for representing words in Turkish NLP. [...] The current paper proposes tokenizing a surface word into multiple IGs only in case one of the following is true. a. Parts of the word may have potentially conflicting inflectional features. b. Parts of the word may participate in different syntactic relations.
Clause a. only works for languages and words that inflect, but it can be generalized to postags, I think. Fusions are a case in point. Something like French "au" does not have conflicting inflectional features, because the preposition "à" does not inflect, but it has conflicting postags, ADP vs. DET.
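As a rough illustration of how criterion (a) differs from the proposed generalization to POS tags, here is a minimal Python sketch. The `Part` class, the `should_split` function, and the `generalize_to_upos` flag are hypothetical names invented for this example, not part of any UD tool. Criterion (b) is not modeled, and the feature values for "le" are a hand-written assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """One candidate sub-unit of a surface word (hypothetical structure)."""
    form: str
    upos: str
    feats: dict = field(default_factory=dict)


def conflicting_feats(a, b):
    """True if both parts carry the same feature with different values."""
    return any(k in b.feats and b.feats[k] != v for k, v in a.feats.items())


def should_split(parts, generalize_to_upos=False):
    """Criterion (a) only; optionally generalized to conflicting POS tags."""
    for i, a in enumerate(parts):
        for b in parts[i + 1:]:
            if conflicting_feats(a, b):
                return True
            if generalize_to_upos and a.upos != b.upos:
                return True
    return False


# French "au" = "à" + "le": the preposition carries no inflectional features.
au = [
    Part("à", "ADP"),
    Part("le", "DET", {"Definite": "Def", "Gender": "Masc", "Number": "Sing"}),
]

print(should_split(au))                           # criterion (a) alone
print(should_split(au, generalize_to_upos=True))  # with the POS generalization
```

Under these assumptions, criterion (a) alone does not license the split of "au", while the POS-tag version does, which is the point made above.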
Wrt clitics, I was wondering how we would explain the difference between clitics attached to a verb (Spanish dámelo, vámonos) and morphemes that encode agreement of the verb with one or more arguments (and which could be interpreted as hidden pronouns when the respective argument does not appear as an overt nominal). Spanish vamos itself could be said to encode nosotros vamos, and head-marking languages like Basque may even encode agreement with a second and third argument, but we are quite clear about UD not analyzing these as separate words ("do not annotate things that are not there"). While I agree that we should not annotate dropped subjects (or arguments in general), I would love to hear a sound rule that says why vámonos is different.
One clear difference I can think of right now is exactly the possibility (in the agreement version) that the argument also appears as a separate word. However, I am not sure that a guideline based on this would work in languages that have clitic doubling.
I agree that the distinction between pronominal clitics and inflectional endings that mark agreement is a bit fuzzy and that the former could develop into the latter historically. For the Spanish case, I would appeal to language-internal arguments about systematicity. Subject agreement is present in all verbs in Spanish, even with a separate overt subject, but "object agreement" is not. Therefore, when a morpheme referring to the object is glued to the verb, we avoid exceptions if we separate it from its host. How does that sound?
On a more general note: I think that the default practice in UD is that we take orthographic words (in languages where they exist) as units. Only when we see good reasons to treat certain cases as contractions of multiple syntactic words, we split them; but if they work reasonably well as one word, we keep them together even though a parallel expression in a related language is rendered as multiple words. And there is some flexibility to these decisions, as always.
Conversely, when we have a good reason to say that one syntactic word spans several orthographic words, we connect them using the fixed relation, except for the very limited cases where we are allowed to include a space in a word form.
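For the contraction case, CoNLL-U keeps the orthographic token on an ID-range line and lists the syntactic words on the lines that follow. The sketch below is a minimal reader that recovers both views. The fragment "au marché" is hand-written for illustration (not taken from a treebank), `read_units` is a hypothetical helper, and columns that are not needed here are left as `_`.

```python
# (ID, FORM, LEMMA, UPOS) for a hand-written fragment; "1-2" is the range line.
ROWS = [
    ("1-2", "au", "_", "_"),
    ("1", "à", "à", "ADP"),
    ("2", "le", "le", "DET"),
    ("3", "marché", "marché", "NOUN"),
]
# Serialize to 10 tab-separated columns, padding the unused ones with "_".
CONLLU = "\n".join("\t".join(row + ("_",) * 6) for row in ROWS)


def read_units(conllu):
    """Return (orthographic tokens, syntactic words) from CoNLL-U text."""
    surface, words = [], []
    covered_until = 0  # last word ID covered by the most recent range line
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:            # multiword token, e.g. "1-2"
            covered_until = int(tok_id.split("-")[1])
            surface.append(form)
        elif "." in tok_id:          # empty node, ignored in this sketch
            continue
        else:                        # ordinary syntactic word
            words.append(form)
            if int(tok_id) > covered_until:
                surface.append(form)
    return surface, words


tokens, words = read_units(CONLLU)
print(tokens)  # orthographic view
print(words)   # syntactic-word view
```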
@jnivre : sounds good to me, thanks. If this discussion results in a set of guidelines for UD-wordness, we should include the Spanish example with this explanation.
Stupid question: what's the rule against analyzing e.g. joined as join/VERB ed/AUX? Would it work the same if English didn't use space?
@spyysalo : irregular verbs? If the English writing system did not use space, it would still not be clear where the auxiliary begins in threw.
Nice :-) How might that look as a general rule? Are there other arguments that could apply?
There is a lot of literature in linguistics on distinguishing clitics from morphology. Among others, Arnold Zwicky has written a lot on this issue. Of course, as @jnivre notes, the results are not always crystal clear due to grammaticalization. There is an argument that some have pursued that Romance "clitics" should really be treated as morphology. So I suspect that @dan-zeman's "On a more general note" is really part of the answer for us. But if you'd like to read more about how words, clitics, and morphology have been argued to be distinguishable, here's a little reading list:
Awesome, thanks!
@manning: thanks, that's great! I'll dig in after recovering from Coling :-)
The reason I'm asking this particular question now is the proposed UD Japanese analysis of e.g. 食べた tabeta “ate” as 食べ/VERB た/AUX (with aux dependency: http://universaldependencies.org/ja/overview/syntax.html#special-clausal-dependents). 食べた strikes me as equally obviously a single-word past form verb as the English joined, and I'd like to know what arguments in the UD framework would support keeping this as one word.
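To make the two competing analyses concrete, the following Python sketch writes both as small tuples. `SPLIT` mirrors the proposed analysis described above (two words linked by `aux`). `SINGLE` is a hypothetical rendering of the one-word view using the UD feature `Tense=Past`. The lemma values and the `root` attachment are assumptions made for illustration, not treebank data.

```python
# Each word: (form, lemma, upos, feats, head, deprel)
SPLIT = [
    ("食べ", "食べる", "VERB", "_", 0, "root"),
    ("た", "た", "AUX", "_", 1, "aux"),
]
SINGLE = [
    ("食べた", "食べる", "VERB", "Tense=Past", 0, "root"),
]


def surface(words):
    """Concatenate forms; Japanese is written without spaces."""
    return "".join(w[0] for w in words)


def summarize(words):
    """Keep only form, UPOS, and relation for side-by-side comparison."""
    return [(w[0], w[2], w[5]) for w in words]


for name, analysis in (("split", SPLIT), ("single", SINGLE)):
    print(name, surface(analysis), len(analysis), summarize(analysis))
```

Both analyses cover the same surface string. They differ in whether past tense is carried by a separate AUX word or by a morphological feature on the verb.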
@spyysalo I think what you're looking for is what this paper calls "MUW" (Middle Unit Word):
http://www.lrec-conf.org/proceedings/lrec2016/pdf/122_Paper.pdf
@amir-zeldes: this work is certainly highly relevant (and discussed at length in http://www.cjvlang.com/Spicks/udjapanese.html, referenced above), but unfortunately even the longest "LUW" (Long Unit Word) in the paper splits 食べた into 食べ/VERB and た/AUX (see Figure 1).
You're right, sorry for the confusion! I agree, it's odd for this to be a token while other agglutinative morphemes (e.g. potential) are not.
@spyysalo, one strong reason to split 食べた (ate) into 食べ/VERB and た/AUX is that other elements such as a polite marker can be inserted between these two words (食べました - 食べ/VERB まし/AUX た/AUX). Considering English perfective and French passé composé, I think it is quite natural to represent a past form with two words.
@kanayamah, considering simple past in English as well as in a good many other languages, it is also quite natural to represent a past form with one word :-) Is the range of "words" that you can insert before た wider than just the politeness morpheme? Like in English "I have just eaten", "I have not eaten", "I have already eaten" (and the inserted thing is always indisputably an independent word).
Multiple affixes can attach to a stem in morphologically rich languages, but that does not necessarily warrant wordness of these affixes. For instance, the Czech adjective zvláštn-í "strange" ends with the morpheme í, which contributes the Case and Number features, and you can insert the comparative morpheme ějš before that (zvláštn-ějš-í "stranger"), but it does not imply that either ějš or í are independent words.
@kanayamah I think one of the reasons some people are uncomfortable with this but have fewer problems with French is that the French auxiliary can stand by itself: it is white-space separated, it looks identical to either the verb 'be' or 'have' as used outside of the past tense construction, and it can be separated from the lexical verb by adverbs: 'il est déjà arrivé'. The morpheme た behaves more like an inflectional marker: its position is completely predictable and it cannot be used independently.
I agree with the points raised by @dan-zeman and @amir-zeldes. Specifically regarding the argument given by @kanayamah, I don't think the occurrence of ます between 食べ and た in 食べました is a reason to analyze た as an independent (syntactic) word rather than as a morpheme, as we can either consider 食べました as a word or た as an inflectional morpheme on ます.
@kanayamah : to clarify my concern a bit: the currently proposed UD Japanese representation implies the radical claim that Japanese has (effectively) no inflectional morphology. To the best of my knowledge, this is a departure from both traditional and previous theoretical linguistic analyses of Japanese verb behavior.
Some of the published UD Japanese material suggests that you agree that the proposed (SUW) analysis segments into morphemes rather than syntactic words, e.g. http://www.lrec-conf.org/proceedings/lrec2016/pdf/122_Paper.pdf:
Short Unit Word (SUW): SUW is a minimal language unit that has a morphological function.
This kind of construction [...] is considered to be morphological (a word formation), rather than a syntactic relation. [regarding the derivation of e.g. かわいさ and 子どもっぽい]
However, the UD representation does not segment words into morphemes (see http://universaldependencies.org/u/overview/tokenization.html).
Are you using UD to represent the morphological structure of Japanese verbs (etc.), or does UD Japanese claim that Japanese verbs have no morphological structure?
@dan-zeman @spyysalo Many other expressions (aspect, modality, passive, causative, benefactive, etc.) can appear between 食べ and た. Their composition is very systematic (no exceptions like "threw"), and some expressions can appear multiple times (e.g. double negation) and sometimes can be coordinated. I don't see any reason not to segment them.
@spyysalo Modern syntactic theories consider these expressions as independent tokens. For example, the following book presents a comprehensive theory of Japanese syntax based on CCG, in which verb-following expressions are considered as independent tokens and given category S\S.
Bekki, Daisuke, 2010. Nihongo Bunpoo no Keesiki Riron: Katsuyoo taikee, Toogo koozoo, Imi goosee [Formal theory of Japanese grammar: The system of conjugation, syntactic structure, and semantic composition]. Japanese Frontier Series 24. Kurosio Publishers.
Anyway, I think we need a clear definition of "word" in the UD sense, which does not depend on the existence of spaces.
@miyao-yusuke : thanks for the quick response, your input is much appreciated! Some comments:
I think we need a clear definition of "word" in the UD sense, which does not depend on the existence of spaces.
Very much agreed! In the absence of more qualified volunteers, I've been trying to draft a rough proposal on this, but the more I read on attempts to define "word" cross-linguistically (e.g. Haspelmath 2011), the less confident I am...
no exceptions like "threw"
Wouldn't する → した, くる → きた, and いく → いった qualify as irregular past forms?
Modern syntactic theories consider these expressions as independent tokens. For example, [Bekki, Daisuke, 2010. Nihongo Bunpoo no Keesiki Riron: Katsuyoo taikee, Toogo koozoo, Imi goosee] presents a comprehensive theory of Japanese syntax based on CCG, in which verb-following expressions are considered as independent tokens
Thank you for the pointer! I could unfortunately not find a copy of 日本語文法の形式理論 (I take it) online, but will try to follow up on this. A couple of clarifying questions:
Is it OK to equate "token" with (syntactic) "word" here? That is, would it be correct to say Modern syntactic theories consider expressions such as た in 食べた as independent syntactic words?
Would it be correct to say that UD Japanese follows the approach of Bekki (2010) and rejects the view that Japanese verbs have past forms, analyzing past expression formation as a syntactic (rather than morphological) process?
Also, regarding the above statement from Tanaka et al. 2016, do you agree with
For example, the suffix さ sa changes an adjective into a noun [...] and っぽい ppoi changes a noun into an adjective [...] This kind of construction is very generative, and it is considered to be morphological (a word formation), rather than a syntactic relation.
i.e. to analyze the formation of e.g. かわいさ and 子どもっぽい as morphological rather than syntactic processes?
Wouldn't する → した, くる → きた, and いく → いった qualify as irregular past forms?
A popular analysis is that する inflects to さ, し, せ, する, すれ, etc., and function words/morphemes attach to one of these forms; e.g. し + ない, ている, た, よう, etc. くる and いく are similar.
Is it OK to equate "token" with (syntactic) "word" here? That is, would it be correct to say Modern syntactic theories consider expressions such as た in 食べた as independent syntactic words?
They avoid defining "word". The important message here is that the distinction between word and suffix/morpheme is not very clear nor meaningful in Japanese, and considering everything as "syntactic unit" can explain various complications and interactions with syntax.
Would it be correct to say that UD Japanese follows the approach of Bekki (2010) and rejects the view that Japanese verbs have past forms, analyzing past expression formation as a syntactic (rather than morphological) process?
We are not following this specific theory, but follow the definition of SUW. However, I think we share the fundamental idea. Function words/morphemes of Japanese cannot be clearly classified into word or suffix/morpheme. Some tend to behave more like words, while others are closer to suffixes, but most are somewhere in between. As far as I know, linguistics has not reached a clear conclusion on this so far.
I'm not objecting to the inflection-based analysis. I mean both (and other possibilities in between) are acceptable and I cannot find any clear reason to reject one. We simply selected one of them (mainly due to practical reasons).
For example, the suffix さ sa changes an adjective into a noun [...] and っぽい ppoi changes a noun into an adjective [...] This kind of construction is very generative, and it is considered to be morphological (a word formation), rather than a syntactic relation. i.e. to analyze the formation of e.g. かわいさ and 子どもっぽい as morphological rather than syntactic processes?
These specific examples might look like derivational morphemes (actually we had some discussions in the Japanese UD team about these constructions). However, they also have many characteristics of "wordness" (actually, many of the "wordness" criteria of Zwicky, Haspelmath, etc. can apply to them).
I think we should form a working group on universal guidelines for word segmentation, with representation from diverse language groups. I still think we can use the paper by Cagri on principles for Turkish as a starting point, but it obviously needs to be refined and extended to cover typologically different languages. My own superficial impression of the Japanese situation is that we need a compromise solution, where some of the contested items are regarded as words and others as bound morphemes, but this does obviously not carry a lot of weight.
@miyao-yusuke :
The important message here is that the distinction between word and suffix/morpheme is not very clear nor meaningful in Japanese, and considering everything as "syntactic unit" can explain various complications and interactions with syntax.
I appreciate that, and I understand the merits of segmenting down to morphemes and representing morphological and syntactic structures in a unified fashion.
However, UD does not (currently :-)) address word formation, and I'm concerned that not defining a word unit above the morpheme level will have a negative impact on the value of UD Japanese annotations for cross-linguistic use cases.
I'm not objecting to the inflection-based analysis. I mean both (and other possibilities in between) are acceptable and I cannot find any clear reason to reject one.
I'm very happy to hear that, and I hope we can establish a good argument in support of a choice soon!
@jnivre :
I think we should form a working group on universal guidelines for word segmentation, with representation from diverse language groups. [...] My own superficial impression of the Japanese situation is that we need a compromise solution, where some of the contested items are regarded as words and others as bound morphemes
+1 on both!
@jnivre :
I think we should form a working group on universal guidelines for word segmentation, with representation from diverse language groups.
Great idea! happy to join the working group and represent the Semitic angle.
I would also be happy to join the working group and represent the Indo-European languages with Perso-Arabic script. Word segmentation is a major challenge in languages like Persian, where the same "word" can be written in several different styles.
I'd be happy to represent the Finnic language family :-)
Following up on the discussion about UD Japanese, it seems clear that the segmentation of derivational morphology is not consistent with general UD principles, nor with practice in other UD treebanks, so I definitely think that should be changed. Similarly, when it comes to verbal morphology, all the descriptions so far point to this being a regular and systematic case of agglutination, similar to the situation in Turkish and Finnish, so it seems that an analysis using morphological features is most compelling there as well.
@jnivre : thank you for addressing this issue!
There is a lot of discussion of inflectional morphology above but less of derivational, so to illustrate, one specific problem with the proposed treatment of derivational morphology is shown in Tanaka et al. 2016 (page 4):
the suffix さ sa changes an adjective into a noun as in (5), and っぽい ppoi changes a noun into an adjective as in (6). [...] This kind of construction is very generative, and it is considered to be morphological (a word formation), rather than a syntactic relation.
mark(かわい, さ): かわい/ADJ さ/PART
mark(子ども, っぽい): 子ども/NOUN っぽい/PART
As discussed above, UD does not involve morphological segmentation. The proposed representation has the obvious problem that as the derived forms (かわいさ and 子どもっぽい) do not appear as words, it is not possible to capture their derived parts of speech (NOUN and ADJ, resp.) anywhere.
The derivation is directly analogous to e.g. the derivation of "childish" or "childlike" from "child" in English. In UD, derived forms like these are words, annotated with their relevant parts of speech.
If this is not clearly stated in the documentation at the moment, we could perhaps formulate a rule along the following lines: In UD, forms created by derivational morphological processes should be represented as words tagged with their derived parts of speech. The root word and morphemes involved in the derivation are not separately represented in the UD analysis.
In UD, forms created by derivational morphological processes should be represented as words tagged with their derived parts of speech. The root word and morphemes involved in the derivation are not separately represented in the UD analysis.
This statement requires the definition of "derivational morphological processes". さ and っぽい are different from -ish and -like because they can be attached to any phrase or any clause.
My point in the previous discussion was that the boundary between morphology and syntax is often unclear, and we need a clear definition.
My point in the previous discussion was that the boundary between morphology and syntax is often unclear, and we need a clear definition.
Understood. We're hoping to establish cross-linguistically acceptable UD guidelines in a working group on the topic, and it would be great to have someone representing Japanese. Would you mind participating?
happy to join of course.
The main point is that "changing a noun into an adjective" is not a syntactic relation, so there is no appropriate UD relation to put on an arc between these two elements. This is a clear indication that this should be treated as word formation, not syntax, and it is exactly parallel to the argument against the old "DERIV" links in the Turkish treebank.
I see. Maybe the original explanation in Tanaka et al. 2016 was not very accurate. さ and っぽい do not only change a POS but they add some meaning, in a similar way to auxiliary verbs (they also change a POS in some cases).
@miyao-yusuke :
happy to join of course.
Great, thanks!
I'll put together a brief document summarizing some of the proposals so far and will get back to everyone interested in the topic next week.
さ and っぽい do not only change a POS but they add some meaning
This does also hold for "-ish" and "-like" in English.
(Maybe of interest: Duncan, A Freezing Approach to the Ish-Construction in English)
Sure, but I mean they are not purely functional. Maybe they can be considered like "child-like" or "child like".
But what is the syntactic relation?
That's the problem, but it should be similar to "child like".
Question on (Finnish) derivation lemmas: We tag the derivational forms according to the real POS tag ("childish" is tagged as ADJ, not NOUN), but sometimes, if the Finnish morphological analyzer is not able to produce a lemma for the final derived form, we keep the original form in the lemma field (here for example it would be the NOUN "child"). I assume that this is not the correct approach and we plan to correct this in v2, so that the lemma would always be the derived form. Does this sound correct?
@jmnybl Yes, this sounds completely correct to me.
I do not know whether there is a more relevant issue opened on this, but I am wondering why rules for tokenization and word segmentation do not (routinely) allow for two or more graphic tokens to be univerbated for the sake of syntactic analysis (e.g., "in spite of"). The guidelines explicitly advise against that (while allowing univerbation for a few exceptional cases). I can imagine that this could add some formal complexity, but I would argue that it would be worth it for the sake of a consistent formal representation of the syntactic word.
PS: I also found this link (https://universaldependencies.org/v2/word-segmentation.html), but it does not work.
The guidelines define the relation fixed, which is used to connect parts of expressions like in spite of. You can think of that relation as technical means used to annotate (one particular type of) a syntactic word that consists of multiple orthographic words.
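As a minimal sketch of that reading, the Python snippet below regroups tokens joined by `fixed` into one unit. The fragment "in spite of the rain" and its annotation are hand-written for illustration, and `syntactic_words` is a hypothetical helper. It assumes that every `fixed` dependent attaches directly to the first word of the expression.

```python
# (id, form, upos, head, deprel) for a hand-written fragment.
TOKENS = [
    (1, "in", "ADP", 5, "case"),
    (2, "spite", "NOUN", 1, "fixed"),
    (3, "of", "ADP", 1, "fixed"),
    (4, "the", "DET", 5, "det"),
    (5, "rain", "NOUN", 0, "root"),
]


def syntactic_words(tokens):
    """Merge each `fixed` dependent into the unit headed by its head token."""
    groups = {}
    for tok_id, form, _upos, head, deprel in tokens:
        key = head if deprel == "fixed" else tok_id
        groups.setdefault(key, []).append((tok_id, form))
    # Order units by the ID of their first word, and parts by token ID.
    return [
        " ".join(form for _, form in sorted(parts))
        for _, parts in sorted(groups.items())
    ]


print(syntactic_words(TOKENS))
```

With the assumed annotation, the three orthographic words of "in spite of" come out as a single unit, while "the" and "rain" stay separate.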
PS: I also found this link (https://universaldependencies.org/v2/word-segmentation.html), but it does not work.
That page has been moved to https://universaldependencies.org/v2/segmentation.html. If you know where you saw the link you can fix it there. The page is part of "final reports" during the discussion before v2 guidelines were announced and I don't remember whether it is really compatible with v2 as they were approved. If there are any discrepancies, the Guidelines section on the website should rule. But this page may have some additional discussion of examples.
The guidelines define the relation fixed, which is used to connect parts of expressions like in spite of. You can think of that relation as technical means used to annotate (one particular type of) a syntactic word that consists of multiple orthographic words.
This treatment seems to be in line with that of other "higher-order" technical dependencies, such as, for example, coordination. I have not yet reflected enough on the latter, but, in any case, I think that technical dependencies at the level of identifying the syntactic word significantly affect the consistency of the formal representation, both crosslinguistically and intralinguistically. More generally, it seems to me that we try to accommodate the orthography of a language much more than we should. Could this kind of technical dependency also be a "side effect" of the CoNLL-U format? If there were standoff annotation, this problem could be circumvented easily, at least formally.
Recent discussions have suggested that the UD documentation could benefit from a more detailed definition of "word". We can use this issue to discuss the existing definition and possible improvements.
( @jnivre @dan-zeman @ftyers , others? )