Definition of word

spyysalo commented 7 years ago

Recent discussions have suggested that the UD documentation could benefit from a more detailed definition of "word". We can use this issue to discuss the existing definition and possible improvements.

( @jnivre @dan-zeman @ftyers , others? )

spyysalo commented 7 years ago

Currently, http://universaldependencies.org/u/overview/tokenization.html reads in part

The UD annotation is based on a lexicalist view of syntax, which means that dependency relations hold between words [...] there is no attempt at segmenting words into morphemes. [...] the basic units of annotation are syntactic words (not phonological or orthographic words)

spyysalo commented 7 years ago

Joakim's Coling keynote included the following related heuristic:

What is a word?

Single part-of-speech tag

Real syntactic relation

spyysalo commented 7 years ago

Greg Pringle has written at length on the topic in the context of UD Japanese at http://www.cjvlang.com/Spicks/udjapanese.html . One specifically relevant part:

Linguists later fine-tuned the definition of words with further distributional criteria:

'Positional mobility (syntagmatic mobility)': the word is free to be used at different places in the sentence. For example, John will go can be transformed into Will John go?, indicating that will, John, and go are three separate words.

'Internal stability (internal immutability)': In contrast with the positional mobility that the word enjoys, morphemes within a word are fixed in order. For example, played is a stable unit that does not permit rearrangement as ed-play.

'Uninterruptability': it is not possible to insert anything between the morphemes of a word. For instance, it is not possible to insert anything between play and -ed (e.g., play-be-ed).

jnivre commented 7 years ago

This is a great initiative. However, Pringle's characterisation does not quite work for syntactic words, because it seems to exclude clitics. This paper is very relevant as well: http://coltekin.net/cagri/papers/coltekin2016turcling.pdf

spyysalo commented 7 years ago

Thanks! @coltekin proposes the following criteria

The term inflectional group (IG) in Turkish natural language processing literature refers to a sub-word unit. [...] the unit has been a de facto standard for representing words in Turkish NLP. [...] The current paper proposes tokenizing a surface word into multiple IGs only in case one of the following is true. a. Parts of the word may have potentially conflicting inflectional features. b. Parts of the word may participate in different syntactic relations.

jnivre commented 7 years ago

Clause a. only works for languages and words that inflect, but it can be generalized to postags, I think. Fusions are a case in point. Something like French "au" does not have conflicting inflectional features, because the preposition "à" does not inflect, but it has conflicting postags, ADP vs. DET.
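The two clauses, together with the generalization to part-of-speech tags, can be sketched as a simple predicate. This is only an illustration: the `Part` structure, the function names, and the feature values are assumptions made for the example and do not come from the paper or from any UD tool. The input is the analyst's hypothesized analysis of the parts, and "different syntactic relations" is simplified to "more than one distinct relation label".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Part:
    """One candidate sub-unit of a surface word (hypothetical structure)."""
    form: str
    upos: Optional[str] = None      # None if the part has no POS tag of its own
    feats: dict = field(default_factory=dict)
    deprel: Optional[str] = None    # relation the part would bear, if any


def conflicting_feats(parts):
    """Clause a: some feature has different values on different parts."""
    seen = {}
    for part in parts:
        for name, value in part.feats.items():
            if seen.setdefault(name, value) != value:
                return True
    return False


def conflicting_upos(parts):
    """Generalization of clause a: the parts need different POS tags."""
    return len({p.upos for p in parts if p.upos is not None}) > 1


def different_relations(parts):
    """Clause b (simplified): the parts bear different relation labels."""
    return len({p.deprel for p in parts if p.deprel is not None}) > 1


def should_split(parts):
    """Split the surface word if any of the conditions holds."""
    return (conflicting_feats(parts)
            or conflicting_upos(parts)
            or different_relations(parts))


# French "au": no conflicting features, but ADP vs. DET.
au = [Part("à", "ADP", {}, "case"),
      Part("le", "DET", {"Definite": "Def", "PronType": "Art"}, "det")]
# English "joined", with the ending treated as a feature-bearing affix.
joined = [Part("join", "VERB"), Part("ed", feats={"Tense": "Past"})]

print(should_split(au))      # True
print(should_split(joined))  # False
```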

dan-zeman commented 7 years ago

Wrt clitics, I was wondering how we would explain the difference between clitics attached to a verb (Spanish dámelo, vámonos) and morphemes that encode agreement of the verb with one or more arguments (and which could be interpreted as hidden pronouns when the respective argument does not appear as an overt nominal). Spanish vamos itself could be said to encode nosotros vamos, and head-marking languages like Basque may even encode agreement with a second and third argument, but we are quite clear about UD not analyzing these as separate words ("do not annotate things that are not there"). While I agree that we should not annotate dropped subjects (or arguments in general), I would love to hear a sound rule that says why vámonos is different.

One clear difference I can think of right now is exactly the possibility (in the agreement version) that the argument also appears as a separate word. However, I am not sure that a guideline based on this would work in languages that have clitic doubling.

jnivre commented 7 years ago

I agree that the distinction between pronominal clitics and inflectional endings that mark agreement is a bit fuzzy and that the former could develop into the latter historically. For the Spanish case, I would appeal to language-internal arguments about systematicity. Subject agreement is present in all verbs in Spanish, even with a separate overt subject, but "object agreement" is not. Therefore, when a morpheme referring to the object is glued to the verb, we avoid exceptions if we separate it from its host. How does that sound?

dan-zeman commented 7 years ago

On a more general note: I think that the default practice in UD is that we take orthographic words (in languages where they exist) as units. Only when we see good reasons to treat certain cases as contractions of multiple syntactic words do we split them; if they work reasonably well as one word, we keep them together even though a parallel expression in a related language is rendered as multiple words. And there is some flexibility to these decisions, as always.

Conversely, when we have a good reason to say that one syntactic word spans several orthographic words, we connect them using the fixed relation, except for the very limited cases where we are allowed to include a space in a word form.
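For the first case (one orthographic word split into several syntactic words), CoNLL-U keeps both levels: a range line such as `1-2` carries the surface token, and the numbered lines below it carry the syntactic words. The following is a minimal sketch of reading both levels back out. The sentence fragment and the `read_units` helper are made up for illustration, and the parser handles only the columns it needs.

```python
# Hand-written CoNLL-U fragment for French "au cinéma" (illustrative only).
CONLLU = """\
1-2\tau\t_\t_\t_\t_\t_\t_\t_\t_
1\tà\tà\tADP\t_\t_\t3\tcase\t_\t_
2\tle\tle\tDET\t_\t_\t3\tdet\t_\t_
3\tcinéma\tcinéma\tNOUN\t_\t_\t0\troot\t_\t_
"""


def read_units(conllu):
    """Return (surface tokens, syntactic words) for one CoNLL-U sentence."""
    words = {}    # word id -> (form, upos)
    ranges = []   # (first id, last id, surface form) of multiword tokens
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0]:                      # multiword token line
            first, last = map(int, cols[0].split("-"))
            ranges.append((first, last, cols[1]))
        elif "." in cols[0]:                    # empty node, not a word
            continue
        else:                                   # ordinary syntactic word
            words[int(cols[0])] = (cols[1], cols[3])

    covered = {i for first, last, _ in ranges for i in range(first, last + 1)}
    starts = {first: form for first, _, form in ranges}
    surface = []
    for i in sorted(words):
        if i in starts:
            surface.append(starts[i])           # emit the contraction once
        elif i not in covered:
            surface.append(words[i][0])
    return surface, [words[i] for i in sorted(words)]


surface, syntactic = read_units(CONLLU)
print(surface)    # ['au', 'cinéma']
print(syntactic)  # [('à', 'ADP'), ('le', 'DET'), ('cinéma', 'NOUN')]
```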

dan-zeman commented 7 years ago

@jnivre : sounds good to me, thanks. If this discussion results in a set of guidelines for UD-wordness, we should include the Spanish example with this explanation.

spyysalo commented 7 years ago

Stupid question: what's the rule against analyzing e.g. joined as join/VERB ed/AUX? Would it work the same if English didn't use space?

dan-zeman commented 7 years ago

@spyysalo : irregular verbs? If the English writing system did not use space, it would still not be clear where the auxiliary begins in threw.

spyysalo commented 7 years ago

Nice :-) How might that look as a general rule? Are there other arguments that could apply?

manning commented 7 years ago

There is a lot of literature in linguistics on distinguishing clitics from morphology. Among others, Arnold Zwicky has written a lot on this issue. Of course, as @jnivre notes, the results are not always crystal clear due to grammaticalization. Some have argued that Romance "clitics" should really be treated as morphology. So I suspect that @dan-zeman's "On a more general note" is really part of the answer for us. But if you'd like to read more about how words, clitics, and morphology have been argued to be distinguishable, here's a little reading list:

Zwicky/Pullum 1983
Zwicky 1985
Anderson Short encyclopedia article
Holmstedt/Dresher Longer encyclopedia article
Good slides from Rik van Gijn

dan-zeman commented 7 years ago

Awesome, thanks!

spyysalo commented 7 years ago

@manning: thanks, that's great! I'll dig in after recovering from Coling :-)

The reason I'm asking this particular question now is the proposed UD Japanese analysis of e.g. 食べた tabeta “ate” as 食べ/VERB た/AUX (with aux dependency: http://universaldependencies.org/ja/overview/syntax.html#special-clausal-dependents). 食べた strikes me as equally obviously a single-word past form verb as the English joined, and I'd like to know what arguments in the UD framework would support keeping this as one word.

amir-zeldes commented 7 years ago

@spyysalo I think what you're looking for is what this paper calls "MUW" (Middle Unit Word):

http://www.lrec-conf.org/proceedings/lrec2016/pdf/122_Paper.pdf

spyysalo commented 7 years ago

@amir-zeldes: this work is certainly highly relevant (and discussed at length in http://www.cjvlang.com/Spicks/udjapanese.html, referenced above), but unfortunately even the longest "LUW" (Long Unit Word) in the paper splits 食べた into 食べ/VERB and た/AUX (see Figure 1).

amir-zeldes commented 7 years ago

You're right, sorry for the confusion! I agree, it's odd for this to be a token while other agglutinative morphemes (e.g. potential) are not.

kanayamah commented 7 years ago

@spyysalo, one strong reason to split 食べた (ate) into 食べ/VERB and た/AUX is that other elements, such as a polite marker, can be inserted between these two words (食べました - 食べ/VERB まし/AUX た/AUX). Considering the English perfect and the French passé composé, I think it is quite natural to represent a past form with two words.

dan-zeman commented 7 years ago

@kanayamah, considering the simple past in English as well as in a good many other languages, it is also quite natural to represent a past form with one word :-) Is the range of "words" that you can insert before た wider than just the politeness morpheme? Like in English "I have just eaten", "I have not eaten", "I have already eaten" (and the inserted thing is always indisputably an independent word).

Multiple affixes can attach to a stem in morphologically rich languages, but that does not necessarily warrant wordness of these affixes. For instance, the Czech adjective zvláštn-í "strange" ends with the morpheme í, which contributes the Case and Number features, and you can insert the comparative morpheme ějš before that (zvláštn-ějš-í "stranger"), but it does not imply that either ějš or í is an independent word.

amir-zeldes commented 7 years ago

@kanayamah I think one of the reasons some people are uncomfortable with this but have fewer problems with French is that the French auxiliary can stand by itself: it is white-space separated, it looks identical to either the verb 'be' or 'have' as used outside of the past tense construction, and it can be separated from the lexical verb by adverbs: 'il est déjà arrivé'. The morpheme た behaves more like an inflectional marker: its position is completely predictable and it cannot be used independently.

spyysalo commented 7 years ago

I agree with the points raised by @dan-zeman and @amir-zeldes. Specifically regarding the argument given by @kanayamah, I don't think the occurrence of ます between 食べ and た in 食べました is a reason to analyze た as an independent (syntactic) word rather than as a morpheme, as we can either consider 食べました as a word or た as an inflectional morpheme on ます.

spyysalo commented 7 years ago

@kanayamah : to clarify my concern a bit: the currently proposed UD Japanese representation implies the radical claim that Japanese has (effectively) no inflectional morphology. To the best of my knowledge, this is a departure from both traditional and previous theoretical linguistic analyses of Japanese verb behavior.

Some of the published UD Japanese material suggests that you agree that the proposed (SUW) analysis segments into morphemes rather than syntactic words, e.g. http://www.lrec-conf.org/proceedings/lrec2016/pdf/122_Paper.pdf:

Short Unit Word (SUW): SUW is a minimal language unit that has a morphological function.

This kind of construction [...] is considered to be morphological (a word formation), rather than a syntactic relation. [regarding the derivation of e.g. かわいさ and 子どもっぽい]

However, the UD representation does not segment words into morphemes (see http://universaldependencies.org/u/overview/tokenization.html).

Are you using UD to represent the morphological structure of Japanese verbs (etc.), or does UD Japanese claim that Japanese verbs have no morphological structure?

miyao-yusuke commented 7 years ago

@dan-zeman @spyysalo Many other expressions (aspect, modality, passive, causative, benefactive, etc.) can appear between 食べ and た. Their composition is very systematic (no exceptions like "threw"), and some expressions can appear multiple times (e.g. double negation) and sometimes can be coordinated. I don't see any reason not to segment them.

@spyysalo Modern syntactic theories consider these expressions as independent tokens. For example, the following book presents a comprehensive theory of Japanese syntax based on CCG, in which verb-following expressions are considered as independent tokens and given category S\S.

Bekki, Daisuke, 2010. Nihongo Bunpoo no Keesiki Riron: Katsuyoo taikee, Toogo koozoo, Imi goosee [Formal theory of Japanese grammar: The system of conjugation, syntactic structure, and semantic composition]. Japanese Frontier Series 24. Kurosio Publishers.

Anyway, I think we need a clear definition of "word" in the UD sense, which does not depend on the existence of spaces.

spyysalo commented 7 years ago

@miyao-yusuke : thanks for the quick response, your input is much appreciated! Some comments:

I think we need a clear definition of "word" in the UD sense, which does not depend on the existence of spaces.

Very much agreed! In the absence of more qualified volunteers, I've been trying to draft a rough proposal on this, but the more I read on attempts to define "word" cross-linguistically (e.g. Haspelmath 2011), the less confident I am...

no exceptions like "threw"

Wouldn't する → した, くる → きた, and いく → いった qualify as irregular past forms?

Modern syntactic theories consider these expressions as independent tokens. For example, [Bekki, Daisuke, 2010. Nihongo Bunpoo no Keesiki Riron: Katsuyoo taikee, Toogo koozoo, Imi goosee] presents a comprehensive theory of Japanese syntax based on CCG, in which verb-following expressions are considered as independent tokens

Thank you for the pointer! I could unfortunately not find a copy of 日本語文法の形式理論 (I take it) online, but will try to follow up on this. A couple of clarifying questions:

Is it OK to equate "token" with (syntactic) "word" here? That is, would it be correct to say Modern syntactic theories consider expressions such as た in 食べた as independent syntactic words?
Would it be correct to say that UD Japanese follows the approach of Bekki (2010) and rejects the view that Japanese verbs have past forms, analyzing past expression formation as a syntactic (rather than morphological) process?

Also, regarding the above statement from Tanaka et al. 2016, do you agree with

For example, the suffix さ sa changes an adjective into a noun [...] and っぽい ppoi changes a noun into an adjective [...] This kind of construction is very generative, and it is considered to be morphological (a word formation), rather than a syntactic relation.

i.e. to analyze the formation of e.g. かわいさ and 子どもっぽい as morphological rather than syntactic processes?

miyao-yusuke commented 7 years ago

Wouldn't する → した, くる → きた, and いく → いった qualify as irregular past forms?

The popular analysis is that する inflects to さ, し, せ, する, すれ, etc., and function words/morphemes attach to one of these forms; e.g. し + ない, ている, た, よう, etc. くる and いく are similar.

Is it OK to equate "token" with (syntactic) "word" here? That is, would it be correct to say Modern syntactic theories consider expressions such as た in 食べた as independent syntactic words?

They avoid defining "word". The important message here is that the distinction between word and suffix/morpheme is not very clear nor meaningful in Japanese, and considering everything as "syntactic unit" can explain various complications and interactions with syntax.

Would it be correct to say that UD Japanese follows the approach of Bekki (2010) and rejects the view that Japanese verbs have past forms, analyzing past expression formation as a syntactic (rather than morphological) process?

We are not following this specific theory, but follow the definition of SUW. However, I think we share the fundamental idea. Function words/morphemes of Japanese cannot be clearly classified as either word or suffix/morpheme. Some tend to behave more like words, while others are closer to suffixes, but most are somewhere in between. As far as I know, linguistics has not reached a clear conclusion so far.

I'm not objecting to the inflection-based analysis. I mean both (and other possibilities in between) are acceptable and I cannot find any clear reason to reject one. We simply selected one of them (mainly due to practical reasons).

For example, the suffix さ sa changes an adjective into a noun [...] and っぽい ppoi changes a noun into an adjective [...] This kind of construction is very generative, and it is considered to be morphological (a word formation), rather than a syntactic relation. i.e. to analyze the formation of e.g. かわいさ and 子どもっぽい as morphological rather than syntactic processes?

These specific examples might look like derivational morphemes (actually we had some discussions in the Japanese UD team about these constructions). However, they also have many characteristics of "wordness" (actually, many of the "wordness" criteria of Zwicky, Haspelmath, etc. can apply to them).

jnivre commented 7 years ago

I think we should form a working group on universal guidelines for word segmentation, with representation from diverse language groups. I still think we can use the paper by Cagri on principles for Turkish as a starting point, but it obviously needs to be refined and extended to cover typologically different languages. My own superficial impression of the Japanese situation is that we need a compromise solution, where some of the contested items are regarded as words and others as bound morphemes, but this does obviously not carry a lot of weight.

spyysalo commented 7 years ago

@miyao-yusuke :

The important message here is that the distinction between word and suffix/morpheme is not very clear nor meaningful in Japanese, and considering everything as "syntactic unit" can explain various complications and interactions with syntax.

I appreciate that, and I understand the merits of segmenting down to morphemes and representing morphological and syntactic structures in a unified fashion.

However, UD does not (currently :-)) address word formation, and I'm concerned that not defining a word unit above the morpheme level will have a negative impact on the value of UD Japanese annotations for cross-linguistic use cases.

I'm not objecting to the inflection-based analysis. I mean both (and other possibilities in between) are acceptable and I cannot find any clear reason to reject one.

I'm very happy to hear that, and I hope we can establish a good argument in support of a choice soon!

@jnivre :

I think we should form a working group on universal guidelines for word segmentation, with representation from diverse language groups. [...] My own superficial impression of the Japanese situation is that we need a compromise solution, where some of the contested items are regarded as words and others as bound morphemes

+1 on both!

rtsarfaty commented 7 years ago

@jnivre :

I think we should form a working group on universal guidelines for word segmentation, with representation from diverse language groups.

Great idea! happy to join the working group and represent the Semitic angle.

mojgan-seraji commented 7 years ago

I would also be happy to join the working group and represent the Indo-European languages with Perso-Arabic script. Word segmentation is a major challenge in languages like Persian because of the different styles in which "words" can be written.

spyysalo commented 7 years ago

I'd be happy to represent the Finnic language family :-)

jnivre commented 7 years ago

Following up on the discussion about UD Japanese, it seems clear that the segmentation of derivational morphology is not consistent with general UD principles, nor with practice in other UD treebanks, so I definitely think that should be changed. Similarly, when it comes to verbal morphology, all the descriptions so far point to this being a regular and systematic case of agglutination, similar to the situation in Turkish and Finnish, so it seems that an analysis using morphological features is most compelling there as well.

spyysalo commented 7 years ago

@jnivre : thank you for addressing this issue!

There is a lot of discussion of inflectional morphology above but less of derivational, so to illustrate, one specific problem with the proposed treatment of derivational morphology is shown in Tanaka et al. 2016 (page 4):

the suffix さ sa changes an adjective into a noun as in (5), and っぽい ppoi changes a noun into an adjective as in (6). [...] This kind of construction is very generative, and it is considered to be morphological (a word formation), rather than a syntactic relation.

かわい/ADJ さ/PART, with mark(かわい, さ)

子ども/NOUN っぽい/PART, with mark(子ども, っぽい)

As discussed above, UD does not involve morphological segmentation. The proposed representation has the obvious problem that, because the derived forms (かわいさ and 子どもっぽい) do not appear as words, their derived parts of speech (NOUN and ADJ, resp.) cannot be captured anywhere.

The derivation is directly analogous to e.g. the derivation of "childish" or "childlike" from "child" in English. In UD, derived forms like these are words, annotated with their relevant parts of speech.

If this is not clearly stated in the documentation at the moment, we could perhaps formulate a rule along the following lines: In UD, forms created by derivational morphological processes should be represented as words tagged with their derived parts of speech. The root word and morphemes involved in the derivation are not separately represented in the UD analysis.
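As a rough illustration of what such a rule would mean for the two examples above, the sketch below merges a derivational suffix into the preceding token and assigns the derived part of speech. The `DERIVED_UPOS` table and the `merge_derivations` helper are hypothetical and exist only for this example. The table contains just the two suffixes discussed here, and the sketch covers only the case where the suffix attaches to a single preceding token.

```python
# Hypothetical table: derivational suffix -> POS of the derived word.
# Only the two suffixes discussed in this thread are listed.
DERIVED_UPOS = {"さ": "NOUN", "っぽい": "ADJ"}


def merge_derivations(tokens):
    """Merge (form, upos) pairs so that a listed suffix joins the token
    before it and the result carries the derived part of speech."""
    merged = []
    for form, upos in tokens:
        if form in DERIVED_UPOS and merged:
            base_form, _ = merged.pop()
            merged.append((base_form + form, DERIVED_UPOS[form]))
        else:
            merged.append((form, upos))
    return merged


print(merge_derivations([("かわい", "ADJ"), ("さ", "PART")]))
# [('かわいさ', 'NOUN')]
print(merge_derivations([("子ども", "NOUN"), ("っぽい", "PART")]))
# [('子どもっぽい', 'ADJ')]
```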

miyao-yusuke commented 7 years ago

In UD, forms created by derivational morphological processes should be represented as words tagged with their derived parts of speech. The root word and morphemes involved in the derivation are not separately represented in the UD analysis.

This statement requires a definition of "derivational morphological processes". さ and っぽい are different from -ish and -like because they can be attached to any phrase or any clause.

My point in the previous discussion was that the boundary between morphology and syntax is often unclear, and we need a clear definition.

spyysalo commented 7 years ago

My point in the previous discussion was that the boundary between morphology and syntax is often unclear, and we need a clear definition.

Understood. We're hoping to establish cross-linguistically acceptable UD guidelines in a working group on the topic, and it would be great to have someone representing Japanese. Would you mind participating?

miyao-yusuke commented 7 years ago

happy to join of course.

jnivre commented 7 years ago

The main point is that "changing a noun into an adjective" is not a syntactic relation, so there is no appropriate UD relation to put on an arc between these two elements. This is a clear indication that this should be treated as word formation, not syntax, and it is exactly parallel to the argument against the old "DERIV" links in the Turkish treebank.

miyao-yusuke commented 7 years ago

I see. Maybe the original explanation in Tanaka et al. 2016 was not very accurate. さ and っぽい do not only change a POS but they add some meaning, in a similar way to auxiliary verbs (they also change a POS in some cases).

spyysalo commented 7 years ago

@miyao-yusuke :

happy to join of course.

Great, thanks!

I'll put together a brief document summarizing some of the proposals so far and will get back to everyone interested in the topic next week.

spyysalo commented 7 years ago

さ and っぽい do not only change a POS but they add some meaning

This also holds for "-ish" and "-like" in English.

(Maybe of interest: Duncan, "A Freezing Approach to the Ish-Construction in English")

miyao-yusuke commented 7 years ago

Sure, but I mean they are not purely functional. Maybe they can be considered like "child-like" or "child like".

jnivre commented 7 years ago

But what is the syntactic relation?

miyao-yusuke commented 7 years ago

That's the problem, but it should be similar to "child like".

jmnybl commented 7 years ago

Question on (Finnish) derivation lemmas: We tag the derivational forms according to the real POS tag ("childish" is tagged as ADJ, not NOUN), but sometimes, if the Finnish morphological analyzer is not able to produce a lemma for the final derived form, we keep the original form in the lemma field (here, for example, it would be the NOUN "child"). I assume that this is not the correct approach and we plan to correct this in v2, so that the lemma would always be the derived form. Does this sound correct?

jnivre commented 7 years ago

@jmnybl Yes, this sounds completely correct to me.

gcelano commented 5 years ago

I do not know whether there is a more relevant issue opened on this, but I am wondering why rules for tokenization and word segmentation do not (routinely) allow for two or more graphic tokens to be univerbated for the sake of syntactic analysis (e.g., "in spite of"). The guidelines explicitly advise against that (while allowing univerbation for a few exceptional cases). I can imagine that this could add some formal complexity, but I would argue that it would be worth it for the sake of a consistent formal representation of the syntactic word.

PS: I also found this link (https://universaldependencies.org/v2/word-segmentation.html), but it does not work.

dan-zeman commented 5 years ago

The guidelines define the relation fixed, which is used to connect parts of expressions like in spite of. You can think of that relation as a technical means used to annotate (one particular type of) a syntactic word that consists of multiple orthographic words.
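To make this concrete, here is a small sketch that reads a hand-written CoNLL-U fragment and collapses each fixed expression back into one unit. The fragment and the `group_fixed` helper are illustrative assumptions, not part of any UD tool. The sketch relies on the UD convention that the first word of a fixed expression is the head and the later words attach to it.

```python
# Hand-written CoNLL-U fragment for "in spite of the weather" (illustrative).
CONLLU = """\
1\tin\tin\tADP\t_\t_\t5\tcase\t_\t_
2\tspite\tspite\tNOUN\t_\t_\t1\tfixed\t_\t_
3\tof\tof\tADP\t_\t_\t1\tfixed\t_\t_
4\tthe\tthe\tDET\t_\t_\t5\tdet\t_\t_
5\tweather\tweather\tNOUN\t_\t_\t0\troot\t_\t_
"""


def group_fixed(conllu):
    """Return the sentence as a list of units, where every word attached
    with `fixed` is joined to the unit of its head."""
    rows = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():       # skip multiword-token and empty-node lines
            continue
        rows.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))

    groups = {}                         # id of the first word -> list of forms
    for word_id, form, head, deprel in rows:
        if deprel.split(":")[0] == "fixed":
            groups[head].append(form)   # head precedes its fixed dependents
        else:
            groups[word_id] = [form]
    return [" ".join(groups[key]) for key in sorted(groups)]


print(group_fixed(CONLLU))  # ['in spite of', 'the', 'weather']
```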

dan-zeman commented 5 years ago

PS: I also found this link (https://universaldependencies.org/v2/word-segmentation.html), but it does not work.

That page has been moved to https://universaldependencies.org/v2/segmentation.html. If you know where you saw the link, you can fix it there. The page is one of the "final reports" from the discussion before the v2 guidelines were announced, and I don't remember whether it is really compatible with v2 as approved. If there are any discrepancies, the Guidelines section on the website should rule. But this page may have some additional discussion of examples.

gcelano commented 5 years ago

The guidelines define the relation fixed, which is used to connect parts of expressions like in spite of. You can think of that relation as a technical means used to annotate (one particular type of) a syntactic word that consists of multiple orthographic words.

This treatment seems to be in line with that of other "higher-order" technical dependencies, such as coordination. I have not yet reflected enough on the latter, but, in any case, I think that technical dependencies at the level of identifying the syntactic word significantly affect the consistency of the formal representation, both crosslinguistically and intralinguistically. More generally, it seems to me that we try to accommodate the orthography of a language much more than we should. Could this kind of technical dependency also be a "side effect" of the CoNLL-U format? If there were standoff annotation, this problem could be - at least formally - circumvented easily.

UniversalDependencies / docs

Definition of word #377