UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Definition of word #377

Open spyysalo opened 7 years ago

spyysalo commented 7 years ago

Recent discussions have suggested that the UD documentation could benefit from a more detailed definition of "word". We can use this issue to discuss the existing definition and possible improvements.

( @jnivre @dan-zeman @ftyers , others? )

akoehn commented 5 years ago

Could this kind of technical dependency also be a "side effect" of the CoNLL-U format?

No, it is – from a technical standpoint – quite possible to have whitespace in a token.

We have whitespace in tokens in the Hamburg Dependency Treebank (and the WIP UD conversion) in some rare cases, e.g. “Michael Jackson-Fans”. This is treated as a single word because the modifier “-Fans” applies to “Michael Jackson” as a whole and not just to his last name. The canonical version would be “Michael-Jackson-Fans”, but if the author of a text used the variant without Durchkopplung (i.e. without hyphenating the whole compound), we can’t do anything about it. Connecting “Michael” and “Jackson-Fans” with fixed would be incorrect IMO.

jnivre commented 5 years ago

Just for completeness: Another option that is sometimes relevant is the "goeswith" relation, which can be used, for example, if a word like "furthermore" is written "further more".
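
A minimal CoNLL-U sketch of how that case would look under the goeswith guidelines (a constructed fragment; the token IDs, the head index of "further", and the lemma columns are illustrative):

```
3	further	furthermore	ADV	_	Typo=Yes	5	advmod	_	_
4	more	_	X	_	_	3	goeswith	_	_
```

The first part carries the analysis of the intended word "furthermore"; the remaining part attaches to it with goeswith and receives the UPOS X.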

In general, I agree that the use of "technical" relations can be a problem, especially for users who are not familiar with UD and assume that every relation is a true syntactic relation. It is the price we have to pay for encoding the whole (basic) syntactic structure as a simple spanning tree over words. Eliminating the "technical" relations would require a richer ontology.

martinpopel commented 5 years ago

I think the current solution (with the fixed relation) actually improves the consistency and usability of UD. It just requires users to be aware of the guidelines, but that is true for any guidelines and any project.

As for “Michael Jackson-Fans”: why not

flat:name(Michael, Jackson)
punct(Michael, -)
nmod(fans, Michael)

? I mean that flat (similarly to fixed) means the whole phrase acts as a single syntactic unit with no inner dependencies (as the term flat suggests), so choosing the first word as head is just a technical way to annotate it consistently. If there were any modifier of the first word ("Michael" in our case), it would automatically be considered a modifier of the whole fixed phrase (multi-word expression).
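
In CoNLL-U terms, that proposal would come out roughly as follows (a constructed fragment; the POS tags and the attachment of "Fans" as root are purely illustrative):

```
1	Michael	Michael	PROPN	_	_	4	nmod	_	_
2	Jackson	Jackson	PROPN	_	_	1	flat:name	_	SpaceAfter=No
3	-	-	PUNCT	_	_	1	punct	_	SpaceAfter=No
4	Fans	Fan	NOUN	_	_	0	root	_	_
```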

sylvainkahane commented 5 years ago

Word segmentation is part of the syntactic analysis. It is not possible to decide, independently of syntactic study, what the words of a language are. The great advantage of a purely formal tokenisation (for instance based on orthography, for languages which have such a tradition) is that it allows us to postpone this decision until we annotate and carry out the syntactic analysis. If you proceed with a double-blind annotation (as is expected for a gold-standard treebank), it could be problematic to impose a word segmentation.

@gcelano "in spite of" is certainly not a word, whatever is your definition of a word. Even its status as a MWE is debatable, because "of" is part of the subcategorization frame (Fr. régime) of "in spite".

Regarding the treatment of MWEs in UD, we made some proposals at UDW17: Multi-word annotation in syntactic treebanks - Propositions for Universal Dependencies.

akoehn commented 5 years ago

As for “Michael Jackson-Fans”: why not flat:name(Michael, Jackson) punct(Michael, -) nmod(fans, Michael)

Then we would have to analyze "Michael-Jackson-Fans" and all other words with Durchkopplung in a similar fashion. We would then analyze "Foo-Bar" as flat(Foo, Bar), punct(Foo, -), but "Foobar" (which is just an alternative spelling of "Foo-Bar") as a single word. One could continue down that path and also analyze "Foobar" as two tokens, but I think we all agree that this is not desirable.

This way, we special-case the very rare occurrence of "Michael Jackson-Fans". Otherwise we would get weird analyses in a lot of cases.

gcelano commented 5 years ago

If you proceed with a double-blind annotation (as it is expected for a gold-standard treebank), it could be problematic to impose a word segmentation.

In UD a word segmentation is already "imposed" after whitespace-based tokenization, in that graphic words can be regularly split.

@gcelano "in spite of" is certainly not a word, whatever is your definition of a word. Even its status as a MWE is debatable, because "of" is part of the subcategorization frame (Fr. régime) of "in spite".

I take "in spite of" as a syntactic word (on a par with "despite"). Of course, the idea is that one acknowledges that it now works as a unit (vs its more etymological analysis).

gcelano commented 5 years ago

No, it is – from a technical standpoint – quite possible to have whitespace in a token.

We have whitespace in tokens in the Hamburg Dependency Treebank (and the WIP UD conversion) in some rare cases, e.g. “Michael Jackson-Fans”. This is treated as a single word because the modifier “-Fans” applies to “Michael Jackson” as a whole and not just his last name. The canonical version would be “Michael-Jackson-Fans” but if the author of a text used the variant without Durchkopplung, we can’t do anything about it. Connecting “Michael” and “Jackson-Fans” with fixed would be incorrect IMO.

@akoehn, even if you (exceptionally) allow whitespace within a syntactic word, your example still looks to me like a workaround dictated by the limitations of the format. Of course, one could potentially take "Michael(-)Jackson-Fans" as a single concept/syntactic word, but frankly, comparing it to other similar cases, I would here identify "Michael Jackson" and "fans" as separate syntactic words.

akoehn commented 5 years ago

Let me preface this by saying that a) I'm not a specialist on word segmentation and b) these decisions were made by others long before my time with the HDT.

@gcelano, the question is up to which level structure should be annotated (and I don’t think that this is a technical limitation; it is a decision made by UD). We took the (I think reasonable) approach of not going down to the sub-token level. E.g., “Blumentopferde” ('potting soil') is not subdivided (it is a single token). Similarly, the variant “Blumentopf-Erde” is also not subdivided; it means the same thing, and if we subdivided it, it would be logical to also split “Blumentopferde”. On the other hand, if someone wrote “Blumentopf Erde” (which is incorrect spelling), we would treat it as two tokens.

This is mainly a question of where to stop; we decided to stop rather early with subdivision. If there is consensus amongst UD people that whitespace should always trigger a new token then I am fine with it. I have no strong opinions in either direction.

gcelano commented 5 years ago

@akoehn, the difference between “Blumentopferde” and “Blumentopf Erde” pertains only to orthography, not to syntax. Therefore, I would expect them to be syntactically analyzed in the same way (whatever analysis one chooses). If we analyze them differently, this is a case where orthography unduly influences syntactic analysis. The point is to ensure that the same syntactic phenomena get the same analyses.

martinpopel commented 5 years ago

everyone: If you have not done so yet, read the current Tokenization and Word Segmentation guidelines.

@gcelano: I agree with @akoehn that there is no technical limitation of UD or CoNLL-U which would prevent words with spaces. Indeed, in Vietnamese UD, spaces regularly occur inside words (because they mark syllable boundaries rather than word boundaries).
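
As a constructed illustration (not an actual sentence from the treebank; roughly "students study English"), such a word simply keeps the space inside its FORM and LEMMA columns:

```
# text = Sinh viên học tiếng Anh
1	Sinh viên	sinh viên	NOUN	_	_	2	nsubj	_	_
2	học	học	VERB	_	_	0	root	_	_
3	tiếng Anh	tiếng Anh	NOUN	_	_	2	obj	_	_
```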

In UD a word segmentation is already "imposed" after whitespace-based tokenization

The tokenization in UD is not just "whitespace-based". Punctuation is separated as well. For most treebanks the tokenization rules should be simple. In the CoNLL 2018 shared task, the best tokenization result for most treebanks was > 99%, but the best result averaged over all treebanks was 98.42%, because there were a few languages with low tokenization scores (e.g. Thai, which does not use spaces and for which UD has no training data).

I am not sure what is meant by the claim that in UD a word segmentation is already imposed. In UD, word segmentation is more difficult than tokenization because of multiword tokens. In the CoNLL 2018 shared task, the best word-segmentation result averaged over all treebanks was 98.18% (it may seem close to the 98.42% for tokenization, but it means a 41% increase in errors). For languages like Arabic and Hebrew the difference in difficulty is much higher (tokenization is almost perfect, but word segmentation reaches only 96.81% and 93.98%, respectively).
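
The extra difficulty comes from multiword tokens, where one orthographic token corresponds to several syntactic words. A sketch of the CoNLL-U mechanism, using the Spanish contraction "del" = "de" + "el" (the token IDs and head indices are illustrative):

```
3-4	del	_	_	_	_	_	_	_	_
3	de	de	ADP	_	_	5	case	_	_
4	el	el	DET	_	_	5	det	_	_
```

The range line 3-4 carries the surface form; the following word lines carry the syntactic analysis, which is where word segmentation can diverge from tokenization.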

@akoehn: So "Michael Jackson" is analyzed as two words in the Hamburg Dependency Treebank, but (the rare) "Michael Jackson-fans" as a single word with a space? As always: when you improve consistency for one phenomenon, you decrease it for another. Let's hope your users are more interested in Durchkopplung than in Michael Jackson :-).

akoehn commented 5 years ago

So "Michael Jackson" is analyzed as two words in the Hamburg Dependency Treebank, but (the rare) "Michael Jackson-fans" as a single word with a space?

Exactly. It is not just rare but super-rare (I would guess <= 4 tokens out of 4 million).

gcelano commented 5 years ago

@martinpopel, the argument is that while splitting graphic words for the sake of identifying syntactic words is allowed, merging (usually) is not (even if there are clear cases where this would be required). As a consequence, syntactic analysis in UD has to accommodate this via technical dependencies. While technical dependencies may not be a big problem intralinguistically (depending on the language), they are crosslinguistically.

On a side note: if two graphic words are to be merged, CoNLL-U could handle this rather easily if one directly follows the other; otherwise, it would be much more complex.

amir-zeldes commented 5 years ago

I think the argument for leaving compound modifiers inside tokens in German is not just a practical, but a theoretical one. German compound modifiers often have forms that do not correspond to any independent form of the noun (e.g. compounding truncation for Woll- from Wolle 'wool' in Wollknäuel 'ball of wool', or feminine compound linking 's' in Reinigungsfirma 'cleaning company').

This topic has been studied extensively in German morphology, and the general consensus is that these compounding stems do not constitute 'words' in German, but a kind of bound morpheme. In computational linguistics, the tradition has been very consistent in not making these into tokens. The "Michael Jackson-Fans" example is implementing that guideline by saying 'if this is a compound modifier, it can't be a token'. I'm not necessarily sure it's worth the price of having whitespace in a token for the sake of this rare example, but I think the argument for it is actually one of theoretical consistency.

murawaki commented 5 years ago

Just posted a preprint on Japanese syntactic words in the context of UD Japanese: https://arxiv.org/abs/1906.09719