UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Tokenization of space-separated ellipsis. #988

Closed by rhdunn 1 week ago

rhdunn commented 1 year ago

There are generally 3 ways to specify an ellipsis in text:

  1. as a sequence of 3 (or more) full-stop/period characters without spaces between them, e.g. ...;
  2. as a sequence of 3 (or more) full-stop/period characters with spaces between them, e.g. . . .;
  3. as a Unicode ellipsis character (U+2026), e.g. ….
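
As a rough illustration of the three forms listed above, they could be told apart with a few regular expressions. This is only a sketch: the names `ELLIPSIS_FORMS` and `classify_ellipsis` are hypothetical and are not part of any UD or EWT tooling, and the spaced form is assumed to use plain ASCII spaces.

```python
import re

# Hypothetical names; each pattern mirrors one of the three forms above.
ELLIPSIS_FORMS = [
    ("unspaced", re.compile(r"\.{3,}")),        # form 1: "..."
    ("spaced", re.compile(r"\.(?: \.){2,}")),   # form 2: ". . ."
    ("unicode", re.compile("\u2026")),          # form 3: U+2026
]


def classify_ellipsis(text):
    """Return the name of the form matching the whole string, or None."""
    for name, pattern in ELLIPSIS_FORMS:
        if pattern.fullmatch(text):
            return name
    return None
```

For example, `classify_ellipsis(". . .")` returns `"spaced"`, while a two-dot string such as `".."` returns `None`.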

a) For the second case, the ellipsis is tokenized in EWT as 3 (or more) separate tokens. This is consistent with a space-based tokenizer, but is inconsistent with tokenizing the other cases as a single ellipsis token. -- Q: Should these be a single token?

My preference is yes, as they are linguistically a single punctuation token and can be substituted for any of the other forms while remaining equivalent.

b) For cases 1 and 2, where there are 4 (or more) . characters at the end of a sentence, should this be a single ellipsis as is currently annotated, or an ellipsis of n-1 . characters and a separate . token to end the sentence?

Linguistically, I would say it is the latter, but that makes it difficult to tokenize in a single pass (although you have that issue with abbreviations such as "Miss. Austen wrote English fiction.").
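
The "n-1 plus a final period" alternative could be sketched as follows. This is a minimal illustration, not how EWT is tokenized: `split_final_period` is a hypothetical helper, it assumes the input is already known to be a sentence-final run of periods, and for the spaced form it keeps the ellipsis as one string containing spaces.

```python
def split_final_period(run):
    """Split a run of 4+ periods into [ellipsis, "."]; otherwise return [run]."""
    spaced = " " in run
    dots = run.split(" ") if spaced else list(run)
    # Leave anything that is not a clean run of at least four periods alone.
    if len(dots) < 4 or any(d != "." for d in dots):
        return [run]
    sep = " " if spaced else ""
    return [sep.join(dots[:-1]), "."]
```

With this sketch, `"...."` becomes `["...", "."]`, while a three-dot run is returned unchanged as a single item.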

NOTE: EWT has several single-token ellipses that are labelled as SYM+NFP instead of as PUNCT+. or PUNCT+, like the other ellipsis tokens.
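
One way to look for this kind of inconsistency is to tally the UPOS+XPOS pairs of ellipsis tokens in a CoNLL-U file. The sketch below assumes the standard ten tab-separated CoNLL-U columns (FORM in column 2, UPOS in column 4, XPOS in column 5); `tally_ellipsis_tags` is a hypothetical helper, and it only considers unspaced and Unicode ellipses that appear as a single token.

```python
import re
from collections import Counter

# Three or more periods, or the Unicode ellipsis character U+2026.
ELLIPSIS = re.compile(r"\.{3,}|\u2026")


def tally_ellipsis_tags(conllu_text):
    """Count (UPOS, XPOS) pairs for single-token ellipses in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        form, upos, xpos = cols[1], cols[3], cols[4]
        if ELLIPSIS.fullmatch(form):
            counts[(upos, xpos)] += 1
    return counts
```

The result is a `Counter` keyed by tag pair, so the rarer pairs are easy to pick out for manual review.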

nschneid commented 1 year ago

Thanks for pointing these out.

a) For the second case, the ellipsis is tokenized in EWT as 3 (or more) separate tokens. This is consistent with a space-based tokenizer, but is inconsistent with tokenizing the other cases as a single ellipsis token. -- Q: Should these be a single token?

The UD tokenization/word segmentation policy is very strict about prohibiting spaces within syntactic words (i.e. units that have a dependency relation). As far as I'm aware, the only exceptions are 1) languages like Vietnamese where spaces indicate syllable rather than word boundaries, and 2) spaces within numerals for readability like "1 000 000" (where other orthographies might use commas or periods as separators). IMO, some variation in the spacing of punctuation marks is not important enough to warrant an additional exception.

b) For cases 1 and 2, where there are 4 (or more) . characters at the end of a sentence, should this be a single ellipsis as is currently annotated, or an ellipsis of n-1 . characters and a separate . token to end the sentence?

I don't necessarily have a strong opinion on this, and the policy may vary depending on whether it is a well-edited genre. Considering that EWT consists of web text, I wouldn't expect the use of three vs. four (or more) dots to be completely standard. We see things like

How many ellipses and/or periods is that? For EWT, at least, I'm happy with the current, simple approach that lumps them all together as one PUNCT token.

NOTE: EWT has several single-token ellipses that are labelled as SYM+NFP instead of as PUNCT+. or PUNCT+, like the other ellipsis tokens.

I'm just seeing these 5 which are standalone "sentences". Maybe this should be addressed as part of UniversalDependencies/UD_English-EWT#415.


As a general matter, I'd say UD is concerned with morphosyntax proper and less developed when it comes to issues like punctuation. If there are simple ways to make the analysis of punctuation cleaner/more consistent, then great, but we are cautious about departing from standards assumed by tokenizers—it will cause problems for parsers.

amir-zeldes commented 1 year ago

Another option for a) is simply to use goeswith (spelled apart, but should have been spelled together).
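
To make the goeswith option concrete, here is a small sketch of the attachment pattern for a spaced ellipsis: the first period keeps the relation to the rest of the sentence, and each later period attaches to the first one with goeswith. The helper `goeswith_rows` and its simplified `(ID, FORM, HEAD, DEPREL)` tuples are hypothetical; they are not full CoNLL-U rows, and `punct` as the default relation for the first period is an assumption.

```python
def goeswith_rows(first_id, n_dots, head, deprel="punct"):
    """Build simplified (ID, FORM, HEAD, DEPREL) rows for a spaced ellipsis."""
    # The first period carries the relation to the rest of the sentence.
    rows = [(first_id, ".", head, deprel)]
    # Every following period attaches to the first one via goeswith.
    for offset in range(1, n_dots):
        rows.append((first_id + offset, ".", first_id, "goeswith"))
    return rows
```

For a three-period ellipsis starting at token 5 and attached to token 2, `goeswith_rows(5, 3, 2)` yields one `punct` row followed by two `goeswith` rows that both point at token 5.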

arademaker commented 1 year ago

IMO, some variation in the spacing of punctuation marks is not important enough to warrant an additional exception.

Why? I like the idea of “. . .” as a single token.

amir-zeldes commented 1 year ago

Why? I like the idea of “. . .” as a single token.

I think goeswith is essentially UD's way of saying that when there's a space in the string.

sylvainkahane commented 1 year ago

But goeswith is used when there is a misspelling, no? Do we consider that “. . .” is a misspelling?

martinpopel commented 1 year ago

Do we consider that “. . .” is a misspelling?

Yes, according to most typography guidelines, including a Czech one and an English one. That said, CMOS used to recommend using "three periods plus two nonbreaking spaces" - which could result in the same visual output as the Unicode ellipsis symbol if you hack the kerning rules (some fonts include these hacked kerning rules because their users could not use Unicode).

BTW: The Czech guideline says that in case of omitted characters, we can use as many dots as there are omitted characters, e.g. "Soviet cosmonaut G......" (Gagarin).

Seeing this issue (and similar recent issues) makes me feel good: UD treebanks seem to be so consistently annotated in the important aspects that we can invest our time in such nitpicking. And then I look into the data and the feeling disappears.

amir-zeldes commented 1 year ago

And then I look into the data and the feeling disappears.

I feel your pain 😂

Do we consider that “. . .” is a misspelling?

If a student submitted a paper draft to me with that, I would correct it.