Zero width spaces (U+200b) inside the token

vvi56 commented 9 months ago

According to the standard, should space-like characters such as the zero width space (U+200B) be included in tokens or skipped like the normal space character?

Here are some examples:

be_hse-ud-dev.conllu
tr_penn-ud-train.conllu
pt_pud-ud-test.conllu

nschneid commented 9 months ago

In English our policy is to retain the exact original string for each sentence. UniversalDependencies/UD_English-EWT#83 explains how we mark the token with SpecialEncoding=Yes and use CorrectForm to specify the normal spelling.

vvi56 commented 9 months ago

@nschneid yes, I see this case (U+00AD) in English-EWT.

Is the zero width space (U+200B) an instance of a space character (in which case it should be skipped), or is it a character that must be retained as part of the token, like the soft hyphen (U+00AD)?

Should we distinguish between normal spaces (U+0020) and all other spaces (Unicode category Space_Separator: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[:General_Category=Space_Separator:] )?
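For reference, the members of that category can also be listed locally. This is only a sketch using Python's standard `unicodedata` module; the helper name `space_separators` is made up for illustration, and the exact result depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata

def space_separators():
    """Return every character whose General_Category is Zs (Space_Separator)."""
    return [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Zs"]

# Print code point and official name for each Space_Separator character.
for ch in space_separators():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

U+0020 and U+00A0 appear in this list; U+200B does not.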

nschneid commented 9 months ago

Does it separate two words? If so I would say it should be encoded in UD with the SpacesAfter feature rather than as part of the word itself.

arademaker commented 9 months ago

I just fixed the Portuguese-PUD dataset. Thank you

dan-zeman commented 9 months ago

> According to the standard, should space-like characters such as the zero width space (U+200B) be included in tokens or skipped like the normal space character?

The guidelines say that "spaces" cannot occur in columns other than FORM, LEMMA and MISC, and in FORM and LEMMA they can occur only in expressions specifically defined for the language. However, it is not specified what types of spaces are meant.

The validator uses the \s special character (from the regex module for Python) to identify a "space". It should match any character in the Unicode category Zs ("Separator, space"). It also matches tabs and newlines, which are in Unicode category Cc ("Other, control"), but those are banned anyway and will result in a validation error elsewhere. U+200B ZERO WIDTH SPACE is in category Cf ("Other, format") and is not matched by \s; therefore a UD token containing this character is not invalid.
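The behaviour described above can be checked with a few lines of Python. Note the assumptions: this sketch uses the standard-library `re` module rather than the third-party `regex` module the validator relies on (I assume the two agree on these particular characters), and the helper `describe` and the `SAMPLES` table are illustrative only.

```python
import re
import unicodedata

# A few characters discussed in this thread.
SAMPLES = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00A0",
    "CHARACTER TABULATION": "\t",
    "ZERO WIDTH SPACE": "\u200B",
    "SOFT HYPHEN": "\u00AD",
}

def describe(ch):
    """Return (general category, True if the whitespace class matches ch)."""
    return unicodedata.category(ch), re.fullmatch(r"\s", ch) is not None

for name, ch in SAMPLES.items():
    cat, is_space = describe(ch)
    print(f"U+{ord(ch):04X} {name}: category={cat}, matches \\s: {is_space}")
```

With `re`, the space, the no-break space and the tab are matched, while the zero width space and the soft hyphen (both Cf) are not.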

Nevertheless, the character is still intended to separate words rather than to be part of them. So if it is preserved in treebanks, it should probably be a separate token. And if it is a separate token, I cannot see it tagged and attached as anything other than punctuation, although I cannot say I like such a solution.

It would also be possible to say that the validator should consider both \s and \x{200B} as "spaces". But this is an ad-hoc solution and I don't know whether there are other characters (and how many) that would deserve the same treatment. We cannot exclude the whole C category because there are characters that should be allowed inside words.
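Just to make the ad-hoc variant concrete, a minimal sketch follows. It is not what the validator does: `SPACE_LIKE` and `has_space_like` are hypothetical names, and it again uses the standard `re` module instead of `regex`.

```python
import re

# Hypothetical extended class: everything the whitespace class matches,
# plus ZERO WIDTH SPACE. Other Cf characters (e.g. the soft hyphen) stay allowed.
SPACE_LIKE = re.compile(r"[\s\u200B]")

def has_space_like(value):
    """True if value contains whitespace or a zero width space."""
    return SPACE_LIKE.search(value) is not None
```

Any further character that deserves the same treatment would have to be added to the class by hand, which is exactly the weakness of this approach.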

vvi56 commented 9 months ago

Here is the list of all instances of \x{200B} in FORM and LEMMA columns in UD 2.13:

be_hse-ud-{dev,train}.conllu : ~20 times, always in the same context

# text = <a_href="https://telegra.ph/file/ff4e408b49529ff1b14dc.jpg"><200b><200b></a><strong>Гэта сенсацыя!
1   <a_href="https://telegra.ph/file/ff4e408b49529ff1b14dc.jpg">    <a_href="https://telegra.ph/file/ff4e408b49529ff1b14dc.jpg">          X   X   _   2   dep 2:dep   SpaceAfter=No
2   <200b><200b>    <200b><200b>    X   X   _   6   parataxis   6:parataxis SpaceAfter=No
3   </a>    </a>    X   X   _   2   dep 2:dep   SpaceAfter=No

ru_gsd-ud-train.conllu : twice in one sentence

# sent_id = train-s1679
# text = Тем не менее, выборы на следующий год принесли победу Немецкой демократической партии (DDP) и Кёлер вернулся на свой <200b> <200b> пост в качестве министра финансов.
...
21  свой    свой    DET PRP$    Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing   24  det _   _
22  <200b>  <200b>  X   FW  _   24  amod    _   _
23  <200b>  <200b>  X   FW  _   24  amod    _   _
24  пост    пост    NOUN    NN  Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing   19  obl _   _
...

tr_penn-ud-train.conllu : 4 times in different sentences

5   <200b><200b>meslektaşlarına <200b><200b>meslektaşlarına NOUN    _   Case=Nom|Number=Sing|Person=3   6   obl _   _
10  <200b><200b>ayarlamak   <200b><200b>ayarlamak   NOUN    _   Case=Nom|Number=Sing|Person=3   11  xcomp   _   _
12  <200b><200b>yaptı   <200b><200b>yaptı   NOUN    _   Case=Nom|Number=Sing|Person=3   4   conj    _   _
6   <200b><200b>otomobil    <200b><200b>otomobil    NOUN    _   Case=Nom|Number=Sing|Person=3   7   nmod    _   _

zh_{gsd,gsdsimp}-ud-train.conllu : once

3   毒<200b><200b>物    毒<200b><200b>物    NOUN    NN  _   21  nsubj   _   SpaceAfter=No|Translit=dú<200b><200b>wù|LTranslit=dú<200b><200b>wù

pt_pud-ud-test.conllu : once, already fixed

vvi56 commented 9 months ago

There is only one character from the Space_Separator category in the FORM and LEMMA columns other than a normal space: \x{00A0} (NO-BREAK SPACE):

br_keb-ud-test.conllu :

9   100<00A0>000 100 000 NUM num Number=Plur 6   nsubj   _   _
6   16<00A0>345  16 345  NUM num Number=Plur 1   appos   _   _

All other spaces in FORM and LEMMA columns are normal spaces (\x{0020}) used to encode multiwords, numbers and formulas. ~13K occurrences of the normal space character in many corpora.
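A search of this kind can be reproduced roughly as follows. This is a sketch, not the script actually used for the counts above: the function names are invented, and it assumes standard tab-separated CoNLL-U with ten columns per token line.

```python
import unicodedata

ZWSP = "\u200B"

def unusual_spaces(value):
    """Return space-like characters in value other than the ordinary U+0020."""
    return [ch for ch in value
            if ch == ZWSP or (ch != " " and unicodedata.category(ch) == "Zs")]

def scan_conllu(lines):
    """Yield (line_no, token_id, column, code_points) for hits in FORM/LEMMA."""
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        for name, value in (("FORM", cols[1]), ("LEMMA", cols[2])):
            hits = unusual_spaces(value)
            if hits:
                yield line_no, cols[0], name, [f"U+{ord(c):04X}" for c in hits]
```

To run it on a treebank file, open the file with `encoding="utf-8"` and pass the file object to `scan_conllu`; each yielded tuple identifies one affected FORM or LEMMA value.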

vvi56 commented 9 months ago

> It would also be possible to say that the validator should consider both \s and \x{200B} as "spaces". But this is an ad-hoc solution and I don't know whether there are other characters (and how many) that would deserve the same treatment. We cannot exclude the whole C category because there are characters that should be allowed inside words.

A refined rule for columns FORM and LEMMA could probably be: "The value cannot begin or end with a space-like character".
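A possible reading of that rule, as a sketch only: here "space-like" is assumed to mean anything `str.isspace()` accepts (used as a stand-in for \s) plus U+200B, and both function names are hypothetical.

```python
def is_space_like(ch):
    """Assumed definition: Unicode whitespace or ZERO WIDTH SPACE."""
    return ch.isspace() or ch == "\u200B"

def violates_edge_rule(value):
    """True if a FORM/LEMMA value begins or ends with a space-like character."""
    return bool(value) and (is_space_like(value[0]) or is_space_like(value[-1]))
```

Under this reading, the Turkish forms that start with U+200B and the Russian tokens consisting only of U+200B would be reported, while the Chinese token with U+200B in the middle and the Breton numbers with an internal U+00A0 would pass.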

UniversalDependencies / docs

Zero width spaces (U+200b) inside the token #1010