Open vvi56 opened 9 months ago
In English, our policy is to retain the exact original string for each sentence. UniversalDependencies/UD_English-EWT#83 explains how we mark the token with SpecialEncoding=Yes and use CorrectForm to specify the normal spelling.
@nschneid yes, I see this case (U+00AD) in English-EWT.
Is the zero width space (U+200B) an instance of the space character (in which case it should be skipped), or a character that must be retained as part of the token, like the soft hyphen (U+00AD)?
Should we distinguish between normal spaces (U+0020) and all other spaces (Unicode category Space_Separator: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[:General_Category=Space_Separator:])?
Does it separate two words? If so, I would say it should be encoded in UD with the SpacesAfter feature rather than as part of the word itself.
I just fixed the Portuguese-PUD dataset. Thank you
According to the standard, should space-like characters such as the zero width space (U+200B) be included in the tokens or skipped like the normal space character?
The guidelines say that "spaces" cannot occur in columns other than FORM, LEMMA and MISC, and in FORM and LEMMA they can occur only in expressions specifically defined for the language. However, it is not specified what types of spaces are meant.
The validator uses the \s special character (from the regex module for Python) to identify a "space". It should match any character in the Unicode category Zs ("Separator, space"). It also matches tabs and newlines, which are in Unicode category Cc ("Other, control"), but those are banned anyway and will result in a validation error elsewhere. U+200B ZERO WIDTH SPACE is in category Cf ("Other, format") and is not matched by \s; therefore a UD token with this character is not invalid.
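The category claims above can be checked with a minimal sketch. It uses the standard-library `re` and `unicodedata` modules as a stand-in for the third-party regex module the validator uses; the assumption is that both treat `\s` the same way for the characters listed here. The names `CHARS`, `describe` and `report` are illustrative only.

```python
import re
import unicodedata

# Characters discussed in this thread.
CHARS = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "SOFT HYPHEN": "\u00ad",
    "ZERO WIDTH SPACE": "\u200b",
    "TAB": "\t",
}

def describe(ch):
    """Return (Unicode general category, whether \\s matches the character)."""
    return unicodedata.category(ch), re.fullmatch(r"\s", ch) is not None

report = {name: describe(ch) for name, ch in CHARS.items()}

for name, (category, is_space) in report.items():
    code = f"U+{ord(CHARS[name]):04X}"
    print(f"{name:18} {code} category={category} whitespace={is_space}")
```

Under these assumptions, the no-break space is reported as Zs and matched, while both the zero width space and the soft hyphen are reported as Cf and not matched.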
Nevertheless, the character is intended to separate words rather than to be part of them. So if it is preserved in treebanks, it should probably be a separate token. And if it is a separate token, I cannot see it tagged and attached as anything other than punctuation, although I cannot say I like such a solution.
It would also be possible to say that the validator should consider both \s and \x{200B} as "spaces". But this is an ad-hoc solution and I don't know whether there are other characters (and how many) that would deserve the same treatment. We cannot exclude the whole C category because there are characters that should be allowed inside words.
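A rough sketch of what such an extended check might look like, again with the standard-library `re` module rather than the validator's actual code. The set `EXTRA_SPACE_LIKE` is a hypothetical, deliberately minimal list; which characters belong in it is exactly the open question.

```python
import re

# Assumption: only U+200B is added on top of \s; the real list is undecided.
EXTRA_SPACE_LIKE = "\u200b"

# Character class combining regular whitespace with the extra characters.
SPACE_LIKE = re.compile("[\\s" + re.escape(EXTRA_SPACE_LIKE) + "]")

def contains_space_like(value):
    """Return True if the value contains whitespace or an extra space-like character."""
    return SPACE_LIKE.search(value) is not None

print(contains_space_like("毒\u200b\u200b物"))      # zero width spaces inside a word
print(contains_space_like("soft\u00adhyphen"))      # soft hyphen is left alone
```

Listing the extra characters explicitly, instead of excluding all of category C, keeps characters such as the soft hyphen allowed inside words.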
Here is the list of all instances of \x{200B} in FORM and LEMMA columns in UD 2.13:
be_hse-ud-{dev,train}.conllu : ~20 times, always in the same context
# text = <a_href="https://telegra.ph/file/ff4e408b49529ff1b14dc.jpg"><200b><200b></a><strong>Гэта сенсацыя!
1 <a_href="https://telegra.ph/file/ff4e408b49529ff1b14dc.jpg"> <a_href="https://telegra.ph/file/ff4e408b49529ff1b14dc.jpg"> X X _ 2 dep 2:dep SpaceAfter=No
2 <200b><200b> <200b><200b> X X _ 6 parataxis 6:parataxis SpaceAfter=No
3 </a> </a> X X _ 2 dep 2:dep SpaceAfter=No
ru_gsd-ud-train.conllu : twice in one sentence
# sent_id = train-s1679
# text = Тем не менее, выборы на следующий год принесли победу Немецкой демократической партии (DDP) и Кёлер вернулся на свой <200b> <200b> пост в качестве министра финансов.
...
21 свой свой DET PRP$ Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing 24 det _ _
22 <200b> <200b> X FW _ 24 amod _ _
23 <200b> <200b> X FW _ 24 amod _ _
24 пост пост NOUN NN Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing 19 obl _ _
...
tr_penn-ud-train.conllu : 4 times in different sentences
5 <200b><200b>meslektaşlarına <200b><200b>meslektaşlarına NOUN _ Case=Nom|Number=Sing|Person=3 6 obl _ _
10 <200b><200b>ayarlamak <200b><200b>ayarlamak NOUN _ Case=Nom|Number=Sing|Person=3 11 xcomp _ _
12 <200b><200b>yaptı <200b><200b>yaptı NOUN _ Case=Nom|Number=Sing|Person=3 4 conj _ _
6 <200b><200b>otomobil <200b><200b>otomobil NOUN _ Case=Nom|Number=Sing|Person=3 7 nmod _ _
zh_{gsd,gsdsimp}-ud-train.conllu : once
3 毒<200b><200b>物 毒<200b><200b>物 NOUN NN _ 21 nsubj _ SpaceAfter=No|Translit=dú<200b><200b>wù|LTranslit=dú<200b><200b>wù
pt_pud-ud-test.conllu : once, already fixed
There is only one character from the Space_Separator category in the FORM and LEMMA columns other than a normal space: \x{00A0} (NO-BREAK SPACE):
br_keb-ud-test.conllu :
9 100<00A0>000 100 000 NUM num Number=Plur 6 nsubj _ _
6 16<00A0>345 16 345 NUM num Number=Plur 1 appos _ _
All other spaces in FORM and LEMMA columns are normal spaces (\x{0020}) used to encode multiwords, numbers and formulas. There are ~13K occurrences of the normal space character in many corpora.
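A survey like the one above could be reproduced with something along these lines. This is only a sketch, not the script actually used: it assumes tab-separated CoNLL-U lines with FORM and LEMMA in the second and third columns, and the function name `find_unusual_spaces` and the `sample` data are made up for illustration.

```python
import re

# Whitespace other than U+0020, plus U+200B (which \s does not match).
UNUSUAL = re.compile(r"[^\S ]|\u200b")

def find_unusual_spaces(lines):
    """Return (line number, column name, code points) for each FORM/LEMMA hit."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) < 3:
            continue  # not a token line
        for name, value in (("FORM", cols[1]), ("LEMMA", cols[2])):
            found = UNUSUAL.findall(value)
            if found:
                hits.append((line_no, name, [f"U+{ord(ch):04X}" for ch in found]))
    return hits

sample = [
    "# text = example",
    "9\t100\u00a0000\t100 000\tNUM\tnum\tNumber=Plur\t6\tnsubj\t_\t_",
    "3\t毒\u200b\u200b物\t毒\u200b\u200b物\tNOUN\tNN\t_\t21\tnsubj\t_\t_",
    "",
]

for hit in find_unusual_spaces(sample):
    print(hit)
```

The normal space in the LEMMA `100 000` is deliberately not reported, since normal spaces are permitted in language-specific multiword expressions and numbers.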
It would also be possible to say that the validator should consider both \s and \x{200B} as "spaces". But this is an ad-hoc solution and I don't know whether there are other characters (and how many) that would deserve the same treatment. We cannot exclude the whole C category because there are characters that should be allowed inside words.
A refined rule for columns FORM and LEMMA could probably be: "The value cannot begin or end with a space-like character".
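As an illustration only, the proposed rule could be sketched as follows. Here "space-like" is assumed to mean anything matched by `\s` in the standard-library `re` module plus U+200B; both that definition and the helper names are hypothetical.

```python
import re

# Assumption: "space-like" = regular whitespace or ZERO WIDTH SPACE.
SPACE_LIKE = re.compile(r"[\s\u200b]")

def is_space_like(ch):
    """Return True for a single whitespace or zero width space character."""
    return SPACE_LIKE.fullmatch(ch) is not None

def has_space_like_edge(value):
    """Return True if a non-empty value begins or ends with a space-like character."""
    return bool(value) and (is_space_like(value[0]) or is_space_like(value[-1]))

print(has_space_like_edge("\u200b\u200bmeslektaşlarına"))  # leading zero width spaces
print(has_space_like_edge("毒\u200b\u200b物"))              # only word-internal ones
```

Note that a rule of this shape would flag the Turkish forms with leading U+200B and tokens consisting solely of U+200B, but not the word-internal occurrence in the Chinese example.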
According to the standard, should space-like characters such as the zero width space (U+200B) be included in the tokens or skipped like the normal space character?
Here are some examples:
be_hse-ud-dev.conllu
tr_penn-ud-train.conllu
pt_pud-ud-test.conllu