Closed AngledLuffa closed 6 months ago
@rhdunn suggests a longer sequence would work:
https://github.com/UniversalDependencies/UD_English-EWT/issues/418#issuecomment-1783836598
Quoting @rhdunn:
NBSP has the Unicode value of U+00A0, so would something like \x00A0 work?
I don't know the background on the SpacesAfter policies in other treebanks, but this sounds fine to me.
The background of SpacesAfter is that it was introduced by @foxik in #332, so that UDPipe can represent the original plain text (even without the text comment). I think other treebanks just adopted it (possibly because of using UDPipe for pre-annotation). If you parse a sentence with NBSP with UDPipe, it puts it into SpacesAfter without any escaping:
# generator = UDPipe 2, https://lindat.mff.cuni.cz/services/udpipe
# udpipe_model = english-ewt-ud-2.12-230717
# udpipe_model_licence = CC BY-NC-SA
# newdoc
# newpar
# sent_id = 1
# text = A B
1 A A SYM SYM Number=Sing 2 compound _ SpacesAfter= |TokenRange=0:1
2 B B PROPN NNP Number=Sing 0 root _ SpaceAfter=No|TokenRange=2:3
(GitHub Markdown changes the NBSP into a normal space, but the character after SpacesAfter= is actually the NBSP.)
Validating this file results only in a warning [L2 Warning misc-extra-space].
However, if you run UDPipe without the "Save token ranges" option, there are no TokenRange attributes and the validator also gives an error [L1 Format trailing-whitespace] in addition to the warning. This is bad because users expect that UDPipe produces no errors with validate.py --level 1.
So either we add an exception to the validator for this use case, or we agree on an escape for NBSP (e.g. \x00A0). I would strongly suggest consulting the author of UDPipe, @foxik. If \x00A0 is the agreed escape, it should be documented, stating explicitly whether any Unicode character can be escaped this way (perhaps except \s (space), \t (TAB), \r (CR), \n (LF), \p (pipe), \\ (backslash), which are already defined and used) or whether only \x00A0 is allowed.
The validator allows spaces because there can be transliteration of words with spaces or other attributes whose values are sequences of tokens, but those only need spaces in the middle of the value; that's why leading and trailing spaces are reported.
Perhaps it could be turned into a warning also when there are trailing spaces at the end of the MISC column.
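To make the leading/trailing rule concrete, here is a minimal Python sketch of such a check. It is not the code from validate.py; the function name `edge_space_attributes` is made up for illustration, and it assumes the MISC column has already been split off from the rest of the line.

```python
def edge_space_attributes(misc):
    """Return the names of MISC attributes whose value starts or ends with whitespace.

    Hypothetical helper for illustration only; not part of validate.py.
    """
    flagged = []
    if misc == "_":
        return flagged
    for attribute in misc.split("|"):
        name, _, value = attribute.partition("=")
        # str.isspace() is True for NBSP (U+00A0) as well as for the ASCII space.
        if value and (value[0].isspace() or value[-1].isspace()):
            flagged.append(name)
    return flagged


print(edge_space_attributes("SpacesAfter=\u00a0|TokenRange=0:1"))
print(edge_space_attributes("Translit=a b|SpaceAfter=No"))
```

Spaces in the middle of a value (the transliteration case) pass, while a value that consists only of an NBSP is flagged.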
The only characters that really must be escaped are TAB and LF. Also |, but that is not a space character, so it should not occur in SpacesAfter. It would probably be wise to also escape CR and other control characters, although things like form feed or line tabulation probably won't occur. But from codepoint 32 (inclusive) up, they can probably be put there unescaped.
Interestingly, when I print the list of characters that match m/\s/ (which is the regular expression that the validator would use), I don't see the regular NBSP (codepoint 160) there:
9 CHARACTER TABULATION
10 LINE FEED
11 LINE TABULATION
12 FORM FEED
13 CARRIAGE RETURN
32 SPACE
5760 OGHAM SPACE MARK
6158 MONGOLIAN VOWEL SEPARATOR
8192 EN QUAD
8193 EM QUAD
8194 EN SPACE
8195 EM SPACE
8196 THREE-PER-EM SPACE
8197 FOUR-PER-EM SPACE
8198 SIX-PER-EM SPACE
8199 FIGURE SPACE
8200 PUNCTUATION SPACE
8201 THIN SPACE
8202 HAIR SPACE
8232 LINE SEPARATOR
8233 PARAGRAPH SEPARATOR
8239 NARROW NO-BREAK SPACE
8287 MEDIUM MATHEMATICAL SPACE
12288 IDEOGRAPHIC SPACE
But m/\pZ/ gives a different result:
32 SPACE
160 NO-BREAK SPACE
5760 OGHAM SPACE MARK
6158 MONGOLIAN VOWEL SEPARATOR
8192 EN QUAD
8193 EM QUAD
8194 EN SPACE
8195 EM SPACE
8196 THREE-PER-EM SPACE
8197 FOUR-PER-EM SPACE
8198 SIX-PER-EM SPACE
8199 FIGURE SPACE
8200 PUNCTUATION SPACE
8201 THIN SPACE
8202 HAIR SPACE
8232 LINE SEPARATOR
8233 PARAGRAPH SEPARATOR
8239 NARROW NO-BREAK SPACE
8287 MEDIUM MATHEMATICAL SPACE
12288 IDEOGRAPHIC SPACE
The \pZ regex matches the Z Unicode general category [1]. That does not include 9-13, as those are control characters (Cc). The White_Space property [2] includes those, though I'm not sure what regex would match them.
[1] https://www.unicode.org/reports/tr44/#General_Category_Values
[2] https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
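The general-category distinction can also be checked from Python's standard unicodedata module. This is only a sketch: the helper name `code_points_in_category` is an assumption, and the result depends on the Unicode version bundled with the Python interpreter, so it may not match the lists above exactly.

```python
import unicodedata


def code_points_in_category(prefix, limit=0x3001):
    """List code points below `limit` whose general category starts with `prefix`."""
    return [
        cp for cp in range(limit)
        if unicodedata.category(chr(cp)).startswith(prefix)
    ]


separators = code_points_in_category("Z")   # Zs, Zl and Zp
controls = code_points_in_category("Cc")    # includes code points 9-13

for cp in separators:
    print(cp, unicodedata.name(chr(cp)))
```

NO-BREAK SPACE (160) is in the Z group, while TAB, LF, line tabulation, form feed and CR (9-13) fall under Cc.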
Any last minute thoughts? (Seeing as how the deadline is tomorrow and we've had this NBSP character in EWT for a few releases now)
Of these choices, I like \x00A0 the most, as it plays nicely with existing tools, is very readable, and probably won't come up that often anyway (most other treebanks have been happy to turn NBSP into a regular space!)
On second thought, I don't think it is wise to allow trailing spaces in CoNLL-U. Some people (including myself) have set up their text editors to automatically remove trailing spaces, which could damage a CoNLL-U file if it can legitimately contain them. One could say that it is my problem and that I should switch that setting back and forth depending on whether I'm editing CoNLL-U or something else, but the fact that spaces are normally invisible speaks in favor of escaping them (if the line seemingly ends with SpacesAfter=, you would have to move your cursor there to see if the value is empty string or a space or several spaces... not to speak about distinguishing the different kinds of spaces).
Also, the NBSP is not the only problematic character. One could argue that https://www.fileformat.info/info/unicode/char/2028/index.htm can be a similar problem (formally it is allowed anywhere, but some tools will consider it a line break); similarly https://www.fileformat.info/info/unicode/char/2029/index.htm, and any character of the Zs category https://www.fileformat.info/info/unicode/category/Zs/list.htm.
So adopting a universal approach to encoding such characters seems better than designing a special escape for NBSP. However, I would not use \x00A0 (it has complicated semantics in C/C++, and in Python, for example, only two hex digits are allowed after \x); instead, I would go for \u00A0 with exactly four digits, which is used by JSON, Python, C/C++, ...
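The Python behaviour mentioned here is easy to see in a string literal. A small sketch (the variable names are arbitrary):

```python
two_digit = "\xA0"          # one character: U+00A0
four_digit = "\x00A0"       # three characters: U+0000 followed by "A" and "0"
unicode_escape = "\u00A0"   # one character: U+00A0

print(len(two_digit), len(four_digit), len(unicode_escape))
```

So a reader who pastes \x00A0 into Python source gets a NUL character plus two letters, not an NBSP, whereas \u00A0 means the same thing in both places.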
I think that's ok, as AFAICT all Unicode whitespaces are in the Basic Multilingual Plane (BMP) so fit into 4 hex digits. Otherwise, the \u... syntax would need to support 4, 6, or 8 digits. Note that unlike e.g. Java which is UTF-16, the CoNLL-U files are UTF-8 so won't support UTF-16 surrogate pairs for characters beyond the BMP.
Yes, for characters beyond the BMP, we could use \Uxxxxxxxx (as in C++ and Python). Alternatively, we could use the JavaScript/Rust syntax \u{x...x}, with up to 6 hex digits. In JavaScript, you can have either exactly four digits after \u, so \uABCD, or you can include the braces, and then the size is dynamic, so for example \u{A0}; in Rust, the braces are required.
I do not feel strongly about it, whether to use the C/Python style with \u and \U, or use \u{...} with braces.
I agree with @foxik's suggestion. Just to make sure: space, tab, CR, LF, pipe and backslash will still be required to be encoded using the legacy escape codes (\s, \t, \r, \n, \p and \\, respectively). Only other non-printable characters will have to be encoded with the \u (or \U or \u{...}) escape codes. Am I right?
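As a rough illustration of that reading, an encoder could look like the sketch below. Nothing here is agreed policy: the function name `escape_spaces_after` is hypothetical, "non-printable" is approximated with Python's `str.isprintable()`, and the \u / \U variant is picked arbitrarily over \u{...}.

```python
# Legacy escapes already defined for SpacesAfter.
LEGACY_ESCAPES = {
    " ": "\\s", "\t": "\\t", "\r": "\\r", "\n": "\\n", "|": "\\p", "\\": "\\\\",
}


def escape_spaces_after(value):
    """Escape a raw whitespace string for use as a SpacesAfter value (sketch)."""
    out = []
    for ch in value:
        if ch in LEGACY_ESCAPES:
            out.append(LEGACY_ESCAPES[ch])
        elif not ch.isprintable():
            # Assumption: isprintable() is False for separators such as NBSP
            # and for control characters, which is what we want to escape.
            cp = ord(ch)
            out.append("\\u%04X" % cp if cp <= 0xFFFF else "\\U%08X" % cp)
        else:
            out.append(ch)
    return "".join(out)


print(escape_spaces_after("\u00a0"))
print(escape_spaces_after(" \n"))
```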
Well, I would definitely keep the \s \t \r \n \p \\ escapes (that is common in all languages I know), but I am not sure we would need to enforce that they have to be used.
\u0020 or \u{20}
\n to \u{0A}); the same for UDPipe
So maybe saying that the legacy codes are preferred, but \u can be used to encode anything? Or maybe explicitly saying the validator would enforce that \u is used only when necessary, but the format itself supports any character to be encoded using \u?
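A lenient reader along those lines could accept both the legacy codes and any of the \u forms discussed above. The following is a sketch under that assumption; `unescape_spaces_after` is a made-up name and the set of accepted forms (\uXXXX, \UXXXXXXXX, \u{...}) is not something the thread has settled.

```python
import re

# Legacy escape letters and the characters they stand for.
LEGACY_UNESCAPES = {"s": " ", "t": "\t", "r": "\r", "n": "\n", "p": "|", "\\": "\\"}

ESCAPE = re.compile(
    r"\\(?:u\{([0-9A-Fa-f]{1,6})\}"   # braced form, e.g. \u{A0}
    r"|u([0-9A-Fa-f]{4})"             # exactly four digits, e.g. \u00A0
    r"|U([0-9A-Fa-f]{8})"             # exactly eight digits, e.g. \U000000A0
    r"|([strnp\\]))"                  # legacy escapes
)


def unescape_spaces_after(value):
    """Decode an escaped SpacesAfter value back into raw characters (sketch)."""
    def replace(match):
        braced, four, eight, legacy = match.groups()
        if legacy is not None:
            return LEGACY_UNESCAPES[legacy]
        return chr(int(braced or four or eight, 16))
    return ESCAPE.sub(replace, value)


print(repr(unescape_spaces_after("\\u00A0")))
print(repr(unescape_spaces_after("\\s\\n")))
```

With a decoder like this, \u0020 and \s both come back as a plain space, so whether the legacy codes are merely preferred or strictly required becomes a validator policy question rather than a format question.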
The validator could possibly do it but so far it does not care about the meaning of SpacesAfter, so if it says anything about it, it is the common warnings.
There's a sentence in English EWT which has an NBSP in it:
newsgroup-groups.google.com_n3td3v_e874a1e5eb995654_ENG_20060120_052200-0011
in the test set
Quite some time ago we introduced SpacesAfter to account for whitespace other than a single space after a token:
https://github.com/UniversalDependencies/docs/issues/332
The SpacesAfter misc annotation allows for alternate space formulations, but NBSP is not part of it. The problem is that the validator does not like actual whitespace in the misc column. I propose that we add another escape sequence to cover NBSP. @nschneid looked at possibly putting it in quotes, but I believe that is not as clean as using an escape sequence of some kind.
Perhaps \b would work, as the only character in NBSP which does not already represent something. \S would be another option, but maybe less appealing since lowercase \s already means something