UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

NBSP in the SpacesAfter misc annotation #982

Closed AngledLuffa closed 6 months ago

AngledLuffa commented 1 year ago

There's a sentence in English EWT which has an NBSP in it:

newsgroup-groups.google.com_n3td3v_e874a1e5eb995654_ENG_20060120_052200-0011 in the test set

Quite some time ago we introduced SpacesAfter to account for whitespace other than a single space after a token:

https://github.com/UniversalDependencies/docs/issues/332

The SpacesAfter misc annotation allows for alternate space formulations, but NBSP is not part of it. The problem is that the validator does not like actual whitespace in the misc column. I propose that we add another escape sequence to cover NBSP. @nschneid looked at possibly putting it in quotes, but I believe that is not as clean as using an escape sequence of some kind.

Perhaps \b would work, since b is the only letter in NBSP that does not already stand for an escape. \S would be another option, but maybe less appealing since lowercase \s already means something.

AngledLuffa commented 1 year ago

@rhdunn suggests a longer sequence would work:

https://github.com/UniversalDependencies/UD_English-EWT/issues/418#issuecomment-1783836598

nschneid commented 1 year ago

Quoting @rhdunn:

NBSP has the Unicode value of U+00A0, so would something like \x00A0 work?

I don't know the background on the SpacesAfter policies in other treebanks but this sounds fine to me.

martinpopel commented 1 year ago

The background of SpacesAfter is that it was introduced by @foxik in #332, so that UDPipe can represent the original plain text (even without the text comment). I think other treebanks just adopted it (possibly because of using UDPipe for pre-annotation). If you parse a sentence with NBSP with UDPipe, it puts it into SpacesAfter without any escaping:

# generator = UDPipe 2, https://lindat.mff.cuni.cz/services/udpipe
# udpipe_model = english-ewt-ud-2.12-230717
# udpipe_model_licence = CC BY-NC-SA
# newdoc
# newpar
# sent_id = 1
# text = A B
1   A   A   SYM SYM Number=Sing 2   compound    _   SpacesAfter= |TokenRange=0:1
2   B   B   PROPN   NNP Number=Sing 0   root    _   SpaceAfter=No|TokenRange=2:3

(GitHub Markdown changes the NBSP into a normal space, but the character after SpacesAfter= is actually the NBSP.) Validating this file results only in a warning [L2 Warning misc-extra-space]. However, if you run UDPipe without the "Save token ranges" option, there are no TokenRange attributes and the validator also gives an error [L1 Format trailing-whitespace] in addition to the warning. This is bad because users expect UDPipe output to produce no errors with validate.py --level 1.

So either we add an exception to the validator for this use case, or we agree on an escaping of NBSP (e.g. \x00A0). I would strongly suggest consulting the author of UDPipe, @foxik, about this. If \x00A0 is the agreed escaping, it should be documented, and it should be stated explicitly whether any Unicode character can be escaped this way (perhaps except \s (space), \t (TAB), \r (CR), \n (LF), \p (pipe), \\ (backslash), which are already defined and used) or whether it is only \x00A0.
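To make the problem concrete, here is a minimal Python sketch of a check that reports raw (unescaped) whitespace inside a token's SpacesAfter value. It is not how validate.py works; the function name `raw_whitespace_in_spaces_after` is hypothetical, and the escaped spelling `\x00A0` is just the candidate mentioned above, not an agreed convention.

```python
def raw_whitespace_in_spaces_after(conllu_line):
    """Return raw whitespace characters found in the SpacesAfter value of a token line."""
    columns = conllu_line.rstrip("\n").split("\t")
    if len(columns) != 10:
        return []  # comment, blank line, or malformed line: nothing to check
    for attribute in columns[9].split("|"):
        name, _, value = attribute.partition("=")
        if name == "SpacesAfter":
            # str.isspace() is True for NBSP (U+00A0) as well as ASCII whitespace.
            return [ch for ch in value if ch.isspace()]
    return []


# Token line as UDPipe would emit it without token ranges: MISC ends in a raw NBSP.
unescaped = "1\tA\tA\tSYM\tSYM\tNumber=Sing\t2\tcompound\t_\tSpacesAfter=\u00a0\n"
# Same line with the NBSP spelled as a (hypothetical) escape sequence instead.
escaped = "1\tA\tA\tSYM\tSYM\tNumber=Sing\t2\tcompound\t_\tSpacesAfter=\\x00A0\n"

print(raw_whitespace_in_spaces_after(unescaped))  # ['\xa0']
print(raw_whitespace_in_spaces_after(escaped))    # []
```

With the raw NBSP the line ends in a whitespace character, which is what triggers the trailing-whitespace error; with any escaped spelling the line ends in ordinary printable characters.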

dan-zeman commented 1 year ago

The validator allows spaces because there can be transliteration of words with spaces or other attributes whose values are sequences of tokens, but those only need spaces in the middle of the value; that's why leading and trailing spaces are reported.

Perhaps it could be turned into a warning also when there are trailing spaces at the end of the MISC column.

The only characters that really must be escaped are TAB and LF. Also |, but that is not a space character, so it should not occur in SpacesAfter. It would probably be wise to also escape CR and other control characters, although things like form feed or line tabulation probably won't occur. But from codepoint 32 (inclusive) up, they can probably be put there unescaped.

Interestingly, when I print the list of characters that match m/\s/ (which is the regular expression that the validator would use), I don't see the regular NBSP (codepoint 160) there:

9       CHARACTER TABULATION
10      LINE FEED
11      LINE TABULATION
12      FORM FEED
13      CARRIAGE RETURN
32      SPACE
5760    OGHAM SPACE MARK
6158    MONGOLIAN VOWEL SEPARATOR
8192    EN QUAD
8193    EM QUAD
8194    EN SPACE
8195    EM SPACE
8196    THREE-PER-EM SPACE
8197    FOUR-PER-EM SPACE
8198    SIX-PER-EM SPACE
8199    FIGURE SPACE
8200    PUNCTUATION SPACE
8201    THIN SPACE
8202    HAIR SPACE
8232    LINE SEPARATOR
8233    PARAGRAPH SEPARATOR
8239    NARROW NO-BREAK SPACE
8287    MEDIUM MATHEMATICAL SPACE
12288   IDEOGRAPHIC SPACE

But m/\pZ/ gives a different result:

32      SPACE
160     NO-BREAK SPACE
5760    OGHAM SPACE MARK
6158    MONGOLIAN VOWEL SEPARATOR
8192    EN QUAD
8193    EM QUAD
8194    EN SPACE
8195    EM SPACE
8196    THREE-PER-EM SPACE
8197    FOUR-PER-EM SPACE
8198    SIX-PER-EM SPACE
8199    FIGURE SPACE
8200    PUNCTUATION SPACE
8201    THIN SPACE
8202    HAIR SPACE
8232    LINE SEPARATOR
8233    PARAGRAPH SEPARATOR
8239    NARROW NO-BREAK SPACE
8287    MEDIUM MATHEMATICAL SPACE
12288   IDEOGRAPHIC SPACE

rhdunn commented 1 year ago

The \pZ regex is matching the Z Unicode general category [1]. That does not include 9-13 as those are control characters (Cc). The White_Space property [2] includes those, though I'm not sure what regex would match them.

[1] https://www.unicode.org/reports/tr44/#General_Category_Values
[2] https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
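The distinction between the general category and "whitespace as matched by a regex" can be inspected from Python's standard library. This is only an illustrative sketch: the sample characters are picked from the lists above, `describe` is a hypothetical helper, and Python's own `\s` (for str patterns) does treat U+00A0 as whitespace, so its answers need not agree with the first list printed earlier in the thread.

```python
import re
import unicodedata

# A few characters taken from the two lists above.
samples = {
    "CHARACTER TABULATION": "\t",
    "LINE FEED": "\n",
    "SPACE": " ",
    "NO-BREAK SPACE": "\u00a0",
    "LINE SEPARATOR": "\u2028",
    "PARAGRAPH SEPARATOR": "\u2029",
    "IDEOGRAPHIC SPACE": "\u3000",
}


def describe(ch):
    """Report the general category and whether Python's \\s matches the character."""
    category = unicodedata.category(ch)
    return {
        "category": category,                 # e.g. 'Zs', 'Zl', 'Zp', 'Cc'
        "in_Z": category.startswith("Z"),     # what \pZ tests
        "matches_re_s": re.fullmatch(r"\s", ch) is not None,
    }


for name, ch in samples.items():
    info = describe(ch)
    print(f"{ord(ch):>6}  {name:<22} {info['category']}  Z={info['in_Z']}  \\s={info['matches_re_s']}")
```

TAB and LF come out as Cc (so not in Z), while NBSP is Zs; which of them a given `\s` implementation matches depends on the regex engine and its Unicode settings.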

AngledLuffa commented 1 year ago

Any last minute thoughts? (Seeing as how the deadline is tomorrow and we've had this NBSP character in EWT for a few releases now)

Of these choices, I like \x00A0 the most, as it plays nicely with existing tools, is very readable, and probably won't come up that often anyway (most other treebanks have been happy to turn NBSP into a regular space!)

dan-zeman commented 1 year ago

On second thought, I don't think it is wise to allow trailing spaces in CoNLL-U. Some people (including myself) have set up their text editors to automatically remove trailing spaces, which could damage a CoNLL-U file if it can legitimately contain them. One could say that it is my problem and that I should switch that setting back and forth depending on whether I'm editing CoNLL-U or something else, but the fact that spaces are normally invisible speaks in favor of escaping them (if the line seemingly ends with SpacesAfter=, you would have to move your cursor there to see whether the value is an empty string, a space, or several spaces... not to mention distinguishing the different kinds of spaces).
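The damage described here is easy to reproduce. In the sketch below, `strip_trailing_whitespace` is a hypothetical stand-in for an editor's "trim trailing whitespace on save" setting, and the escaped spelling uses the `\x00A0` candidate from earlier in the thread purely as an example.

```python
NBSP = "\u00a0"

# Token line whose MISC column ends in a raw NBSP.
raw_line = "1\tA\tA\tSYM\tSYM\tNumber=Sing\t2\tcompound\t_\tSpacesAfter=" + NBSP
# Same token with the NBSP written as an escape sequence (example spelling only).
escaped_line = "1\tA\tA\tSYM\tSYM\tNumber=Sing\t2\tcompound\t_\tSpacesAfter=\\x00A0"


def strip_trailing_whitespace(line):
    # Stand-in for an editor's trim-on-save; str.rstrip() also removes NBSP.
    return line.rstrip()


print(strip_trailing_whitespace(raw_line).endswith("SpacesAfter="))  # True: the value was lost
print(strip_trailing_whitespace(escaped_line) == escaped_line)       # True: the line is unchanged
```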

foxik commented 1 year ago

Also, the NBSP is not the only problematic character. One could argue that https://www.fileformat.info/info/unicode/char/2028/index.htm can be a similar problem (formally it is allowed anywhere, but some tools will consider it a line break); similarly https://www.fileformat.info/info/unicode/char/2029/index.htm, and any character of the Zs category https://www.fileformat.info/info/unicode/category/Zs/list.htm.

So adopting some universal approach of encoding such characters seems better than designing a special escape for NBSP. However, I would not use \x00A0 (it has complicated semantics in C/C++, and for example in Python only two hex digits are allowed after \x); instead, I would go for \u00A0 with exactly four digits, which is used by JSON, Python, C/C++, ...
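The Python behaviour mentioned here can be checked directly: `\x` consumes exactly two hex digits, so the literal `"\x00A0"` is three characters, while `"\u00A0"` is one. The variable names are only for illustration.

```python
x_form = "\x00A0"  # parsed as \x00 followed by the literal characters "A" and "0"
u_form = "\u00A0"  # parsed as the single character U+00A0

print(len(x_form), [hex(ord(c)) for c in x_form])  # 3 ['0x0', '0x41', '0x30']
print(len(u_form), [hex(ord(c)) for c in u_form])  # 1 ['0xa0']
```

Anyone who pastes a `\x00A0` value into a Python string literal would therefore silently get a NUL character plus "A0" rather than an NBSP.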

rhdunn commented 1 year ago

I think that's ok, as AFAICT all Unicode whitespace characters are in the Basic Multilingual Plane (BMP) so fit into 4 hex digits. Otherwise, the \u... syntax would need to support 4, 6, or 8 digits. Note that unlike e.g. Java which is UTF-16, the CoNLL-U files are UTF-8 so won't support UTF-16 surrogate pairs for characters beyond the BMP.

foxik commented 1 year ago

Yes, for characters beyond the BMP, we could use \Uxxxxxxxx (as in C++ and Python). Alternatively, we could use the JavaScript/Rust syntax \u{x...x}, with up to 6 hex digits. In JavaScript, you can have either exactly four digits after \u, as in \uABCD, or you can include the braces, in which case the length is variable, for example \u{A0}; in Rust, the braces are required. (JSON itself only allows the four-digit form.)

I do not feel strongly about whether to use the C/Python style with \u and \U or the \u{...} style with braces.
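For comparison, here is a small Python sketch that decodes all three spellings under discussion (`\uXXXX`, `\UXXXXXXXX`, and `\u{...}`). None of them is an adopted CoNLL-U convention at this point; the pattern and the function name `decode_codepoint_escapes` are assumptions made for illustration, and the sketch deliberately ignores the existing escapes such as `\s` and `\\`.

```python
import re

# \u{X...} with 1-6 hex digits, \uXXXX with exactly 4, \UXXXXXXXX with exactly 8.
CODEPOINT_ESCAPE = re.compile(
    r"\\u\{([0-9A-Fa-f]{1,6})\}|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})"
)


def decode_codepoint_escapes(value):
    """Replace code point escapes in value with the characters they name."""
    def replace(match):
        # Exactly one of the three groups is set for any match.
        hex_digits = match.group(1) or match.group(2) or match.group(3)
        return chr(int(hex_digits, 16))  # raises ValueError above U+10FFFF
    return CODEPOINT_ESCAPE.sub(replace, value)


for spelling in ("\\u00A0", "\\U000000A0", "\\u{A0}"):
    print(spelling, "->", repr(decode_codepoint_escapes(spelling)))
```

Supporting several spellings at once makes the decoder slightly more complex, which is one argument for picking a single form.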

martinpopel commented 1 year ago

I agree with @foxik's suggestion. Just to make sure: space, tab, CR, LF, pipe and backslash will still be required to be encoded using the legacy escape codes (\s, \t, \r, \n, \p and \\, respectively). Only other non-printable characters will have to be encoded with the \u (or \U or \u{...}) escape codes. Am I right?

foxik commented 1 year ago

Well, I would definitely keep the \s \t \r \n \p \\ escapes (that is common in all languages I know), but I am not sure we would need to enforce that they have to be used.

So maybe saying that the legacy codes are preferred, but \u can be used to encode anything? Or maybe explicitly saying the validator would enforce that \u is used only when necessary, but the format itself supports any character to be encoded using \u?
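One way to read "legacy codes preferred, but \u can encode anything" is sketched below in Python: the encoder emits the legacy codes where they exist and falls back to a four-digit `\u` escape for other whitespace or non-printable characters, while the decoder accepts `\u` for any character. The function names, the choice of the four-digit form, and the decision to raise on characters beyond the BMP are all assumptions, not an agreed specification.

```python
import re

# Legacy escapes already in use for SpacesAfter values.
LEGACY = {" ": "\\s", "\t": "\\t", "\r": "\\r", "\n": "\\n", "|": "\\p", "\\": "\\\\"}
LEGACY_DECODE = {escape: char for char, escape in LEGACY.items()}
# One pass over the value: either \uXXXX or one of the legacy two-character escapes.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\[strnp\\]")


def escape_spaces_after(text):
    """Encode text, preferring legacy codes and using \\uXXXX for other invisible characters."""
    out = []
    for ch in text:
        if ch in LEGACY:
            out.append(LEGACY[ch])
        elif ch.isspace() or not ch.isprintable():
            if ord(ch) > 0xFFFF:
                raise ValueError("four-digit \\u escapes cannot encode characters beyond the BMP")
            out.append(f"\\u{ord(ch):04X}")
        else:
            out.append(ch)
    return "".join(out)


def unescape_spaces_after(value):
    """Decode legacy escapes and \\uXXXX escapes; \\u is accepted for any character."""
    def replace(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))
        return LEGACY_DECODE[match.group(0)]
    return ESCAPE.sub(replace, value)


original = " \u00a0\t\n\u2028"
encoded = escape_spaces_after(original)
print(encoded)  # \s\u00A0\t\n\u2028
print(unescape_spaces_after(encoded) == original)  # True
```

A validator that wanted to enforce "\u only when necessary" could compare a value against `escape_spaces_after(unescape_spaces_after(value))` and warn when they differ.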

dan-zeman commented 1 year ago

The validator could possibly do it, but so far it does not care about the meaning of SpacesAfter, so anything it says about the attribute comes from the generic warnings.