Open oxinabox opened 5 years ago
Words can contain spaces, for example the open compounds in English (e.g. "ice cream"). But there is no code point for an inner-word space. The closest is the word joiner U+2060, but this has no visible width. This relates back to the decision to encode graphemes. Splitting at word boundaries is therefore only possible with a dictionary, so my take is that splitting at whitespace is a sane default.
Hard to resist applying the "breaking" tag here. But what does it mean in this context? :joy:
Since this is breaking we can go a bit crazier: maybe `split` should default to splitting only on ANSI white-space, where white-space includes ` `, `\t`, `\n`, `\r`.
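A minimal sketch of what that narrower default could look like, written in Python only for illustration. It reads "ANSI white-space" as exactly the four characters listed above; `split_ascii_ws` is a hypothetical name, not an existing function in any library.

```python
import re

# Assumption: the split set is exactly space, tab, newline, carriage return.
_ASCII_WS = " \t\n\r"
_ASCII_WS_RUN = re.compile(r"[ \t\n\r]+")


def split_ascii_ws(text: str) -> list[str]:
    """Split on runs of the four characters above, and on nothing else."""
    stripped = text.strip(_ASCII_WS)  # drop leading/trailing separators
    if not stripped:
        return []
    return _ASCII_WS_RUN.split(stripped)


# U+00A0 NO-BREAK SPACE stays inside the token; tab and CRLF still separate.
parts = split_ascii_ws("  Anna\u00a0Rose\tSmith\r\n")
```

Under this definition U+00A0 and U+202F are ordinary characters, so a token such as "Anna Rose" joined by a no-break space survives as one unit.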
I think this would be U+202F NARROW NO-BREAK SPACE (NNBSP), as the Unicode guide on word splitting (https://www.unicode.org/reports/tr29/#ExtendNumLetWB) says "Do not break from extenders." (NNBSP is the only character listed as both an extender and a space.)
Rereading the thread, my view is "No, `split` should not conflate U+0020 SPACE and U+00A0 NO-BREAK SPACE." It is helpful to allow support for e.g. two-word first names as a lexical unit.
Adopting a visual for NO-BREAK SPACE or U+202F seems useful: U+2420 '␠' or U+2423 '␣' are used to show a space; '␣' has been used to show the spacebar's space.
`split(str)` splits on U+00A0 NO-BREAK SPACE (NBSP). Python does this also.
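For reference, the Python side can be checked with a few lines. This is only an illustration of Python's `str.split()` with no argument, which splits on every character that `str.isspace()` accepts; the sample string is made up.

```python
# Characters discussed in this thread.
SPACE = "\u0020"
NBSP = "\u00a0"         # NO-BREAK SPACE
NNBSP = "\u202f"        # NARROW NO-BREAK SPACE
WORD_JOINER = "\u2060"  # zero-width format character

text = f"ice{NBSP}cream cone"

# No-argument split treats NBSP like any other whitespace.
default_parts = text.split()

# Splitting on U+0020 explicitly keeps the NBSP-joined compound together.
space_only_parts = text.split(SPACE)

print(default_parts)
print(space_only_parts)
print(NBSP.isspace(), NNBSP.isspace(), WORD_JOINER.isspace())
```

The last line shows that both no-break spaces count as whitespace for Python, while the word joiner does not.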
Is this the behaviour one wants? By the definition of a non-breaking space, it is all about preventing line breaks from being placed in the wrong spot, so it can be inserted during typesetting.
However, I have also seen it used within single words that happen to contain spaces*. I don't know how common this is.
I have a set of embeddings for which reading was broken because it was using `split(line)` to break up the line, and the dataset was encoding words with spaces (or in this case multi-character symbols with spaces) using non-breaking spaces.
My initial instinct was that non-breaking spaces should be treated as part of the words to either side, and thus not split on either side. Now I am not so sure.
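To make the failure concrete, here is a sketch of a reader that avoids it, written in Python for illustration. It assumes a hypothetical file layout of one token followed by its vector components, separated by U+0020; the actual dataset's format is not given in this issue, and `parse_embedding_line` is an invented name.

```python
def parse_embedding_line(line: str) -> tuple[str, list[float]]:
    """Parse 'token v1 v2 ...' where only U+0020 separates fields.

    Assumption: the token may contain U+00A0, which must not be a separator.
    """
    token, *fields = line.rstrip("\r\n").split(" ")
    # Skip empty fields produced by doubled spaces.
    return token, [float(x) for x in fields if x]


line = "New\u00a0York 0.25 -1.5\n"

# Whitespace-based splitting tears the token apart: 4 fields instead of 3,
# and the second field is not a number.
naive_fields = line.split()

token, vector = parse_embedding_line(line)
```

With the whitespace-based split the second field is "York", so converting it to a float fails. Splitting on U+0020 alone keeps the token intact.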
PCRE says that it does break up words:
(* English is a weird language sometimes, since A) those are permitted, and B) they are rarely acknowledged. Sometimes you see that in names, e.g. for Diana Wynne Jones, the surname is Wynne Jones. Or `Anna Rose Smith` can have `Rose` not as the middle name but as part of a compound first name with a space: `Anna Rose`.)