JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Should `split(str)` split on Non-breaking spaces? #33833

Open oxinabox opened 5 years ago

oxinabox commented 5 years ago

`split(str)` splits on U+00A0 NO-BREAK SPACE (NBSP):

julia> split("a\u00a0b")
2-element Array{SubString{String},1}:
 "a"
 "b"

Python does this also:

>>> "a\u00a0b".split()
['a', 'b']

Is this the behaviour one wants? By definition, a non-breaking space is all about avoiding line breaks placed in the wrong spot, so it can be inserted during typesetting.

However, I have also seen it used inside single words that happen to contain spaces*. I don't know how common this is.

I have a set of embeddings for which reading was broken because the reader used `split(line)` to break up each line, and the dataset encoded words with spaces (or, in this case, multi-character symbols with spaces) using non-breaking spaces.

My initial instinct was that non-breaking spaces should be treated as part of the words on either side, and thus not be split on. Now I am not so sure.
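To make the failure concrete, here is a minimal sketch, assuming a made-up embeddings line in which the token itself contains a NO-BREAK SPACE (the token and the numbers are hypothetical, not taken from the actual dataset). It contrasts the default `split(line)` with passing U+0020 as an explicit delimiter, which is one possible workaround rather than a recommendation:

```julia
# Hypothetical embeddings line: the token "New\u00a0York" contains U+00A0.
line = "New\u00a0York 0.1 0.2 0.3"

# Default: splits wherever `isspace` is true, which includes U+00A0,
# so the token is cut in two and the field count is off by one.
default_fields = split(line)

# Explicit delimiter: only U+0020 separates fields, so the token stays whole.
# `keepempty=false` mirrors the default behaviour of dropping empty fields.
fields = split(line, ' '; keepempty=false)

token = fields[1]
vector = parse.(Float64, fields[2:end])
```

With the explicit delimiter, `token` keeps its no-break space and `vector` has the three expected components.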

PCRE says that it does break up words:

julia> match(r".\b", "x\u00a0y")
RegexMatch("x")

(* English is a weird language sometimes, since A) those are permitted and B) they are rarely acknowledged. Sometimes you see this in names, e.g. for Diana Wynne Jones, the surname is Wynne Jones. Or Anna Rose Smith can have Rose not as a middle name but as part of a compound first name with a space: Anna Rose.)

mschauer commented 5 years ago

Words can contain spaces, for example the open compounds in English (e.g. "ice cream"). But there is no code point for inner-word spaces. The closest is the word joiner U+2060, but this has no visible length. This relates back to the decision to encode graphemes. Splitting at word boundaries is therefore only possible with a dictionary, so my take is that splitting at whitespace is a sane default.

JeffBezanson commented 5 years ago

Hard to resist applying the "breaking" tag here. But what does it mean in this context? :joy:

oxinabox commented 5 years ago

Since this is breaking, we can go a bit crazier: maybe `split` should default to splitting only on ANSI whitespace.

JeffreySarnoff commented 5 years ago

Maybe where whitespace includes ` `, `\t`, `\n`, `\r`.
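As a rough sketch of what that narrower default could look like in user code today: `split` accepts a collection of characters as its delimiter, so the four characters above can be passed explicitly. The names `ASCII_WHITESPACE` and `split_ascii` are hypothetical helpers, not part of Base, and the exact set is the one suggested in this thread rather than any agreed definition:

```julia
# Hypothetical helper, not part of Base: split only on the four
# characters suggested above, leaving U+00A0 and friends untouched.
const ASCII_WHITESPACE = (' ', '\t', '\n', '\r')

# `keepempty=false` drops empty fields from runs of delimiters or
# leading/trailing whitespace, matching what `split(str)` does by default.
split_ascii(str::AbstractString) = split(str, ASCII_WHITESPACE; keepempty=false)

sample = "a\u00a0b c\td\n"
narrow = split_ascii(sample)   # NBSP stays inside the first field
wide = split(sample)           # current default also splits on NBSP
```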

vtjnash commented 3 years ago

I think this would be U+202F NARROW NO-BREAK SPACE (NNBSP), as the Unicode guide on word splitting (https://www.unicode.org/reports/tr29/#ExtendNumLetWB) says "Do not break from extenders." (NNBSP is the only character listed as both an extender and a space.)

JeffreySarnoff commented 3 years ago

Rereading the thread, my view is "No, split should not conflate U+0020 SPACE and U+00A0 NO-BREAK SPACE." It is helpful to support e.g. two-word first names as a lexical unit. Adopting a visual for NO-BREAK SPACE or U+202F seems useful: U+2420 '␠' or U+2423 '␣' are used to show a space; '␣' has been used to show the spacebar's space.
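A small sketch of the visual idea, assuming one only wants it for debugging output: substitute '␣' for the two no-break spaces before printing. `show_nbsp` is a hypothetical helper, not an existing function, and it does not change how `split` behaves:

```julia
# Hypothetical debugging helper, not part of Base: render U+00A0 and
# U+202F as the visible symbol U+2423 so they stand out from U+0020.
show_nbsp(str::AbstractString) = replace(str, ['\u00a0', '\u202f'] => "␣")

# Hypothetical example name with a no-break space inside the first name.
visible = show_nbsp("Anna\u00a0Rose Smith")
```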