kbenoit closed 3 years ago
I looked into the issue a bit more and noticed that NNBSP is mentioned on the ICU 63 download page:
http://site.icu-project.org/download/63
The French grouping separator changed from no-break space U+00A0 to narrow no-break space U+202F.
There are different ways to group digits, and many central and eastern European countries use spaces instead of a comma or period:
https://en.wikipedia.org/wiki/Decimal_separator
However, I don't see any change in the segmentation with different locales.
> stringi::stri_split_boundaries("x\u202Fy", type = "word", locale="fr_FR")
[[1]]
[1] "x y"
> stringi::stri_split_boundaries("x\u202Fy", type = "word", locale="en_US")
[[1]]
[1] "x y"
This seems weird; I'd expect a no-break on both the No-Break Space (NBSP) and the Narrow No-Break Space (NNBSP).
stringi relies on ICU for word-boundary analysis, so maybe you should file a bug report at https://unicode-org.atlassian.net/secure/Dashboard.jspa ?
(thread inactive for >= 12 months; closing)
\u202F is a Narrow No-Break Space, and unlike all other space characters in https://www.fileformat.info/info/unicode/category/Zs/list.htm,
stri_split_boundaries(x, type = "word")
does not split on this type of space. (It's also the type of space you can produce by accident in RStudio's editor by typing alt/option-space.) However, it does split on it if it is followed by a non-word character.
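For anyone who wants to check this on their own setup, here is a rough sketch that tries every code point in the Zs category between two letters and reports whether stri_split_boundaries() produces more than one piece. The helper name splits_on and the "x<space>y" probe string are my own choices for illustration, and the result for individual characters may depend on the ICU version stringi was built against.

```r
library(stringi)

# Code points in Unicode general category Zs (space separators).
zs <- as.integer(c(0x0020, 0x00A0, 0x1680, 0x2000:0x200A,
                   0x202F, 0x205F, 0x3000))

# Hypothetical helper: TRUE if the word-boundary splitter breaks
# "x<space>y" into more than one piece for the given code point.
splits_on <- function(cp) {
  s <- paste0("x", intToUtf8(cp), "y")
  pieces <- stri_split_boundaries(s, type = "word")[[1]]
  length(pieces) > 1
}

res <- vapply(zs, splits_on, logical(1))
names(res) <- sprintf("U+%04X", zs)

# Named logical vector, one entry per Zs code point.
print(res)

# ICU version in use, since the boundary rules come from ICU.
print(stri_info()$ICU.version)
```

Any entry that comes back FALSE is a space that the word-boundary analysis keeps attached to its neighbours.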
system information: