gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/

stri_split_boundaries() does not split on Narrow No-Break Space #377

Closed kbenoit closed 3 years ago

kbenoit commented 4 years ago

\u202F is a Narrow No-Break Space (NNBSP). Unlike every other space character listed in https://www.fileformat.info/info/unicode/category/Zs/list.htm, it is not treated as a split point by stri_split_boundaries(x, type = "word"). (It is also the space you can produce by accident in RStudio's editor by typing alt/option-space.)

However, a split does occur after the NNBSP when it is followed by a non-word character.

txt <- c(
  "one\u0020two", # space
  "one\u00A0two", # No-Break Space (NBSP)
  "one\u1680two", # Ogham Space Mark
  "one\u2000two", # En Quad
  "one\u2001two", # Em Quad
  "one\u2002two", # En Space
  "one\u2003two", # Em Space
  "one\u2003two", # Em Space (repeated entry)
  "one\u2003two", # Em Space (repeated entry)
  "one\u2004two", # Three-Per-Em Space
  "one\u2005two", # Four-Per-Em Space
  "one\u2006two", # Six-Per-Em Space
  "one\u2007two", # Figure Space
  "one\u2008two", # Punctuation Space
  "one\u2009two", # Thin Space
  "one\u200Atwo", # Hair Space
  "one\u202Ftwo", # Narrow No-Break Space (NNBSP)
  "one\u205Ftwo", # Medium Mathematical Space (MMSP)
  "one\u3000two" # Ideographic Space
)

library("stringi")

stri_split_boundaries(txt, type = "word")
## [[1]]
## [1] "one" " "   "two"
## 
## [[2]]
## [1] "one" " "   "two"
## 
## [[3]]
## [1] "one" " "   "two"
## 
## [[4]]
## [1] "one" " "   "two"
## 
## [[5]]
## [1] "one" " "   "two"
## 
## [[6]]
## [1] "one" " "   "two"
## 
## [[7]]
## [1] "one" " "   "two"
## 
## [[8]]
## [1] "one" " "   "two"
## 
## [[9]]
## [1] "one" " "   "two"
## 
## [[10]]
## [1] "one" " "   "two"
## 
## [[11]]
## [1] "one" " "   "two"
## 
## [[12]]
## [1] "one" " "   "two"
## 
## [[13]]
## [1] "one" " "   "two"
## 
## [[14]]
## [1] "one" " "   "two"
## 
## [[15]]
## [1] "one" " "   "two"
## 
## [[16]]
## [1] "one" " "   "two"
## 
## [[17]]
## [1] "one two"
## 
## [[18]]
## [1] "one" " "   "two"
## 
## [[19]]
## [1] "one" " "  "two"

stri_split_boundaries("one\u202F@two", type = "word")
## [[1]]
## [1] "one " "@"    "two"

System information:

packageVersion("stringi")
## [1] '1.4.6'
stringi::stri_info()
## $Unicode.version
## [1] "10.0"
## 
## $ICU.version
## [1] "61.1"
## 
## $Locale
## $Locale$Language
## [1] "en"
## 
## $Locale$Country
## [1] "GB"
## 
## $Locale$Variant
## [1] ""
## 
## $Locale$Name
## [1] "en_GB"
## 
## 
## $Charset.internal
## [1] "UTF-8"  "UTF-16"
## 
## $Charset.native
## $Charset.native$Name.friendly
## [1] "UTF-8"
## 
## $Charset.native$Name.ICU
## [1] "UTF-8"
## 
## $Charset.native$Name.UTR22
## [1] NA
## 
## $Charset.native$Name.IBM
## [1] "ibm-1208"
## 
## $Charset.native$Name.WINDOWS
## [1] "windows-65001"
## 
## $Charset.native$Name.JAVA
## [1] "UTF-8"
## 
## $Charset.native$Name.IANA
## [1] "UTF-8"
## 
## $Charset.native$Name.MIME
## [1] "UTF-8"
## 
## $Charset.native$ASCII.subset
## [1] TRUE
## 
## $Charset.native$Unicode.1to1
## [1] NA
## 
## $Charset.native$CharSize.8bit
## [1] FALSE
## 
## $Charset.native$CharSize.min
## [1] 1
## 
## $Charset.native$CharSize.max
## [1] 3
## 
## 
## $ICU.system
## [1] FALSE
## 
## $ICU.UTF8
## [1] FALSE

koheiw commented 4 years ago

I looked into the issue a bit more and noticed that NNBSP is mentioned in the ICU 63 download notes:

http://site.icu-project.org/download/63

The French grouping separator changed from no-break space U+00A0 to narrow no-break space U+202F.

There are different ways to group digits, and many central and eastern European countries use spaces instead of commas or periods.

https://en.wikipedia.org/wiki/Decimal_separator

However, I don't see any change in the segmentation with different locales.

> stringi::stri_split_boundaries("x\u202Fy", type = "word", locale="fr_FR")
[[1]]
[1] "x y"

> stringi::stri_split_boundaries("x\u202Fy", type = "word", locale="en_US")
[[1]]
[1] "x y"

gagolews commented 4 years ago

This seems weird; I'd expect no break on either the No-Break Space (NBSP) or the Narrow No-Break Space (NNBSP).

stringi relies on ICU for word-boundary analysis, so maybe you should file a bug report at https://unicode-org.atlassian.net/secure/Dashboard.jspa ?
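
Until the behaviour is settled upstream, a possible workaround is to normalise the NNBSP before splitting. The snippet below is only a minimal sketch under that assumption, not an official fix: the variable names x and x_norm are illustrative, and mapping U+202F to an ordinary space is a choice made here for demonstration. A plausible explanation for the original result, which nobody in this thread has confirmed, is that UAX #29 places U+202F in the ExtendNumLet word-break class, so it attaches to adjacent letters much as an underscore does.

```r
library("stringi")

x <- "one\u202Ftwo"

# Workaround sketch (an assumption, not an official fix): map NNBSP (U+202F)
# to an ordinary space before asking ICU for word boundaries.
x_norm <- stri_replace_all_fixed(x, "\u202F", " ")
stri_split_boundaries(x_norm, type = "word")

# Alternative: split directly on any Unicode space separator (category Zs).
# Unlike the call above, this drops the separators from the result.
stri_split_regex(x, "\\p{Zs}")
```

The first form keeps the separator as its own token, matching the output reported for the other space characters; the second is simpler when only the words are needed.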

gagolews commented 3 years ago

(thread inactive for >= 12 months; closing)