gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/

Whitespace in set operators is not properly ignored #329

Closed huftis closed 6 years ago

huftis commented 6 years ago

According to my reading of ?`stringi-search-charclass` and http://userguide.icu-project.org/strings/unicodeset, the expressions "[[:letter:]-[a-z]]+" and "[[:letter:] - [a-z]]+" should be equivalent (and the ICU user guide uses the variant with spaces in several examples). However, they behave differently:

library(stringi)
x = "This is a     æøå-test\twith 3,14-7.89 number BLÅbærsyltetøy."
stri_extract_all(x, regex = "[[:letter:]-[a-z]]+")
#> [[1]]
#> [1] "T"   "æøå" "BLÅ" "æ"   "ø"
stri_extract_all(x, regex = "[[:letter:] - [a-z]]+")
#> [[1]]
#> [1] "This is a     æøå"      "test"                  
#> [3] "with "                  " number BLÅbærsyltetøy"

From my understanding, it's the variant without spaces that works correctly.

gagolews commented 6 years ago

Yup, it's the notation with no spaces that is correct. The ICU manual is confusing; I guess they used the spaces in the examples as decoration, to improve readability.

Consider filing a bug report at http://site.icu-project.org/bugs
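
For anyone who wants to check this on their own installation, here is a minimal sketch that probes single characters against both patterns. The sample string `s` and the variable names are made up for illustration. The explanation in the comments (that " - " is read as a range from a space to a space, so the set becomes a union rather than a difference) is only a guess that fits the output reported above, not something confirmed by the ICU documentation.

```r
library(stringi)

# Hypothetical sample string, chosen only for illustration
s <- "abc DEF-ghi"

no_spaces   <- "[[:letter:]-[a-z]]+"    # set difference: letters except a-z
with_spaces <- "[[:letter:] - [a-z]]+"  # spaces are taken literally here

res_no_spaces   <- stri_extract_all_regex(s, no_spaces)[[1]]
res_with_spaces <- stri_extract_all_regex(s, with_spaces)[[1]]
print(res_no_spaces)
print(res_with_spaces)

# Probe individual characters to see what each set actually contains.
# Assumption: " - " acts as a one-character range (space to space), so the
# second set is letters + space + [a-z], and "-" itself is not a member.
probe <- c("a", "D", " ", "-")
in_no_spaces   <- stri_detect_regex(probe, "[[:letter:]-[a-z]]")
in_with_spaces <- stri_detect_regex(probe, "[[:letter:] - [a-z]]")
print(setNames(in_no_spaces, probe))
print(setNames(in_with_spaces, probe))
```

If the guess is right, the first pattern should match only the uppercase run, while the second should match lowercase letters and spaces as well but still split at the hyphen, which is the same pattern seen in the original report.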