gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/

`stri_trim_both()` has unexpected side effects when applied to strings containing special characters #416

Closed (kinto-b closed this issue 3 years ago)

kinto-b commented 3 years ago
x <- c("xáx", "xöx", "xÉx", "xxáxx", "xxöxx", "xxÉxx")
y <- x
x <- stringi::stri_trim_both(x) 

# Strings appear to be identical
identical(x, y)
#> [1] TRUE

# But `gsub()` reveals a difference
identical(
    gsub("[[:lower:]]+", "", x, perl = TRUE),
    gsub("[[:lower:]]+", "", y, perl = TRUE)
)
#> [1] FALSE

# Something hinky is going on behind the scenes:
iconv(x)
#> [1] "xáx"   "xöx"   "xÉx"   "xxáxx" "xxöxx" "xxÉxx"
iconv(y)
#> [1] "xáx"   "xöx"   "xÉx"   "xxáxx" "xxöxx" "xxÉxx"

Created on 2021-04-09 by the reprex package (v1.0.0)

I think the issue might be related to R 4, as a colleague who had yet to update did not encounter it.

R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

stringi::stri_info()

$Unicode.version
[1] "10.0"

$ICU.version
[1] "61.1"

$Locale
$Locale$Language
[1] "en"

$Locale$Country
[1] "AU"

$Locale$Variant
[1] ""

$Locale$Name
[1] "en_AU"

$Charset.internal
[1] "UTF-8"  "UTF-16"

$Charset.native
$Charset.native$Name.friendly
[1] "windows-1252"

$Charset.native$Name.ICU
[1] "ibm-5348_P100-1997"

$Charset.native$Name.UTR22
[1] "ibm-5348_P100-1997"

$Charset.native$Name.IBM
[1] "ibm-5348"

$Charset.native$Name.WINDOWS
[1] "windows-1252"

$Charset.native$Name.JAVA
[1] "windows-1252"

$Charset.native$Name.IANA
[1] "windows-1252"

$Charset.native$Name.MIME
[1] NA

$Charset.native$ASCII.subset
[1] TRUE

$Charset.native$Unicode.1to1
[1] TRUE

$Charset.native$CharSize.8bit
[1] TRUE

$Charset.native$CharSize.min
[1] 1

$Charset.native$CharSize.max
[1] 1

$ICU.system
[1] FALSE

$ICU.UTF8
[1] FALSE

gagolews commented 3 years ago

All functions in stringi convert their outputs to UTF-8.

Bytewise, x and y are not identical, because you are probably working in a non-UTF-8 native locale (refer to stringi::stri_info(FALSE)).

I would say this is rather a problem with the base R functions (see the draft of a paper on stringi, https://stringi.gagolewski.com/_static/vignette/stringi.pdf, for more details).

Also, consider calling iconv(x, "", "utf-8")?
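
The bytewise difference described above can be inspected directly. This is a minimal sketch rather than the reporter's exact session: it builds the Latin-1 input from explicit bytes so that it behaves the same regardless of the native locale, and it assumes that base R's enc2utf8() is an adequate stand-in for the UTF-8 conversion that stringi applies to its output.

```r
# Build a Latin-1 string from explicit bytes so the sketch does not depend on
# the session's native encoding (0xE1 is "á" in Latin-1 / windows-1252).
y <- "x\xe1x"
Encoding(y) <- "latin1"

# Assumption: enc2utf8() stands in for the conversion to UTF-8 that stringi
# functions such as stri_trim_both() perform on their results.
x <- enc2utf8(y)

Encoding(y)      # "latin1"
Encoding(x)      # "UTF-8"
charToRaw(y)     # 78 e1 78
charToRaw(x)     # 78 c3 a1 78

# identical() and == translate differently marked strings before comparing,
# so both report equality even though the underlying bytes differ.
identical(x, y)  # TRUE
x == y           # TRUE
```

The two vectors compare equal as text, but any function that works on the raw bytes or on the declared encoding can still treat them differently.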

kinto-b commented 3 years ago

I see what you mean:

x <- c("xáx", "xöx", "xÉx", "xxáxx", "xxöxx", "xxÉxx")
y <- x
x <- stringi::stri_trim_both(x) 

identical(
    iconv(x, "utf-8", "utf-8"),
    iconv(y, from = "ISO-8859-1", to = "utf-8")
)
#> [1] TRUE

> I would say this is rather a problem with the base R functions

Fair enough. I suppose the solution for me is to avoid mixing base string manipulation functions with stringi functions or else to be explicit about the encoding.

Thanks!
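
One way to be explicit about the encoding, as discussed above, is to convert the base R side to UTF-8 before calling gsub(). The following is only a sketch under the same assumptions as the thread (Latin-1 input, UTF-8 output from stringi); enc2utf8() is used here in place of the stringi call so that the snippet runs without the package, and the name y_utf8 is introduced purely for illustration.

```r
# Latin-1 input built from explicit bytes: "xáx" and "xxöxx".
y <- c("x\xe1x", "xx\xf6xx")
Encoding(y) <- "latin1"

# Assumption: enc2utf8() stands in for a stringi call such as
# stringi::stri_trim_both(y), which returns UTF-8 strings.
x <- enc2utf8(y)

# Convert the untouched copy to UTF-8 as well before pattern matching;
# iconv(y, from = "latin1", to = "UTF-8") would be an alternative.
y_utf8 <- enc2utf8(y)

res_x <- gsub("[[:lower:]]+", "", x, perl = TRUE)
res_y <- gsub("[[:lower:]]+", "", y_utf8, perl = TRUE)

# Both inputs now carry the same bytes and the same encoding mark,
# so the regex engine sees the same data in both calls.
identical(res_x, res_y)  # TRUE
```

Once both vectors are in UTF-8, the regex results can no longer diverge because of the encoding mark alone.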

gagolews commented 3 years ago

Exactly: the stringi functions are meant to serve as replacements (with fixes) for the base ones.

PS You can also test with all(x == y).