Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Non-ascii symbols on systems without UTF-8 #6339

Closed ben-schwen closed 2 weeks ago

ben-schwen commented 2 months ago

The test for #4711 does not work on systems without UTF-8 encoding, e.g. our test-lin-rel-vanilla container.

Output from a fresh container started from the image registry.gitlab.com/jangorecki/dockerfiles/r-base-gcc:

DT = data.table(a = rep(1:3, 2))
setnames(DT, "a", "a\U00F1o")
DT[ , .N, 'a<U+00F1>o']
#>    a<U+00F1>o     N
#>         <int> <int>
#> 1:          1     2
#> 2:          2     2
#> 3:          3     2
#> Warning message:
#> In eval(bysub, x, parent.frame()) :
#>   unable to translate 'a<U+00F1>o' to native encoding
DT[ , .N, a<U+00F1>o]
#> Error: unexpected symbol in "DT[ , .N, a<U+00F1"

aitap commented 2 months ago

Bare variable names (symbols) are required to be in the native encoding. On systems incapable of representing ñ in the native encoding (LC_ALL=C, or, e.g., KOI8-R), there is no way to preserve an ñ in a variable name.

On non-UTF-8 systems that can represent ñ in the native encoding, the code will work fine:

$ LC_ALL=en_GB.ISO-8859-15 luit R -q -s -e 'as.name("\uf1"); parse(text = "DT[, .N, a\U00F1o]$N[1L]")'
ñ
expression(DT[, .N, año]$N[1L])

If there is no ñ in the current locale, translateChar(), called internally by parse(), substitutes some text and you get a syntax error, but iconv() seems to help:

# this works
LC_ALL=en_GB.ISO-8859-15 luit R -q -s -e 'text <- iconv("DT[, .N, a\U00F1o]$N[1L]", "UTF-8", ""); if (!is.na(text)) parse(text = text)'
# expression(DT[, .N, año]$N[1L])

# this doesn't crash
LC_ALL=C R -q -s -e 'text <- iconv("DT[, .N, a\U00F1o]$N[1L]", "UTF-8", ""); if (!is.na(text)) parse(text = text)'

MichaelChirico commented 2 months ago

Thanks @aitap. What's luit?

iconv() looks as good a solution as any -- definitely good to still run those tests on non-UTF-8 systems, rather than just skip if parsing fails.

aitap commented 2 months ago

luit converts between the UTF-8 terminal session and the non-UTF-8 encoding used by its child process.

aitap commented 2 weeks ago

#6559 demonstrates that we cannot rely on iconv() to return NA if conversion fails: on FreeBSD we instead get a?o.
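
One possible way to avoid depending on iconv() returning NA is a round-trip check: convert the UTF-8 text to the native encoding, convert it back, and compare with the original. This is only a sketch under that assumption; survives_native_roundtrip() and parse_if_representable() are hypothetical helper names, not part of data.table or its test suite.

```r
# Hypothetical helper (not part of data.table): TRUE when a UTF-8 string
# can be converted to the native encoding and back without change.
survives_native_roundtrip <- function(x) {
  native <- iconv(x, from = "UTF-8", to = "")
  if (anyNA(native)) return(FALSE)  # failure reported as NA
  back <- iconv(native, from = "", to = "UTF-8")
  # A silent substitution (e.g. "a?o") shows up as a mismatch here.
  !anyNA(back) && all(back == x)
}

# Hypothetical wrapper: parse only when the text is representable natively;
# otherwise return NULL so the caller can decide what to do.
parse_if_representable <- function(text) {
  if (!survives_native_roundtrip(text)) return(NULL)
  parse(text = iconv(text, from = "UTF-8", to = ""))
}

expr <- parse_if_representable("DT[, .N, a\u00F1o]$N[1L]")
if (is.null(expr)) message("symbol is not representable in this locale")
```

With this, a platform that substitutes characters instead of returning NA would be treated as a failed conversion, at the cost of a second iconv() call.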