Open MichaelChirico opened 1 month ago
@nunotexbsd, can you help to come up with an example of running by=
with a non-ASCII character? We'd like to ensure the behavior in #4708/fix in #4711 works on your system as well.
Following example on #4708:
> library(data.table)
> dt <- data.table(a=1:3,año=c(2020,2021,2023))
> dt[,sum(a),año]
año V1
<num> <int>
1: 2020 1
2: 2021 2
3: 2023 3
> dt[,,by=año]
a año
<int> <num>
1: 1 2020
2: 2 2021
3: 3 2023
Warning messages:
1: In `[.data.table`(dt, , , by = año) :
Ignoring by/keyby because 'j' is not supplied
2: In `[.data.table`(dt, , , by = año) :
i and j are both missing so ignoring the other arguments. This warning will be upgraded to error in future.
Did try variants like month==mês and it works fine too.
I didn't apply this patch yet to check testunit.
EDIT: execute dt[,,by=año]
R requires symbols to exist in the native encoding, so trying to construct a call to dt[,,by=año]
in an R session incapable of representing ñ in the native encoding will result in something else. Depending on the exact approach, we'll always get either a?o
or a<U+00F1>o
in either the name of the variable or the source code of the overall expression.
The real underlying problem of #4708 is that for x
of type STRSXP
the %chin%
operator performs additional conversion to UTF-8, but for x
of type SYMSXP
it only looks at the PRINTNAME(x)
pointer. The original reporter used a pre-UTF-8 build of R (LC_CTYPE=Spanish_Ecuador.1252
), so enc2native('año')
(from the symbol x
) and enc2utf8('año')
(from the converted table
) gave two different entries in the CHARSXP
cache, so pointer comparison failed. Forcing as.character(x)
switched Cchin
to the "force UTF-8 for strings" logic, making the match succeed.
Hmm, so @nunotexbsd what happens for this example then?
dt <- data.table(a=1:3,año=c(2020,2021,2023), `a?o` = 4:6)
dt[,sum(a), by=año]
I think that @nunotexbsd is running interactive examples in the "C.UTF-8" locale (where everything works) but unit tests in the "C" locale (where enc2native('año') returns 'a?o').
Hmm, so @nunotexbsd what happens for this example then?
dt <- data.table(a=1:3,año=c(2020,2021,2023), `a?o` = 4:6) dt[,sum(a), by=año]
dt <- data.table(a=1:3,año=c(2020,2021,2023), `a?o` = 4:6) dt[,sum(a), by=año] año V1 <num> <int> 1: 2020 1 2: 2021 2 3: 2023 3
_Originally posted by @MichaelChirico in https://github.com/Rdatatable/data.table/pull/6567#discussion_r1797156053_
As reported in #6559, the test will not run on FreeBSD. We should come up with an example that will test non-ASCII
by=
on such systems.