Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.62k stars 986 forks source link

Improve by=<non-ASCII> test to be more robust #6573

Open MichaelChirico opened 1 month ago

MichaelChirico commented 1 month ago
          Hmm, thanks. It still leaves us with the `else` branch here on FreeBSD where this is unanticipated / the behavior of `DT[,.N, by=año]` goes untested there. But let's start with un-breaking the test.

_Originally posted by @MichaelChirico in https://github.com/Rdatatable/data.table/pull/6567#discussion_r1797156053_

As reported in #6559, the test will not run on FreeBSD. We should come up with an example that will test non-ASCII by= on such systems.

MichaelChirico commented 1 month ago

@nunotexbsd, can you help to come up with an example of running by= with a non-ASCII character? We'd like to ensure the behavior in #4708/fix in #4711 works on your system as well.

nunotexbsd commented 1 month ago

Following example on #4708:

> library(data.table)
> dt <- data.table(a=1:3,año=c(2020,2021,2023))
> dt[,sum(a),año]
     año    V1
   <num> <int>
1:  2020     1
2:  2021     2
3:  2023     3
> dt[,,by=año]
       a   año
   <int> <num>
1:     1  2020
2:     2  2021
3:     3  2023
Warning messages:
1: In `[.data.table`(dt, , , by = año) :
  Ignoring by/keyby because 'j' is not supplied
2: In `[.data.table`(dt, , , by = año) :
  i and j are both missing so ignoring the other arguments. This warning will be upgraded to error in future.

Did try variants like month==mês and it works fine too.

I didn't apply this patch yet to check testunit.

EDIT: execute dt[,,by=año]

aitap commented 1 month ago

R requires symbols to exist in the native encoding, so trying to construct a call to dt[,,by=año] in an R session incapable of representing ñ in the native encoding will result in something else. Depending on the exact approach, we'll always get either a?o or a<U+00F1>o in either the name of the variable or the source code of the overall expression.

The real underlying problem of #4708 is that for x of type STRSXP the %chin% operator performs additional conversion to UTF-8, but for x of type SYMSXP it only looks at the PRINTNAME(x) pointer. The original reporter used a pre-UTF-8 build of R (LC_CTYPE=Spanish_Ecuador.1252), so enc2native('año') (from the symbol x) and enc2utf8('año') (from the converted table) gave two different entries in the CHARSXP cache, so pointer comparison failed. Forcing as.character(x) switched Cchin to the "force UTF-8 for strings" logic, making the match succeed.

MichaelChirico commented 1 month ago

Hmm, so @nunotexbsd what happens for this example then?

dt <- data.table(a=1:3,año=c(2020,2021,2023), `a?o` = 4:6)
dt[,sum(a), by=año]
aitap commented 1 month ago

I think that @nunotexbsd is running interactive examples in the "C.UTF-8" locale (where everything works) but unit tests in the "C" locale (where enc2native('año') returns 'a?o').

nunotexbsd commented 1 month ago

Hmm, so @nunotexbsd what happens for this example then?

dt <- data.table(a=1:3,año=c(2020,2021,2023), `a?o` = 4:6)
dt[,sum(a), by=año]
dt <- data.table(a=1:3,año=c(2020,2021,2023), `a?o` = 4:6)
dt[,sum(a), by=año]
año    V1
<num> <int>
1:  2020     1
2:  2021     2
3:  2023     3