Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Test C locale breaks test suite #6564

Open TysonStanley opened 1 month ago

TysonStanley commented 1 month ago

Found during the release process; something to look at after the patch release (I don't believe it should block the current patch release):

# Test C locale doesn't break test suite (#2771)
echo LC_ALL=C > ~/.Renviron
R
Sys.getlocale()=="C"
q("no")
R CMD check data.table_1.16.2.tar.gz
rm ~/.Renviron

This makes test.data.table() fail on macOS (Apple Silicon) at test 2194.7:

**** Full long double accuracy is not available. Tests using this will be skipped.

Running test id 2194.7
Test 2194.7 produced 0 errors but expected 1
Expected: Internal error.*types or lengths incorrect

aitap commented 1 month ago

Can also be reproduced on amd64 Linux (although multiple other tests also break due to <U+????> substitutions in conversions from UTF-8 to native encoding):

.libPaths(c('data.table.Rcheck', .libPaths()))
library(data.table)
trace(data.table:::endsWithAny, quote(if(identical(y, 'B')) browser())) # test 2194.7 compares with 'B'
test.data.table()
# same as 'data.table/inst/tests/issue_563_fread.txt'
Browse[1]> readLines(parent.frame(8)$env$testDir('issue_563_fread.txt'))
[1] "A,B"
Browse[1]> c
# later, at top level again
> readLines('inst/tests/issue_563_fread.txt')
[1] "A,B"               "\304\205,\305\276" "\305\253,\304\257"
[4] "\305\263,\304\227" "\305\241,\304\231"

Rconn_fgetc returns EOF after the first line because the connection is set to decode from UTF-8 into the native encoding, and iconv() fails to convert the non-ASCII characters. This comes from file(encoding = getOption("encoding")), and test.data.table does set that option to UTF-8: https://github.com/Rdatatable/data.table/blob/bb9faf65caf0ca366aa49c70b7dfb9e091108fe6/R/test.data.table.R#L92-L94

When given a file path, readLines has no way around calling file() with the default encoding=, so tests.Rraw will have to either manually open the file with a different encoding (in which the contents will be invalid!) or construct a different string to pass to endsWithAny. In particular, ?file recommends creating an unopened connection marked as UTF-8 (file(open = '', encoding = 'UTF-8')) and giving it to readLines in order to read UTF-8 in an R session incapable of representing UTF-8 natively:

# context: options(encoding = 'UTF-8'), LC_ALL=C
con <- file('inst/tests/issue_563_fread.txt', open = '')
readLines(con)
# [1] "A,B"           "<U+0105>,<U+017E>" "<U+016B>,<U+012F>" "<U+0173>,<U+0117>"
# [5] "<U+0161>,<U+0119>"
close(con)

Unfortunately, readLines won't do this by itself when given a path: it calls file(open = 'r'), which initialises the UTF-8 → ASCII conversion and breaks.
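
For illustration, the unopened-connection approach could be wrapped roughly as below. This is only a sketch, not what tests.Rraw does: read_utf8_lines is a hypothetical helper name, and the fixture is a two-line stand-in built from the first bytes of issue_563_fread.txt shown above. It passes encoding = "UTF-8" explicitly instead of relying on options(encoding = ).

```r
# Hypothetical helper (not part of data.table): hand readLines() an *unopened*
# connection marked as UTF-8, so that readLines() opens it itself.
read_utf8_lines <- function(path) {
  con <- file(path, open = "", encoding = "UTF-8")
  on.exit(close(con))  # destroy the connection once reading is done
  readLines(con)
}

# Stand-in fixture written as raw bytes so this script stays pure ASCII:
# "A,B" followed by the bytes \304\205,\305\276 from the readLines output above.
tmp <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("A,B\n"), as.raw(c(0xC4, 0x85, 0x2C, 0xC5, 0xBE, 0x0A))),
  tmp
)

lines <- read_utf8_lines(tmp)
length(lines)  # both lines are read instead of stopping after "A,B"
unlink(tmp)
```

How the non-ASCII line prints depends on the session locale (under LC_ALL=C it would presumably show up as <U+....> escapes, as in the output above), so the sketch only checks how many lines come back.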