Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

encoding-related test failures on Alpine Linux #6350

Open bastistician opened 3 months ago

bastistician commented 3 months ago

(Only reporting now, seeing that data.table is being developed again.)

Checking released data.table 1.15.4, my Alpine Linux server gives

Error: 3 error(s) out of 11070. Search tests/tests.Rraw.bz2 for test number(s) 1590.05, 1590.06, 1997.14. Duration: 34.4s elapsed (34.9s cpu).

but at this point it is probably more useful to look at the development version of data.table.

So in a vanilla Alpine Linux container,

docker run --rm -it alpine

running

export TZ=UTC
apk add R R-dev R-doc
## get data.table (devel) and suggested packages
R -s -e 'install.packages("data.table", repos = "https://rdatatable.gitlab.io/data.table", dependencies = TRUE, destdir = "/tmp")'
export _R_CHECK_TESTS_NLINES_=0
R CMD check --extra-arch /tmp/data.table_*.tar.gz

gives only 2 failures for test numbers 1590.05 and 1590.06:

Error in test.data.table()

```
* using R version 4.4.0 (2024-04-24)
* using platform: x86_64-pc-linux-musl
* R was compiled by
gcc (Alpine 13.2.1_git20240309) 13.2.1 20240309
GNU Fortran (Alpine 13.2.1_git20240309) 13.2.1 20240309
* running under: Alpine Linux v3.20
* using session charset: UTF-8
[...]
Running the tests in ‘tests/main.R’ failed.
Complete output:
> require(data.table)
Loading required package: data.table
>
> test.data.table() # runs the main test suite of 5,000+ tests in /inst/tests/tests.Rraw
getDTthreads(verbose=TRUE):
OpenMP version (_OPENMP) 201511
omp_get_num_procs() 12
R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
R_DATATABLE_NUM_THREADS unset
R_DATATABLE_THROTTLE unset (default 1024)
omp_get_thread_limit() 2147483647
omp_get_max_threads() 12
OMP_THREAD_LIMIT unset
OMP_NUM_THREADS unset
RestoreAfterFork true
data.table is using 6 threads with throttle==1024. See ?setDTthreads.
test.data.table() running: //data.table.Rcheck/data.table/tests/tests.Rraw
Test 1590.05 ran without errors but failed check that x equals y:
> x = x1 != x2
First 1 of 1 (type 'logical'):
[1] FALSE
> y = TRUE
First 1 of 1 (type 'logical'):
[1] TRUE
1 element mismatch
Test 1590.06 ran without errors but failed check that x equals y:
> x = forderv(c(x2, x1, x1, x2))
First 0 of 0 (type 'integer'):
integer(0)
> y = INT(1, 4, 2, 3)
First 4 of 4 (type 'integer'):
[1] 1 4 2 3
Numeric: lengths (0, 4) differ
Unloading package bit64

Sat Aug 3 13:25:45 2024 endian==little, sizeof(long double)==16, longdouble.digits==64, sizeof(pointer)==8, TZ=='UTC', Sys.timezone()=='UTC', Sys.getlocale()=='C.UTF-8;C;C;C;C;C', l10n_info()=='MBCS=TRUE; UTF-8=TRUE; Latin-1=FALSE; codeset=UTF-8', getDTthreads()=='OpenMP version (_OPENMP)==201511; omp_get_num_procs()==12; R_DATATABLE_NUM_PROCS_PERCENT==unset (default 50); R_DATATABLE_NUM_THREADS==unset; R_DATATABLE_THROTTLE==unset (default 1024); omp_get_thread_limit()==2147483647; omp_get_max_threads()==12; OMP_THREAD_LIMIT==unset; OMP_NUM_THREADS==unset; RestoreAfterFork==true; data.table is using 6 threads with throttle==1024. See ?setDTthreads.', .libPaths()=='//data.table.Rcheck','/usr/lib/R/library', zlibVersion()==1.3.1 ZLIB_VERSION==1.3.1
Error in test.data.table() :
2 error(s) out of 11369. Search tests/tests.Rraw for test number(s) 1590.05, 1590.06. Duration: 26.9s elapsed (29.1s cpu).
```

Here is the relevant R code, with comments indicating results on Alpine Linux:

x1 <- "fa\xE7ile"
Encoding(x1) <- "latin1"
x2 <- iconv(x1, "latin1", "UTF-8")
identical(x1, x2)  # TRUE, ok
x1 == x2           # TRUE, ok

Encoding(x2) <- "unknown"  #  <-- an invalid string in a non-UTF-8 locale
identical(x1, x2)  # TRUE on Alpine even in the C locale, but FALSE on, e.g., Ubuntu in the C locale
x1 == x2           # the same

It seems this test (1590.05) relies on (undocumented) platform-dependent behaviour for invalid strings, so should probably be dropped.
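To see why the outcome hinges on how the platform treats the unmarked string, it can help to look at the raw bytes. The following is a minimal sketch using only base R (`charToRaw`, `Encoding`, `validUTF8`); the byte values in the comments assume that `iconv` produces the standard two-byte UTF-8 encoding of ç.

```r
x1 <- "fa\xE7ile"
Encoding(x1) <- "latin1"
x2 <- iconv(x1, "latin1", "UTF-8")

# x1 stores the c-cedilla as the single latin1 byte e7,
# x2 stores it as the UTF-8 byte pair c3 a7.
charToRaw(x1)  # 66 61 e7 69 6c 65
charToRaw(x2)  # 66 61 c3 a7 69 6c 65

bytes_before <- charToRaw(x2)
Encoding(x2) <- "unknown"               # removes the UTF-8 mark only
identical(charToRaw(x2), bytes_before)  # TRUE: the bytes are untouched
Encoding(x2)                            # "unknown"
validUTF8(x2)                           # TRUE: still well-formed UTF-8, just no longer declared
```

Once the mark is gone, any comparison has to interpret the bytes `c3 a7` in the native encoding, which is presumably where the platforms diverge.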

I cannot say anything about the unexpected length-0 result of data.table:::forderv(c(x2,x1,x1,x2)) (test number 1590.06).

MichaelChirico commented 3 months ago

The nearby comments look relevant:

test(1590.03, forderv(    c(x2,x1,x1,x2)), integer())     # desirable consistent result given identical(x1, x2)
                                                          #           ^^ data.table consistent over time regardless of which version of R or locale
baseR = base::order(c(x2,x1,x1,x2))
  # Even though C locale and identical(x1,x2), base R<=4.0.0 considers the encoding too; i.e. orders the encoding together x2 (UTF-8) before x1 (latin1).
  # Then around May 2020, R-devel (but just on Windows) started either respecting identical() like data.table has always done, or put latin1 before UTF-8.
  # Jan emailed R-devel on 23 May 2020.
  # We relaxed 1590.04 and 1590.07 (tests of base R behaviour) rather than remove them, PR#4492 and its follow-up. But these two tests
  # are so relaxed now that they barely testing anything. It appears base R behaviour is undefined in this rare case of identical strings in different encodings.

It will take some time to go through the history and figure out exactly what this test was trying to do and how to handle it.

Should we consider this a potential blocker for CRAN in the near future? We're just about to release a new version -- we can just deactivate those tests in the short term if needed.

bastistician commented 3 months ago

The report shows that these two tests are not portable. If they were disabled, I could drop the --no-tests flag for data.table when mass-checking packages on Alpine Linux (against specific R patches).
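If the tests were to be skipped only where they are known to fail, rather than removed outright, one option might be to key on the platform string reported in the check log (`x86_64-pc-linux-musl`). This is a minimal sketch: `is_musl()` is a hypothetical helper, not part of data.table, and it assumes the platform triplet of such builds contains "musl".

```r
# Hypothetical helper: TRUE when the platform triplet indicates a musl libc build
# (as on Alpine Linux). Defaults to the platform of the running R session.
is_musl <- function(platform = R.version$platform) {
  grepl("musl", platform, fixed = TRUE)
}

is_musl("x86_64-pc-linux-musl")  # TRUE
is_musl("x86_64-pc-linux-gnu")   # FALSE
```

A guard of this kind only papers over the difference; dropping the tests, as suggested above, avoids depending on the platform string at all.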