Open oliviermeslin opened 3 years ago
OK, on Windows... Let me guess: does your data contain lots of non-ASCII strings? If so, have you tried converting them to UTF-8 first? If not, would you mind converting them to UTF-8 (you may use my function below) and trying again?
set_utf8_dt <- function(x) {
stopifnot(data.table::is.data.table(x))
key <- data.table::key(x)
cols <- colnames(x)
cols_str <- cols[vapply(x, is.character, logical(1L))]
for (col in cols_str) {
data.table::set(x, i = NULL, j = col, value = enc2utf8(x[[col]]))
}
data.table::setnames(x, cols, enc2utf8(cols))
if (!is.null(key)) data.table::setkeyv(x, enc2utf8(key))
invisible(x[])
}
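To see whether the encoding hypothesis applies at all, one could first check whether any character column holds non-ASCII bytes. This is only a minimal base-R sketch: `has_non_ascii` and `non_ascii_cols` are hypothetical helper names, not part of data.table, and the element-wise loop is meant for a quick check on a sample rather than on millions of rows.

```r
# Hypothetical helper: TRUE if any element of a character vector
# contains a byte outside the 7-bit ASCII range.
has_non_ascii <- function(x) {
  x <- x[!is.na(x)]
  any(vapply(
    x,
    function(s) any(as.integer(charToRaw(s)) > 127L),
    logical(1L),
    USE.NAMES = FALSE
  ))
}

# Hypothetical helper: names of the character columns that contain non-ASCII data.
non_ascii_cols <- function(df) {
  is_hit <- vapply(
    df,
    function(col) is.character(col) && has_non_ascii(col),
    logical(1L)
  )
  names(df)[is_hit]
}

# Toy data: only column "a" contains a non-ASCII character.
df <- data.frame(a = c("abc", "caf\u00e9"), b = c("x", "y"), n = 1:2,
                 stringsAsFactors = FALSE)
non_ascii_cols(df)
```

If this returns no column names, converting to UTF-8 is unlikely to change the timings.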
Thank you for your detailed and well-prepared report. Performance regression of setkey (internally forder) on character vector is a known issue. It was initially identified in #3928 and later in #4733. I am not closing this as a duplicate because it has very useful code. Moreover, I would also like to see whether @shrektan's suggestions make any difference.
@shrektan : no, my data does not contain non-ASCII strings. As you can see in my report, I generated artificial data made only of integers and doubles, so I don't think that the problem comes from encoding problems.
@jangorecki : thank you, and sorry if this issue is a duplicate, I'm not familiar with the data.table repository. You can close it if you think it's appropriate.
After submitting the issue, I dived into the source files and found that the forder function was modified several times between versions 1.11.8 and 1.12.0. As a matter of fact, I just discovered the verbose option of setkey (I edited the code above to add it). Rerunning the code with this option, it becomes clear that the problem comes from forder being less performant than before.
My bad, I didn't notice that you included the data code, which is very nice :D
Unfortunately, I can't reproduce your result by using the following code on OSX, R4.0.3
library(data.table)
setDTthreads(4L) # use 1L or 4L to test if it's affected by the cores
set.seed(1L)
dt <- data.table::data.table(
x = as.character(sample(5e6L, 5e6L, FALSE)),
y = runif(100L))
system.time(
data.table::setkey(dt, x, verbose = TRUE)
)
Below are my results against v1.11.8 and the current dev version of data.table:
forder took 3.352 sec
reorder took 0.197 sec
user system elapsed
4.557 0.053 4.524
forder took 3.317 sec
reorder took 0.153 sec
user system elapsed
4.541 0.028 4.568
forder.c received 5000000 rows and 2 columns
forder took 7.14 sec
reorder took 0.069s elapsed (0.248s cpu)
user system elapsed
7.826 0.108 4.223
forder.c received 5000000 rows and 2 columns
forder took 3.514 sec
reorder took 0.135s elapsed (0.134s cpu)
user system elapsed
4.138 0.049 4.191
Maybe a Windows only issue?
Well, I still can't reproduce your results on Windows 10 x64, R4.0.1, with data.table v1.11.8 and the current dev version. The elapsed time is very close...
Note: I built both versions of data.table from source, and I don't know whether this affects the results.
@shrektan building from source vs pre-compiled binaries can impact performance. I don't know how on Windows, but on Linux some compiler flags can control that, like -mtune=native.
@oliviermeslin could you paste the output of the following?
readLines(system.file("cc", package="data.table"))
It gives the following output: "CC=gcc -std=gnu99" "CFLAGS=-O3". No idea what it means :smile:
@oliviermeslin These are the compilation flags that the compiler, gcc in this case, used when translating C code into machine code. It would be helpful if you could install 1.13.2 from source and check whether there is a difference in performance.
You may also add the -mtune=native flag for the compiler. This tells the compiler to optimize code for the current machine, which cannot be done when binaries are compiled on a different machine, like on CRAN.
To add this flag, just create a ~/.R/Makevars file with the following content:
CC=gcc
CFLAGS=-O3 -mtune=native
Note that you need Rtools for compiling from source on Windows: https://cran.r-project.org/bin/windows/Rtools/
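To confirm which build settings are actually in effect, something like the following could be run from R. It is a sketch under the assumption that `tools::makevars_user()` is available in the R version at hand (it exists in recent releases); it only reads files and changes nothing.

```r
# Path of the user-level Makevars file R would use, or character(0) if none exists.
mv <- tools::makevars_user()

if (length(mv) == 1L && file.exists(mv)) {
  cat("Using", mv, "\n")
  print(readLines(mv))
} else {
  cat("No user Makevars file found\n")
}

# Flags recorded by the installed data.table build, i.e. the same "cc" file
# queried earlier in this thread; system.file() returns "" if it is absent.
cc_file <- system.file("cc", package = "data.table")
if (nzchar(cc_file)) print(readLines(cc_file))
```

If the Makevars change was picked up, reinstalling data.table from source should make the new CFLAGS appear in the second output.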
Thanks for your suggestion, but I think I installed all packages from source, including 1.13.2. I also have Rtools on all my computers. Does the output of readLines(system.file("cc", package="data.table")) suggest otherwise?
No, it doesn't.
I think we need to wait for a revisit of forder to figure out a fix for the performance regression.
Interestingly, I just reproduce it on R3.6.1. And I double confirm it's not reproducible on R4.0.1.
Rversion dt_version user.self sys.self elapsed
1: 3.6.1 1.11.8 6.39 0.28 6.98
2: 3.6.1 1.13.2 9.33 1.94 11.84
Rversion dt_version user.self sys.self elapsed
1: 4.0.1 1.11.8 6.53 0.63 7.78
2: 4.0.1 1.13.2 6.23 0.33 7.34
@jangorecki: I agree.
@shrektan: This is good news. I'm currently trying to run my code on my fourth server (a Linux one, this time), to see whether the problem is specific to Windows. I'll let you know if it finally works.
@jangorecki : you wrote in your first reply:
Performance regression of setkey (internally forder) on character vector is a known issue.
I just thought this morning that in my case the performance problem exists for both character and integer vectors. I don't know whether it matters for solving this issue.
@oliviermeslin thanks for pointing that out, then it is not strictly a duplicate. Issues on Windows are generally trickier because they are not as easily reproducible.
@shrektan : I re-ran all my tests on the new Windows 10 server, comparing several R versions. I confirm your finding: the performance problem of setkey is not reproducible with R 4.0.2, but is present for R 3.3.3 and R 3.6.3. Maybe this can help to figure out where the problem comes from.
R version | data.table version | user time | system time | elapsed time |
---|---|---|---|---|
3.3.3 | 1.10.4.3 | 6,83 | 0,09 | 6,80 |
3.6.3 | 1.10.4.3 | 9,70 | 0,13 | 9,67 |
4.0.2 | 1.10.4.3 | 8,10 | 0,11 | 8,08 |
3.3.3 | 1.11.8 | 6,97 | 0,11 | 6,94 |
3.6.3 | 1.11.8 | 10,08 | 0,08 | 9,99 |
4.0.2 | 1.11.8 | 8,03 | 0,11 | 8,00 |
3.3.3 | 1.12.0 | 10,31 | 14,41 | 66,55 |
3.6.3 | 1.12.0 | 12,92 | 13,25 | 82,96 |
4.0.2 | 1.12.0 | 8,97 | 4,33 | 8,22 |
3.3.3 | 1.13.0 | 9,19 | 9,79 | 68,68 |
3.6.3 | 1.13.0 | 8,43 | 7,61 | 66,22 |
4.0.2 | 1.13.0 | 7,09 | 0,75 | 6,95 |
3.3.3 | 1.13.2 | 11,78 | 20,98 | 69,75 |
3.6.3 | 1.13.2 | 12,50 | 20,18 | 66,33 |
4.0.2 | 1.13.2 | 7,41 | 0,64 | 7,17 |
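To put a number on the regression in the table above, the elapsed times for 1.11.8 and 1.12.0 can be compared directly. The sketch below copies those six values from the table (including the decimal commas) and computes the slowdown factor per R version; the `to_num` helper is a hypothetical name introduced for illustration.

```r
# Elapsed times copied from the table above (decimal commas as in the original).
elapsed <- data.frame(
  r_version = c("3.3.3", "3.6.3", "4.0.2"),
  v1.11.8   = c("6,94", "9,99", "8,00"),
  v1.12.0   = c("66,55", "82,96", "8,22"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert "66,55"-style strings to numeric.
to_num <- function(x) as.numeric(sub(",", ".", x, fixed = TRUE))

# Slowdown factor of 1.12.0 relative to 1.11.8, per R version.
elapsed$slowdown <- to_num(elapsed$v1.12.0) / to_num(elapsed$v1.11.8)
print(elapsed[, c("r_version", "slowdown")])
```

On these numbers the slowdown is roughly 9.6x on R 3.3.3 and 8.3x on R 3.6.3, against about 1.03x on R 4.0.2. In the slow rows the elapsed time is also far larger than user plus system time, which may point to waiting rather than computation, although the table alone cannot establish that.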
This is amazing documentation. Regarding character vs. integer, is there profiling of an integer column only that shows performance degradation? The timings seemed based on as.character(sample(5e6L, 5e6L, FALSE)). Note, I'd propose maybe closing the other similar issues; this is pretty definitive.
Also... since 4.0.2 addresses this, are issues ever closed by new versions of R?
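One way to answer the integer-versus-character question would be to time the same key once as integer and once as character. The following is only a sketch: `time_setkey`, `int_key` and `chr_key` are hypothetical names, the row count is kept small so it runs quickly, and the function returns NULL when data.table is not installed.

```r
# Hypothetical helper: time setkeyv() on a single key column built by make_key().
time_setkey <- function(make_key, n = 5e6L, seed = 1L) {
  if (!requireNamespace("data.table", quietly = TRUE)) {
    return(NULL)  # data.table not installed; nothing to measure
  }
  set.seed(seed)
  dt <- data.table::data.table(x = make_key(n), y = runif(n))
  system.time(data.table::setkeyv(dt, "x"))
}

int_key <- function(n) sample(n, n, FALSE)                # integer key
chr_key <- function(n) as.character(sample(n, n, FALSE))  # character key, as in the thread

# Small n so the sketch runs quickly; use n = 5e6L to mirror the thread.
res_int <- time_setkey(int_key, n = 1e4L)
res_chr <- time_setkey(chr_key, n = 1e4L)
print(res_int)
print(res_chr)
```

Running both calls under each R and data.table combination would show whether the integer case degrades in the same way as the character case.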
About the version, since we depend on 3.1, if we can identify a root cause fix we can do on our side, we should do it. My guess is such fixes should usually translate to performance improvements at HEAD as well. That said, prioritization is harder.
I think generally users looking for best performance should be using recent R & recent data.table (and when that's not true it's a priority to fix/mitigate if there was some explicit tradeoff made). If indeed we can attribute it to R specifically, we can probably move on; it comes back to striving to understand the root cause.
Just my 2 cents
I just want to flag/remind that this kind of performance regression may be hard to reproduce:
Interestingly, I just reproduce it on R3.6.1. And I double confirm it's not reproducible on R4.0.1.
R3.6.1
Rversion dt_version user.self sys.self elapsed
1: 3.6.1 1.11.8 6.39 0.28 6.98
2: 3.6.1 1.13.2 9.33 1.94 11.84
R4.0.1
Rversion dt_version user.self sys.self elapsed
1: 4.0.1 1.11.8 6.53 0.63 7.78
2: 4.0.1 1.13.2 6.23 0.33 7.34
In my experience, performance of R code can vary considerably from one machine to another. Differences can be observed not just in absolute run time (as expected) but even in the relative performance. For example, I have one Windows computer in which a particular inner join is 5 times as fast using data.table over dplyr::inner_join and another Windows computer in which it is twice as slow! (So much so that I actually switch the method based on the value of Sys.getenv("COMPUTERNAME")!)
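A per-machine switch of that kind might look roughly like this. The host name and both function names are hypothetical placeholders; COMPUTERNAME is an environment variable set on Windows, so on other systems the default argument is simply an empty string.

```r
# Hypothetical list of machines on which data.table was measured to be faster.
fast_dt_hosts <- c("WORKSTATION-A")

# Hypothetical helper: pick a join backend from the machine name.
choose_join_backend <- function(host = Sys.getenv("COMPUTERNAME")) {
  if (host %in% fast_dt_hosts) "data.table" else "dplyr"
}

choose_join_backend("WORKSTATION-A")
choose_join_backend("LAPTOP-B")
```

The returned string would then decide which join implementation the calling code dispatches to.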
I would keep the other issues open and close them once a fix is ready and we have tested the exact code examples there.
hi, I am trying to reproduce this issue, but I am unable to install either data.table version 1.12.0 or the prior version 1.11.8 (error at linker step, see below). For info, data.table version 1.12.8 installs with compiler and linker warnings, but 1.12.6 or anything before that does not install (error at linker step). I tried both R-4.3 and R-3.4.4, with gcc-10.1.0. Any advice about how to reproduce the timings described in the original post?
* installing to library ‘/home/tdhock/lib/R/library’
* installing *source* package ‘data.table’ ...
** using staged installation
** libs
using C compiler: ‘gcc (GCC) 10.1.0’
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -Wstrict-prototypes -c assign.c -o assign.o
In file included from assign.c:1:
data.table.h:58:1: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
58 | void setSizes();
| ^~~~
data.table.h:88:1: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
88 | void savetl_init(), savetl(SEXP s), savetl_end();
| ^~~~
...
gcc -shared -L/home/tdhock/lib/R/lib -L/home/tdhock/lib -Wl,-rpath=/home/tdhock/lib -L/home/tdhock/.local/share/r-miniconda/lib -Wl,-rpath=/home/tdhock/.local/share/r-miniconda/lib -o data.table.so assign.o between.o bmerge.o chmatch.o dogroups.o fastmean.o fcast.o fmelt.o forder.o frank.o fread.o freadR.o fsort.o fwrite.o fwriteR.o gsumm.o ijoin.o init.o inrange.o nqrecreateindices.o openmp-utils.o quickselect.o rbindlist.o reorder.o shift.o subset.o transpose.o uniqlist.o vecseq.o wrappers.o -fopenmp -L/home/tdhock/lib/R/lib -lR
between.o:(.bss+0x0): multiple definition of `char_integer64'
assign.o:(.bss+0x0): first defined here
between.o:(.bss+0x8): multiple definition of `char_ITime'
assign.o:(.bss+0x8): first defined here
between.o:(.bss+0x10): multiple definition of `char_IDate'
assign.o:(.bss+0x10): first defined here
between.o:(.bss+0x18): multiple definition of `char_Date'
assign.o:(.bss+0x18): first defined here
...
wrappers.o:(.bss+0x68): multiple definition of `sym_starts'
assign.o:(.bss+0x68): first defined here
wrappers.o:(.bss+0x70): multiple definition of `char_starts'
assign.o:(.bss+0x70): first defined here
wrappers.o:(.bss+0x78): multiple definition of `sym_maxgrpn'
assign.o:(.bss+0x78): first defined here
wrappers.o:(.bss+0x80): multiple definition of `NA_INT64_D'
assign.o:(.bss+0x80): first defined here
wrappers.o:(.bss+0x88): multiple definition of `NA_INT64_LL'
assign.o:(.bss+0x88): first defined here
wrappers.o:(.bss+0xa0): multiple definition of `sizes'
assign.o:(.bss+0xa0): first defined here
wrappers.o:(.bss+0x3c0): multiple definition of `SelfRefSymbol'
assign.o:(.bss+0x3c0): first defined here
wrappers.o:(.bss+0x3c8): multiple definition of `twiddle'
assign.o:(.bss+0x3c8): first defined here
/usr/bin/ld: warning: /home/tdhock/.local/share/r-miniconda/lib/libgomp.so: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
/usr/bin/ld: warning: /home/tdhock/.local/share/r-miniconda/lib/libgomp.so: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002
/usr/bin/ld: warning: /home/tdhock/.local/share/r-miniconda/lib/libgcc_s.so.1: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
/usr/bin/ld: warning: /home/tdhock/.local/share/r-miniconda/lib/libgcc_s.so.1: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002
/usr/bin/ld: warning: /home/tdhock/.local/share/r-miniconda/lib/libgcc_s.so.1: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
/usr/bin/ld: warning: /home/tdhock/.local/share/r-miniconda/lib/libgcc_s.so.1: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002
collect2: error: ld returned 1 exit status
/home/tdhock/lib/R/share/make/shlib.mk:10: recipe for target 'data.table.so' failed
make: *** [data.table.so] Error 1
ERROR: compilation failed for package ‘data.table’
looks like you're passing -Wstrict-prototypes as a compilation flag, try turning that off
hi Michael, thanks for the advice.
I had that defined in CFLAGS in ~/.R/Makevars.
removing -Wstrict-prototypes does remove those warnings, but it does not fix the linker issues.
Below I show the output, after having removed custom CFLAGS and LDFLAGS in ~/.R/Makevars.
(base) tdhock@maude-MacBookPro:~/R/data.table((no branch, bisect started on eed712ef))$ git checkout 1.11.8 && rm -f src/*.o && R CMD INSTALL .
HEAD is now at 76bb569f 1.11.8 submitted to CRAN. Bump to 1.11.9
Loading required package: grDevices
* installing to library ‘/home/tdhock/lib/R/library’
* installing *source* package ‘data.table’ ...
** using staged installation
** libs
using C compiler: ‘gcc (GCC) 10.1.0’
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c assign.c -o assign.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c between.c -o between.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c bmerge.c -o bmerge.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c chmatch.c -o chmatch.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c dogroups.c -o dogroups.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c fastmean.c -o fastmean.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c fcast.c -o fcast.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c fmelt.c -o fmelt.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c forder.c -o forder.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c frank.c -o frank.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c fread.c -o fread.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c freadR.c -o freadR.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c fsort.c -o fsort.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c fwrite.c -o fwrite.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c fwriteR.c -o fwriteR.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c gsumm.c -o gsumm.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c ijoin.c -o ijoin.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c init.c -o init.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c inrange.c -o inrange.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c nqrecreateindices.c -o nqrecreateindices.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c openmp-utils.c -o openmp-utils.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c quickselect.c -o quickselect.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c rbindlist.c -o rbindlist.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c reorder.c -o reorder.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c shift.c -o shift.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c subset.c -o subset.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c transpose.c -o transpose.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c uniqlist.c -o uniqlist.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c vecseq.c -o vecseq.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG -I/usr/local/include -fopenmp -fpic -g -O2 -c wrappers.c -o wrappers.o
gcc -shared -L/home/tdhock/lib/R/lib -L/usr/local/lib -o data.table.so assign.o between.o bmerge.o chmatch.o dogroups.o fastmean.o fcast.o fmelt.o forder.o frank.o fread.o freadR.o fsort.o fwrite.o fwriteR.o gsumm.o ijoin.o init.o inrange.o nqrecreateindices.o openmp-utils.o quickselect.o rbindlist.o reorder.o shift.o subset.o transpose.o uniqlist.o vecseq.o wrappers.o -fopenmp -L/home/tdhock/lib/R/lib -lR
between.o:/home/tdhock/R/data.table/src/data.table.h:95: multiple definition of `twiddle'
assign.o:/home/tdhock/R/data.table/src/data.table.h:95: first defined here
between.o:/home/tdhock/R/data.table/src/data.table.h:84: multiple definition of `SelfRefSymbol'
assign.o:/home/tdhock/R/data.table/src/data.table.h:84: first defined here
between.o:/home/tdhock/R/data.table/src/data.table.h:83: multiple definition of `sizes'
assign.o:/home/tdhock/R/data.table/src/data.table.h:83: first defined here
...
wrappers.o:/home/tdhock/R/data.table/src/data.table.h:59: multiple definition of `char_integer64'
assign.o:/home/tdhock/R/data.table/src/data.table.h:59: first defined here
collect2: error: ld returned 1 exit status
/home/tdhock/lib/R/share/make/shlib.mk:10: recipe for target 'data.table.so' failed
make: *** [data.table.so] Error 1
ERROR: compilation failed for package ‘data.table’
Does it work for you?
I get linker errors too, except I get 100s of them
The issue has been solved by newer versions of R. I think if we cannot reproduce it on R 4+ we could as well close the issue, rather than trying to examine the R code that might have fixed it, as long as results in R < 4 are correct and the issue is speed only.
I am trying to reproduce this. I installed R-3.6.3 on Windows from https://cloud.r-project.org/bin/windows/base/old/3.6.3/ and Rtools35.exe from https://cran.r-project.org/bin/windows/Rtools/history.html and then put -std=c99 in my ~/.R/Makevars, but I got an error about mman.h not being found. Does anybody know how to fix that?
th798@cmp2986 ~/R/data.table
$ git checkout 1.11.6
Note: switching to '1.11.6'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
git switch -c <new-branch-name>
Or undo this operation with:
git switch -
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at a4e26b50 1.11.6 on CRAN. Bump to 1.11.7
th798@cmp2986 ~/R/data.table
$ R CMD INSTALL .
During startup - Warning message:
Setting LC_CTYPE=en_US.UTF-8 failed
* installing to library 'C:/Users/th798/R/win-library/3.6'
* installing *source* package 'data.table' ...
** using staged installation
** libs
*** arch - i386
c:/Rtools/mingw_32/bin/gcc -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG -fopenmp -Wformat-extra-args -std=c99 -c assign.c -o assign.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG -fopenmp -Wformat-extra-args -std=c99 -c between.c -o between.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG -fopenmp -Wformat-extra-args -std=c99 -c bmerge.c -o bmerge.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG -fopenmp -Wformat-extra-args -std=c99 -c chmatch.c -o chmatch.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG -fopenmp -Wformat-extra-args -std=c99 -c dogroups.c -o dogroups.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG -fopenmp -Wformat-extra-args -std=c99 -c fastmean.c -o fastmean.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG -fopenmp -Wformat-extra-args -std=c99 -c fcast.c -o fcast.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG -fopenmp -Wformat-extra-args -std=c99 -c fmelt.c -o fmelt.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG -fopenmp -Wformat-extra-args -std=c99 -c forder.c -o forder.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG -fopenmp -Wformat-extra-args -std=c99 -c frank.c -o frank.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG -fopenmp -Wformat-extra-args -std=c99 -c fread.c -o fread.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
fread.c:14:33: fatal error: sys/mman.h: No such file or directory
#include <sys/mman.h> // mmap
^
compilation terminated.
make: *** [fread.o] Error 1
ERROR: compilation failed for package 'data.table'
* removing 'C:/Users/th798/R/win-library/3.6/data.table'
* restoring previous 'C:/Users/th798/R/win-library/3.6/data.table'
actually this mman.h not found seems to be happening with recent R/rtools too, so I guess I need to figure that separate issue out first.
R Under development (unstable) (2023-11-26 r85638 ucrt) -- "Unsuffered Consequences"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> setwd('c:/Program Files/Emacs/x86_64/bin')
> install.packages("~/R/data.table",repos=NULL,type="source")
* installing *source* package 'data.table' ...
** using staged installation
**********************************************
WARNING: this package has a configure script
It probably needs manual configuration
**********************************************
** libs
using C compiler: 'gcc.exe (GCC) 12.2.0'
gcc -I"c:/PROGRA~1/R/R-devel/include" -DNDEBUG -I"C:/rtools43/x86_64-w64-mingw32.static.posix/include" -fopenmp -Wformat-extra-args -std=c99 -c fread.c -o fread.o
cc1.exe: warning: '-Wformat-extra-args' ignored without '-Wformat' [-Wformat-extra-args]
fread.c:16:12: fatal error: sys/mman.h: No such file or directory
16 | #include <sys/mman.h> // mmap
| ^~~~~~~~~~~~
compilation terminated.
make: *** [c:/PROGRA~1/R/R-devel/etc/x64/Makeconf:282: fread.o] Error 1
ERROR: compilation failed for package 'data.table'
* removing 'C:/Program Files/R/R-devel/library/data.table'
* restoring previous 'C:/Program Files/R/R-devel/library/data.table'
Warning message:
In install.packages("~/R/data.table", repos = NULL, type = "source") :
installation of package 'C:\Users\th798/R/data.table' had non-zero exit status
> Sys.which("gcc")
gcc
"C:\\rtools43\\X86_64~1.POS\\bin\\gcc.exe"
the mman.h not found error happens with -std=c99 flag (with current R, or old R).
Using old R-3.6.3 and gcc 12.3.0 I still get those linker errors. Maybe to reproduce we need an older compiler?
tl;dr
The setkey function is much slower in all versions of data.table from 1.12.0.
Summary
Context: I manipulate large datasets (~50 million rows, 50 columns) with data.table on a daily basis. I work with three different computers: an old legacy Windows 2008 server, a more recent Windows 10 server, and my local computer. The available versions of R and data.table differ significantly in each setting.
Problem: I noticed several times that the speed of the setkey function varies considerably depending on the setting I work in: for one of the datasets I work with (54 million rows with a key uniquely identifying each row), the setkey call may take 2 seconds or 13 minutes.
To find out where this came from, I ran the same code with several versions of data.table, from 1.10.4 to 1.13.2, in the three settings. The code and all session info are below. I found the same result every time: versions older than or equal to 1.11.8 are very fast, and later versions are much slower (approximately 200 to 400 times).
Results
In this table, I put the execution time of setkey on a fake dataset (5 million rows), measured with system.time().
data.table version
Code
This code installs several versions of data.table in separate libraries, and measures the execution time of setkey on an artificial dataset.
Session Infos
Old Windows 2008 legacy server
New Windows 10 server
Local computer