Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Low performance of setkey function for version 1.12.0 and later #4788

Open oliviermeslin opened 3 years ago

oliviermeslin commented 3 years ago

tl;dr

The setkey function is much slower in all versions of data.table from 1.12.0.

Summary

Context: I manipulate large datasets (~50 million rows, 50 columns) with data.table on a daily basis. I work on three different computers: an old legacy Windows 2008 server, a more recent Windows 10 server, and my local computer. The available versions of R and data.table differ significantly between them.

Problem: I have noticed several times that the speed of the setkey function varies considerably depending on the machine I work on: for one of my datasets (54 million rows, with a key uniquely identifying each row), the setkey call may take anywhere from 2 seconds to 13 minutes.

To find out where this came from, I ran the same code with several versions of data.table, from 1.10.4 to 1.13.2, in all three settings. The code and all session info are below. I found the same result every time: versions up to and including 1.11.8 are very fast, and later versions are much slower (approximately 200 to 400 times slower).

Results

The table below reports the execution time of setkey on a fake dataset (5 million rows), in seconds, measured with system.time().

R version  data.table version  user time  system time  elapsed time
New server
3.6.3      1.10.4.3                 5.72         0.15          5.73
3.6.3      1.11.0                   5.52         0.15          5.53
3.6.3      1.11.2                   5.54         0.14          5.55
3.6.3      1.11.4                   5.64         0.11          5.61
3.6.3      1.11.6                   5.48         0.19          5.55
3.6.3      1.11.8                   5.54         0.23          5.64
3.6.3      1.12.0                  11.67        19.15         76.69
3.6.3      1.12.8                   8.58        10.78         65.73
3.6.3      1.13.2                  10.55        24.20         70.95
Legacy server
3.3.3      1.10.4.3                 3.57         0.08          3.52
3.3.3      1.11.0                   3.61         0.07          3.56
3.3.3      1.11.2                   3.37         0.17          3.42
3.3.3      1.11.4                   3.55         0.03          3.47
3.3.3      1.11.6                   3.45         0.06          3.38
3.3.3      1.11.8                   3.46         0.06          3.43
3.3.3      1.12.0                   8.47        16.19         39.59
3.3.3      1.12.8                   8.81        16.91         39.54
3.3.3      1.13.2                   8.14        13.61         39.49
Local computer
3.5.3      1.10.4.3                 4.15         0.22          4.29
3.5.3      1.11.0                   4.28         0.04          4.17
3.5.3      1.11.2                   4.49         0.06          4.62
3.5.3      1.11.4                   4.35         0.12          4.40
3.5.3      1.11.6                   4.29         0.08          4.26
3.5.3      1.11.8                   4.21         0.06          4.21
3.5.3      1.12.0                  13.34        17.22         31.16
3.5.3      1.12.8                  12.92        15.73         28.62
3.5.3      1.13.2                  12.77        16.28         29.02
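
As a rough sanity check on these timings, the slowdown between 1.11.8 and 1.12.0 can be computed directly from the reported elapsed times. The sketch below hard-codes the six elapsed values from the results above; the variable and column names are illustrative only.

```r
# Elapsed times (seconds) copied from the results above
elapsed <- data.frame(
  machine = c("New server", "Legacy server", "Local computer"),
  v1.11.8 = c(5.64, 3.43, 4.21),
  v1.12.0 = c(76.69, 39.59, 31.16)
)

# Slowdown factor of 1.12.0 relative to 1.11.8 on the 5-million-row test
elapsed$slowdown <- round(elapsed$v1.12.0 / elapsed$v1.11.8, 1)
print(elapsed)
```

On this artificial dataset the elapsed time grows by roughly 7 to 14 times, depending on the machine.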

Code

This code installs several versions of data.table in separate libraries, and measures the execution time of setkey on an artificial dataset.

# The directory you use for the tests
test_dir <- "path/to/test_dir/"  # Placeholder: define your own directory, keeping the trailing slash

# Keep the R version for the final table
Rversion <- paste0(R.version$major, ".", R.version$minor)

# All the versions we will test
versions <- list(
  c(paste0("R", Rversion, "-", "dt1-10-4"), "1.10.4-3"),
  c(paste0("R", Rversion, "-", "dt1-11-0"), "1.11.0"),
  c(paste0("R", Rversion, "-", "dt1-11-2"), "1.11.2"),
  c(paste0("R", Rversion, "-", "dt1-11-4"), "1.11.4"),
  c(paste0("R", Rversion, "-", "dt1-11-6"), "1.11.6"),
  c(paste0("R", Rversion, "-", "dt1-11-8"), "1.11.8"),
  c(paste0("R", Rversion, "-", "dt1-12-0"), "1.12.0"),
  c(paste0("R", Rversion, "-", "dt1-12-8"), "1.12.8"),
  c(paste0("R", Rversion, "-", "dt1-13-2"), "1.13.2")
)

#############################
# Part 1: installing all data.table versions

# Function installing all versions of data.table in separate temporary libraries
install_version_dt <- function(infos) {

  package_lib <- paste0(test_dir, infos[1])
  package_version <- infos[2]

  # Create temporary library
  try(unlink(package_lib, recursive = TRUE))
  dir.create(package_lib)

  # Install package version
  devtools::install_version("data.table", version = package_version, lib = package_lib)
  try(unloadNamespace("data.table"))  # the namespace name must be quoted
}

# Install all data.table versions
lapply(versions, install_version_dt)

#############################
# Part 2: measuring execution time of setkey

# Function measuring the execution time of setkey on artificial data
# with different versions of data.table
test_version_dt <- function(package_version, Rversion) {

  # Keep the old library paths
  old_libpath <- .libPaths()

  adresse_lib <- paste0(test_dir, package_version)
  .libPaths(adresse_lib)
  print(packageVersion("data.table"))
  dt_version <- packageVersion("data.table")

  library("data.table")    
  print(.libPaths())

  set.seed(1L)
  dt <- data.table::data.table(
    x = as.character(sample(5e6L, 5e6L, FALSE)), 
    y = runif(100L))

  results <- system.time(
    {
      data.table::setkey(dt, x, verbose = TRUE)
    }
  )

  # Make sure we unload the package
  try(unloadNamespace("data.table"))  # the namespace name must be quoted
  try(detach('package:data.table', unload = TRUE))

  # Restore the old library paths
  .libPaths(old_libpath)
  print(.libPaths())

  return(
    list(
      "Rversion"   = Rversion, 
      "dt_version" = as.character(dt_version),
      "user.self"  = as.numeric(results["user.self"]),
      "sys.self"   = as.numeric(results["sys.self"]),
      "elapsed"    = as.numeric(results["elapsed"])))

}

# make the list of all temporary libraries
folder_list <- vapply(versions, function(x) x[1], character(1L))

results_list <- lapply(folder_list, test_version_dt, Rversion)

#############################
# Part 3: summarizing results

results_df <- data.table::rbindlist(results_list)
print(results_df)

Session Infos

Old Windows 2008 legacy server

R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.3.3

New Windows 10 server

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C                   LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.3 parallel_3.6.3 tools_3.6.3    Rcpp_1.0.5     fst_0.9.4 

Local computer

R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3        rstudioapi_0.11   magrittr_1.5      usethis_1.5.1    
 [5] devtools_2.2.2    pkgload_1.0.2     R6_2.4.1          rlang_0.4.6      
 [9] fansi_0.4.1       tools_3.5.3       pkgbuild_1.0.6    data.table_1.13.2
[13] sessioninfo_1.1.1 cli_2.0.2         withr_2.1.2       ellipsis_0.3.1   
[17] remotes_2.1.1     yaml_2.2.1        assertthat_0.2.1  digest_0.6.25    
[21] rprojroot_1.3-2   crayon_1.3.4      processx_3.4.2    callr_3.4.2      
[25] fs_1.3.1          ps_1.3.2          testthat_2.3.1    memoise_1.1.0    
[29] glue_1.4.1        compiler_3.5.3    desc_1.2.0        backports_1.1.5  
[33] prettyunits_1.1.1

shrektan commented 3 years ago

OK, on Windows... Let me guess: does your data contain lots of non-ASCII strings? If so, have you tried converting them to UTF-8 first? If not, would you mind converting them to UTF-8 (you may use my function below) and trying again?

set_utf8_dt <- function(x) {
  stopifnot(data.table::is.data.table(x))
  key <- data.table::key(x)
  cols <- colnames(x)
  cols_str <- cols[vapply(x, is.character, logical(1L))]
  for (col in cols_str) {
    data.table::set(x, i = NULL, j = col, value = enc2utf8(x[[col]]))
  }
  data.table::setnames(x, cols, enc2utf8(cols))
  if (!is.null(key)) data.table::setkeyv(x, enc2utf8(key))
  invisible(x[])
}

jangorecki commented 3 years ago

Thank you for your detailed and well-prepared report. The performance regression of setkey (internally forder) on character vectors is a known issue. It was initially identified in #3928 and later in #4733. I am not closing this as a duplicate because it has very useful code. Moreover, I would also like to see whether @shrektan's suggestion makes any difference.

oliviermeslin commented 3 years ago

@shrektan: no, my data does not contain non-ASCII strings. As you can see in my report, I generated artificial data made only of integers and doubles, so I don't think the problem comes from encoding.
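
One way to double-check this kind of claim is to test the bytes of each string. This is a minimal sketch: `is_ascii` is a hypothetical helper name (not part of data.table), and the sample is a smaller version of the key column from the report.

```r
# TRUE for each element whose bytes are all in the 7-bit ASCII range
is_ascii <- function(x) {
  vapply(x, function(s) all(as.integer(charToRaw(s)) < 128L), logical(1L),
         USE.NAMES = FALSE)
}

set.seed(1L)
# Same construction as the key column in the report, only smaller
x <- as.character(sample(1000L, 1000L, FALSE))

all(is_ascii(x))                 # digits only, so every element is ASCII
is_ascii(c("abc", "caf\u00e9"))  # the accented string is flagged as non-ASCII
```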

oliviermeslin commented 3 years ago

@jangorecki: thank you, and sorry if this issue is a duplicate; I'm not familiar with the data.table repository. You can close it if you think it's appropriate.

After submitting the issue, I dived into the source files and found that the forder function was modified several times between versions 1.11.8 and 1.12.0. As a matter of fact, I just discovered the verbose option of setkey (I edited the code above to add it). Rerunning the code with this option makes it clear that the problem comes from forder being slower than before.

shrektan commented 3 years ago

My bad, I didn't notice that you included the data code, which is very nice :D

Unfortunately, I can't reproduce your result by using the following code on OSX, R4.0.3

code

library(data.table)
setDTthreads(4L) # use 1L or 4L to test if it's affected by the cores
set.seed(1L)
dt <- data.table::data.table(
  x = as.character(sample(5e6L, 5e6L, FALSE)), 
  y = runif(100L))
system.time(
  data.table::setkey(dt, x, verbose = TRUE)
)

Below are my results against v1.11.8 and the current dev version of data.table:

v1.11.8

4 core

forder took 3.352 sec
reorder took 0.197 sec
   user  system elapsed 
  4.557   0.053   4.524 

1 core

forder took 3.317 sec
reorder took 0.153 sec
   user  system elapsed 
  4.541   0.028   4.568 

current dev version

4 core

forder.c received 5000000 rows and 2 columns
forder took 7.14 sec
reorder took 0.069s elapsed (0.248s cpu) 
   user  system elapsed 
  7.826   0.108   4.223

1 core

forder.c received 5000000 rows and 2 columns
forder took 3.514 sec
reorder took 0.135s elapsed (0.134s cpu) 
   user  system elapsed 
  4.138   0.049   4.191 

Maybe a Windows only issue?

shrektan commented 3 years ago

Well, I still can't reproduce your results on Windows 10 x64, R 4.0.1, with data.table v1.11.8 and the current dev version. The elapsed times are very close...

Note: I built both versions of data.table from source, and I don't know whether this affects the result.

jangorecki commented 3 years ago

@shrektan building from source versus using pre-compiled binaries can impact performance. I don't know how on Windows, but on Linux some compiler flags can control that, like -mtune=native. @oliviermeslin could you paste the output of the following?

readLines(system.file("cc", package="data.table"))

oliviermeslin commented 3 years ago

It gives the following output: "CC=gcc -std=gnu99" "CFLAGS=-O3". No idea what it means :smile:
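
For readers in the same position: each line of that output is a `KEY=value` pair. `CC` names the C compiler and the language standard it was asked to use, and `CFLAGS` lists the flags passed to it (`-O3` is a high optimization level). The sketch below parses the two reported lines into a named list; the lines are hard-coded so the example runs without data.table installed, and the variable names are illustrative.

```r
# Output reported above, hard-coded so the example runs without data.table
cc_lines <- c("CC=gcc -std=gnu99", "CFLAGS=-O3")

# Split each "KEY=value" line at the first "=" only
keys   <- sub("=.*$", "", cc_lines)
values <- sub("^[^=]*=", "", cc_lines)
flags  <- setNames(as.list(values), keys)

flags$CC      # compiler command and language standard
flags$CFLAGS  # optimization flags passed to the compiler
```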

jangorecki commented 3 years ago

@oliviermeslin These are the compilation flags that the compiler, gcc in this case, used when translating the C code into machine code. It would be helpful if you could install 1.13.2 from source and check whether there is a difference in performance. You may also add the -mtune=native flag for the compiler. This tells the compiler to optimize code for the current machine, which cannot be done when binaries are compiled on a different machine, like on CRAN. To add this flag, just create a ~/.R/Makevars file with the following content:

CC=gcc
CFLAGS=-O3 -mtune=native

Note that you need Rtools for compiling from source on Windows: https://cran.r-project.org/bin/windows/Rtools/
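
If editing ~/.R/Makevars permanently is not desirable, the same flags can be tried for a single session. The sketch below is an assumption layered on the suggestion above, not something proposed in this thread: it writes the two lines to a temporary file and points R at it through the `R_MAKEVARS_USER` environment variable, which R consults when building packages from source.

```r
# Write the two suggested lines to a temporary Makevars file
makevars <- tempfile("Makevars-")
writeLines(c("CC=gcc", "CFLAGS=-O3 -mtune=native"), makevars)

# Point R at this file for source installs started from this session
old <- Sys.getenv("R_MAKEVARS_USER", unset = NA)
Sys.setenv(R_MAKEVARS_USER = makevars)

readLines(makevars)

# A source install would then pick up these flags, for example:
# install.packages("data.table", type = "source")

# Restore the previous setting afterwards
if (is.na(old)) Sys.unsetenv("R_MAKEVARS_USER") else Sys.setenv(R_MAKEVARS_USER = old)
```

The install call is left commented out because it needs a working compiler toolchain and network access.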

oliviermeslin commented 3 years ago

Thanks for your suggestion, but I think I installed all packages from source, including the 1.13.2. I also have Rtools on all my computers. Does the output of readLines(system.file("cc", package="data.table")) suggest otherwise?

jangorecki commented 3 years ago

No, it doesn't. I think we need to wait for a revisit of forder to figure out how to fix the performance regression.

shrektan commented 3 years ago

Interestingly, I just reproduced it on R 3.6.1, and I can confirm it's not reproducible on R 4.0.1.

R3.6.1

   Rversion dt_version user.self sys.self elapsed
1:    3.6.1     1.11.8      6.39     0.28    6.98
2:    3.6.1     1.13.2      9.33     1.94   11.84

R4.0.1

   Rversion dt_version user.self sys.self elapsed
1:    4.0.1     1.11.8      6.53     0.63    7.78
2:    4.0.1     1.13.2      6.23     0.33    7.34

oliviermeslin commented 3 years ago

@jangorecki: I agree.

@shrektan: This is good news. I'm currently trying to run my code on my fourth server (a Linux one, this time), to see whether the problem is specific to Windows. I'll let you know if it finally works.

oliviermeslin commented 3 years ago

@jangorecki : you wrote in your first reply:

Performance regression of setkey (internally forder) on character vector is a known issue.

I just thought this morning that in my case the performance problem exists for both character and integer vectors. I don't know whether it matters for solving this issue.

jangorecki commented 3 years ago

@oliviermeslin thanks for pointing that out; then it is not strictly a duplicate. Windows is generally trickier because issues there are not as easily reproducible.

oliviermeslin commented 3 years ago

@shrektan: I re-ran all my tests on the new Windows 10 server, comparing several R versions. I confirm your finding: the setkey performance problem is not reproducible with R 4.0.2, but is present with R 3.3.3 and R 3.6.3. Maybe this can help figure out where the problem comes from.

R version  data.table version  user time  system time  elapsed time
3.3.3      1.10.4.3                 6.83         0.09          6.80
3.6.3      1.10.4.3                 9.70         0.13          9.67
4.0.2      1.10.4.3                 8.10         0.11          8.08
3.3.3      1.11.8                   6.97         0.11          6.94
3.6.3      1.11.8                  10.08         0.08          9.99
4.0.2      1.11.8                   8.03         0.11          8.00
3.3.3      1.12.0                  10.31        14.41         66.55
3.6.3      1.12.0                  12.92        13.25         82.96
4.0.2      1.12.0                   8.97         4.33          8.22
3.3.3      1.13.0                   9.19         9.79         68.68
3.6.3      1.13.0                   8.43         7.61         66.22
4.0.2      1.13.0                   7.09         0.75          6.95
3.3.3      1.13.2                  11.78        20.98         69.75
3.6.3      1.13.2                  12.50        20.18         66.33
4.0.2      1.13.2                   7.41         0.64          7.17

ColeMiller1 commented 3 years ago

This is amazing documentation. Regarding character vs. integer, is there profiling of an integer column only that shows performance degradation? The timings seemed based on as.character(sample(5e6L, 5e6L, FALSE)). Note, I'd propose maybe closing the other similar issues; this is pretty definitive.

Also... since 4.0.2 addresses this, are issues ever closed by new versions of R?

MichaelChirico commented 3 years ago

About the version: since we depend on R 3.1, if we can identify a root-cause fix that we can make on our side, we should make it. My guess is that such fixes should usually translate to performance improvements at HEAD as well. That said, prioritization is harder.

I think generally users looking for best performance should be using recent R & recent data.table (and when that's not true it's a priority to fix/mitigate if there was some explicit tradeoff made). If indeed we can attribute it to R specifically, we can probably move on; it comes back to striving to understand the root cause.

Just my 2 cents

HughParsonage commented 3 years ago

I just want to flag/remind that this kind of performance regression may be hard to reproduce:

Interestingly, I just reproduce it on R3.6.1. And I double confirm it's not reproducible on R4.0.1.

R3.6.1

   Rversion dt_version user.self sys.self elapsed
1:    3.6.1     1.11.8      6.39     0.28    6.98
2:    3.6.1     1.13.2      9.33     1.94   11.84

R4.0.1

   Rversion dt_version user.self sys.self elapsed
1:    4.0.1     1.11.8      6.53     0.63    7.78
2:    4.0.1     1.13.2      6.23     0.33    7.34

In my experience, performance of R code can vary considerably from one machine to another. Differences can be observed not just in absolute run time (as expected) but even in the relative performance. For example, I have one Windows computer in which a particular inner join is 5 times as fast using data.table over dplyr::inner_join and another Windows computer in which it is twice as slow! (So much so that I actually switch the method based on the value of Sys.getenv("COMPUTERNAME")!)
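
The switching described above could look roughly like the following. This is only a sketch: the host names are hypothetical, and `choose_join_method` is an illustrative helper, not an existing function in either package.

```r
# Hypothetical machine names; replace with the hosts where each method wins
slow_data_table_hosts <- c("OFFICE-PC-02")

choose_join_method <- function(host = Sys.getenv("COMPUTERNAME")) {
  # COMPUTERNAME is set on Windows; elsewhere it is usually empty
  if (host %in% slow_data_table_hosts) "dplyr" else "data.table"
}

choose_join_method("OFFICE-PC-02")  # "dplyr"
choose_join_method("OFFICE-PC-01")  # "data.table"
```

The returned label can then be used to dispatch to whichever join implementation is faster on that machine.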

jangorecki commented 3 years ago

I would keep the other issues open and close them once a fix is ready and we have tested the exact code examples there.

tdhock commented 10 months ago

Hi, I am trying to reproduce this issue, but I am unable to install either data.table version 1.12.0 or the prior version 1.11.8 (error at the linker step, see below). For info, data.table version 1.12.8 installs with compiler and linker warnings, but 1.12.6 and anything before it does not install (error at the linker step). I tried both R-4.3 and R-3.4.4, with gcc-10.1.0. Any advice on how to reproduce the timings described in the original post?

* installing to library ‘/home/tdhock/lib/R/library’
* installing *source* package ‘data.table’ ...
** using staged installation
** libs
using C compiler: ‘gcc (GCC) 10.1.0’
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -Wstrict-prototypes  -c assign.c -o assign.o
In file included from assign.c:1:
data.table.h:58:1: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
   58 | void setSizes();
      | ^~~~
data.table.h:88:1: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
   88 | void savetl_init(), savetl(SEXP s), savetl_end();
      | ^~~~
...
gcc -shared -L/home/tdhock/lib/R/lib -L/home/tdhock/lib -Wl,-rpath=/home/tdhock/lib -L/home/tdhock/.local/share/r-miniconda/lib -Wl,-rpath=/home/tdhock/.local/share/r-miniconda/lib -o data.table.so assign.o between.o bmerge.o chmatch.o dogroups.o fastmean.o fcast.o fmelt.o forder.o frank.o fread.o freadR.o fsort.o fwrite.o fwriteR.o gsumm.o ijoin.o init.o inrange.o nqrecreateindices.o openmp-utils.o quickselect.o rbindlist.o reorder.o shift.o subset.o transpose.o uniqlist.o vecseq.o wrappers.o -fopenmp -L/home/tdhock/lib/R/lib -lR
between.o:(.bss+0x0): multiple definition of `char_integer64'
assign.o:(.bss+0x0): first defined here
between.o:(.bss+0x8): multiple definition of `char_ITime'
assign.o:(.bss+0x8): first defined here
between.o:(.bss+0x10): multiple definition of `char_IDate'
assign.o:(.bss+0x10): first defined here
between.o:(.bss+0x18): multiple definition of `char_Date'
assign.o:(.bss+0x18): first defined here
...
wrappers.o:(.bss+0x68): multiple definition of `sym_starts'
assign.o:(.bss+0x68): first defined here
wrappers.o:(.bss+0x70): multiple definition of `char_starts'
assign.o:(.bss+0x70): first defined here
wrappers.o:(.bss+0x78): multiple definition of `sym_maxgrpn'
assign.o:(.bss+0x78): first defined here
wrappers.o:(.bss+0x80): multiple definition of `NA_INT64_D'
assign.o:(.bss+0x80): first defined here
wrappers.o:(.bss+0x88): multiple definition of `NA_INT64_LL'
assign.o:(.bss+0x88): first defined here
wrappers.o:(.bss+0xa0): multiple definition of `sizes'
assign.o:(.bss+0xa0): first defined here
wrappers.o:(.bss+0x3c0): multiple definition of `SelfRefSymbol'
assign.o:(.bss+0x3c0): first defined here
wrappers.o:(.bss+0x3c8): multiple definition of `twiddle'
assign.o:(.bss+0x3c8): first defined here
/usr/bin/ld: warning: /home/tdhock/.local/share/r-miniconda/lib/libgomp.so: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
/usr/bin/ld: warning: /home/tdhock/.local/share/r-miniconda/lib/libgomp.so: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002
/usr/bin/ld: warning: /home/tdhock/.local/share/r-miniconda/lib/libgcc_s.so.1: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
/usr/bin/ld: warning: /home/tdhock/.local/share/r-miniconda/lib/libgcc_s.so.1: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002
/usr/bin/ld: warning: /home/tdhock/.local/share/r-miniconda/lib/libgcc_s.so.1: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
/usr/bin/ld: warning: /home/tdhock/.local/share/r-miniconda/lib/libgcc_s.so.1: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002
collect2: error: ld returned 1 exit status
/home/tdhock/lib/R/share/make/shlib.mk:10: recipe for target 'data.table.so' failed
make: *** [data.table.so] Error 1
ERROR: compilation failed for package ‘data.table’

MichaelChirico commented 10 months ago

looks like you're passing -Wstrict-prototypes as a compilation flag, try turning that off

tdhock commented 10 months ago

hi Michael, thanks for the advice. I had that defined in CFLAGS in ~/.R/Makevars. removing -Wstrict-prototypes does remove those warnings, but it does not fix the linker issues. Below I show the output, after having removed custom CFLAGS and LDFLAGS in ~/.R/Makevars.

(base) tdhock@maude-MacBookPro:~/R/data.table((no branch, bisect started on eed712ef))$ git checkout 1.11.8 && rm -f src/*.o && R CMD INSTALL .  
HEAD is now at 76bb569f 1.11.8 submitted to CRAN. Bump to 1.11.9
Loading required package: grDevices
* installing to library ‘/home/tdhock/lib/R/library’
* installing *source* package ‘data.table’ ...
** using staged installation
** libs
using C compiler: ‘gcc (GCC) 10.1.0’
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c assign.c -o assign.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c between.c -o between.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c bmerge.c -o bmerge.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c chmatch.c -o chmatch.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c dogroups.c -o dogroups.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fastmean.c -o fastmean.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fcast.c -o fcast.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fmelt.c -o fmelt.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c forder.c -o forder.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c frank.c -o frank.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fread.c -o fread.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c freadR.c -o freadR.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fsort.c -o fsort.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fwrite.c -o fwrite.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fwriteR.c -o fwriteR.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c gsumm.c -o gsumm.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c ijoin.c -o ijoin.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c init.c -o init.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c inrange.c -o inrange.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c nqrecreateindices.c -o nqrecreateindices.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c openmp-utils.c -o openmp-utils.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c quickselect.c -o quickselect.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c rbindlist.c -o rbindlist.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c reorder.c -o reorder.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c shift.c -o shift.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c subset.c -o subset.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c transpose.c -o transpose.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c uniqlist.c -o uniqlist.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c vecseq.c -o vecseq.o
gcc -I"/home/tdhock/lib/R/include" -DNDEBUG   -I/usr/local/include   -fopenmp -fpic  -g -O2  -c wrappers.c -o wrappers.o
gcc -shared -L/home/tdhock/lib/R/lib -L/usr/local/lib -o data.table.so assign.o between.o bmerge.o chmatch.o dogroups.o fastmean.o fcast.o fmelt.o forder.o frank.o fread.o freadR.o fsort.o fwrite.o fwriteR.o gsumm.o ijoin.o init.o inrange.o nqrecreateindices.o openmp-utils.o quickselect.o rbindlist.o reorder.o shift.o subset.o transpose.o uniqlist.o vecseq.o wrappers.o -fopenmp -L/home/tdhock/lib/R/lib -lR
between.o:/home/tdhock/R/data.table/src/data.table.h:95: multiple definition of `twiddle'
assign.o:/home/tdhock/R/data.table/src/data.table.h:95: first defined here
between.o:/home/tdhock/R/data.table/src/data.table.h:84: multiple definition of `SelfRefSymbol'
assign.o:/home/tdhock/R/data.table/src/data.table.h:84: first defined here
between.o:/home/tdhock/R/data.table/src/data.table.h:83: multiple definition of `sizes'
assign.o:/home/tdhock/R/data.table/src/data.table.h:83: first defined here
...
wrappers.o:/home/tdhock/R/data.table/src/data.table.h:59: multiple definition of `char_integer64'
assign.o:/home/tdhock/R/data.table/src/data.table.h:59: first defined here
collect2: error: ld returned 1 exit status
/home/tdhock/lib/R/share/make/shlib.mk:10: recipe for target 'data.table.so' failed
make: *** [data.table.so] Error 1
ERROR: compilation failed for package ‘data.table’

Does it work for you?

MichaelChirico commented 10 months ago

I get linker errors too, except I get 100s of them

jangorecki commented 10 months ago

The issue has been solved by newer versions of R. I think that if we cannot reproduce it on R 4+ we could just as well close the issue, rather than trying to examine the R code that might have fixed it, as long as results in R < 4 are correct and the issue is speed only.

tdhock commented 9 months ago

I am trying to reproduce this. I installed R-3.6.3 on Windows from https://cloud.r-project.org/bin/windows/base/old/3.6.3/, then installed Rtools35.exe from https://cran.r-project.org/bin/windows/Rtools/history.html, then put -std=c99 in my ~/.R/Makevars, but I got an error about mman.h not being found. Does anybody know how to fix that?

th798@cmp2986 ~/R/data.table
$ git checkout 1.11.6
Note: switching to '1.11.6'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at a4e26b50 1.11.6 on CRAN. Bump to 1.11.7

th798@cmp2986 ~/R/data.table
$ R CMD INSTALL .
During startup - Warning message:
Setting LC_CTYPE=en_US.UTF-8 failed
* installing to library 'C:/Users/th798/R/win-library/3.6'
* installing *source* package 'data.table' ...
** using staged installation
** libs

*** arch - i386
c:/Rtools/mingw_32/bin/gcc  -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG       -fopenmp   -Wformat-extra-args -std=c99 -c assign.c -o assign.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc  -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG       -fopenmp   -Wformat-extra-args -std=c99 -c between.c -o between.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc  -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG       -fopenmp   -Wformat-extra-args -std=c99 -c bmerge.c -o bmerge.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc  -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG       -fopenmp   -Wformat-extra-args -std=c99 -c chmatch.c -o chmatch.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc  -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG       -fopenmp   -Wformat-extra-args -std=c99 -c dogroups.c -o dogroups.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc  -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG       -fopenmp   -Wformat-extra-args -std=c99 -c fastmean.c -o fastmean.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc  -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG       -fopenmp   -Wformat-extra-args -std=c99 -c fcast.c -o fcast.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc  -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG       -fopenmp   -Wformat-extra-args -std=c99 -c fmelt.c -o fmelt.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc  -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG       -fopenmp   -Wformat-extra-args -std=c99 -c forder.c -o forder.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc  -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG       -fopenmp   -Wformat-extra-args -std=c99 -c frank.c -o frank.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
c:/Rtools/mingw_32/bin/gcc  -I"C:/PROGRA~1/R/R-36~1.3/include" -DNDEBUG       -fopenmp   -Wformat-extra-args -std=c99 -c fread.c -o fread.o
cc1.exe: warning: -Wformat-extra-args ignored without -Wformat [-Wformat-extra-args]
fread.c:14:33: fatal error: sys/mman.h: No such file or directory
   #include <sys/mman.h>  // mmap
                                 ^
compilation terminated.
make: *** [fread.o] Error 1
ERROR: compilation failed for package 'data.table'
* removing 'C:/Users/th798/R/win-library/3.6/data.table'
* restoring previous 'C:/Users/th798/R/win-library/3.6/data.table'

tdhock commented 9 months ago

actually this mman.h not found seems to be happening with recent R/rtools too, so I guess I need to figure that separate issue out first.


R Under development (unstable) (2023-11-26 r85638 ucrt) -- "Unsuffered Consequences"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> setwd('c:/Program Files/Emacs/x86_64/bin')
> install.packages("~/R/data.table",repos=NULL,type="source")
* installing *source* package 'data.table' ...
** using staged installation

   **********************************************
   WARNING: this package has a configure script
         It probably needs manual configuration
   **********************************************

** libs
using C compiler: 'gcc.exe (GCC) 12.2.0'
gcc  -I"c:/PROGRA~1/R/R-devel/include" -DNDEBUG     -I"C:/rtools43/x86_64-w64-mingw32.static.posix/include"  -fopenmp   -Wformat-extra-args -std=c99 -c fread.c -o fread.o
cc1.exe: warning: '-Wformat-extra-args' ignored without '-Wformat' [-Wformat-extra-args]
fread.c:16:12: fatal error: sys/mman.h: No such file or directory
   16 |   #include <sys/mman.h>  // mmap
      |            ^~~~~~~~~~~~
compilation terminated.
make: *** [c:/PROGRA~1/R/R-devel/etc/x64/Makeconf:282: fread.o] Error 1
ERROR: compilation failed for package 'data.table'
* removing 'C:/Program Files/R/R-devel/library/data.table'
* restoring previous 'C:/Program Files/R/R-devel/library/data.table'
Warning message:
In install.packages("~/R/data.table", repos = NULL, type = "source") :
  installation of package 'C:\Users\th798/R/data.table' had non-zero exit status
> Sys.which("gcc")
                                       gcc 
"C:\\rtools43\\X86_64~1.POS\\bin\\gcc.exe" 

tdhock commented 9 months ago

the mman.h not found error happens with -std=c99 flag (with current R, or old R).

tdhock commented 9 months ago

Using old R-3.6.3 and gcc 12.3.0 I still get those linker errors. Maybe to reproduce we need an older compiler?
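
One plausible explanation, offered as an assumption rather than a confirmed diagnosis: GCC 10 changed its default from -fcommon to -fno-common, so variables defined in a shared header (rather than declared extern) produce "multiple definition" linker errors like the ones shown earlier. If that is the cause, an older compiler, or adding -fcommon to CFLAGS in ~/.R/Makevars, might let the old versions link. The sketch below only classifies the compiler banners that appear in the logs of this thread; `gcc_major` and `defaults_to_no_common` are hypothetical helper names.

```r
# Extract the major version from a compiler banner such as "gcc (GCC) 10.1.0"
gcc_major <- function(banner) {
  version <- regmatches(banner, regexpr("[0-9]+\\.[0-9]+\\.[0-9]+", banner))
  as.integer(sub("\\..*$", "", version))
}

# GCC 10 and later default to -fno-common
defaults_to_no_common <- function(banner) gcc_major(banner) >= 10L

defaults_to_no_common("gcc (GCC) 10.1.0")      # compiler from the failing Linux build
defaults_to_no_common("gcc.exe (GCC) 12.2.0")  # compiler from the Rtools43 log
defaults_to_no_common("gcc (GCC) 4.9.3")       # an older compiler, for contrast
```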