Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

`[.data.table` is very slow with a single column #5650

Open MLopez-Ibanez opened 1 year ago

MLopez-Ibanez commented 1 year ago

# Minimal reproducible example; please be sure to set verbose=TRUE where possible!

library(bench)
x <- data.frame(a = runif(10000), b= as.character(runif(10000)))
index <- runif(10000) <= 0.5
mark(x[index, "a"], x[["a"]][index])
# A tibble: 2 × 13
#    expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#    <bch:expr>   <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 "x[index, \… 78.1µs 80.6µs    11587.    98.4KB     21.2  5469    10      472ms
# 2 "x[[\"a\"]]…   71µs 73.2µs    13240.    98.4KB     23.5  6207    11      469ms
# More or less the same!
library(data.table)
x <- as.data.table(x)
mark(x[index, a], x[["a"]][index])
# A tibble: 2 × 13
#   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 "x[index,… 360.4µs 377.1µs     2403.   219.3KB     8.28  1161     4      483ms
# 2 "x[[\"a\"…  71.4µs  73.9µs    12920.    98.4KB    23.9   5949    11      460ms
# Five times slower!!!

# Output of sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=es_ES.utf8        LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.2 bench_1.1.3      

loaded via a namespace (and not attached):
 [1] matrixStats_0.61.0 fansi_1.0.2        utf8_1.2.2         irace_3.5.1.9000  
 [5] R6_2.5.1           lifecycle_1.0.3    magrittr_2.0.2     pillar_1.9.0      
 [9] profmem_0.6.0      rlang_1.0.6        cli_3.6.0          vctrs_0.5.2       
[13] tools_4.1.2        glue_1.6.2         compiler_4.1.2     pkgconfig_2.0.3   
[17] tibble_3.2.1

ben-schwen commented 1 year ago

But you aren't even benchmarking the same queries on data.frame and data.table, since `x[index, a]` shouldn't even evaluate on a data.frame in the first place.

MLopez-Ibanez commented 1 year ago

> But you aren't even benchmarking the same queries on data.frame and data.table, since `x[index, a]` shouldn't even evaluate on a data.frame in the first place.

The first call to `mark()` is the one benchmarking data.frame; the second call benchmarks data.table. But OK, just in case of doubt:

library(bench)
df <- data.frame(a = runif(10000), b= as.character(runif(10000)))
index <- runif(10000) <= 0.5
library(data.table)
dt <- as.data.table(df)
mark(
  df[index, "a"],
  df[["a"]][index],
  dt[index, a],
  dt[["a"]][index])

Result:

# A tibble: 4 × 13
  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 "df[index…  78.3µs  81.1µs    11541.    98.3KB    21.7   5319    10      461ms
2 "df[[\"a\…  71.3µs  73.6µs    13177.    98.3KB    23.6   6132    11      465ms
3 "dt[index… 354.3µs 370.4µs     2597.   114.7KB     6.15  1267     3      488ms
4 "dt[[\"a\…  71.2µs    74µs    13165.    98.3KB    23.7   6115    11      464ms

Does data.table have a benchmarking testsuite to track regressions? Is development frozen? There has been no activity since February, and there are more than 1k open issues and 131 open PRs.

tdhock commented 1 year ago

> Does data.table have a benchmarking testsuite to track regressions?

No, but this is one of the goals for the next two years; we plan to hire someone to work on that full time.

jangorecki commented 1 year ago

I don't think there was a regression here anyway. Looking at the time units, I doubt it will ever get much attention from the team, since we would have to subset columns in a loop tens of thousands of times for the problem to be really noticeable, and that would be a rather uncommon use case.
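A rough way to see the scale being described (an editor's sketch, not part of the original comment; the data sizes mirror the benchmark above): at roughly 300 microseconds of extra per-call cost, only a loop with many thousands of single-column subsets accumulates a human-noticeable delay.

```r
library(data.table)

dt <- data.table(a = runif(1e4))
index <- runif(1e4) <= 0.5

# ~300 microseconds of overhead per call only matters in aggregate:
# 10,000 iterations x ~300 microseconds is on the order of 3 seconds.
system.time(for (i in 1:1e4) dt[index, a])      # pays [.data.table overhead each time
system.time(for (i in 1:1e4) dt[["a"]][index])  # plain list subsetting, far cheaper
```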

ben-schwen commented 1 year ago

> But you aren't even benchmarking the same queries on data.frame and data.table, since `x[index, a]` shouldn't even evaluate on a data.frame in the first place.

> The first call to `mark()` is the one benchmarking data.frame; the second call benchmarks data.table. But OK, just in case of doubt:

I'm not implying that the benchmarks are wrong; I'm just saying that `a != "a"`. If you benchmarked `dt[index, "a"]`, it would still be orders of magnitude slower than the same call on a data.frame, but it would also be faster than what you are doing now, namely comparing different calls.

MLopez-Ibanez commented 1 year ago

> I'm not implying that the benchmarks are wrong; I'm just saying that `a != "a"`. If you benchmarked `dt[index, "a"]`, it would still be orders of magnitude slower than the same call on a data.frame, but it would also be faster than what you are doing now, namely comparing different calls.

OK, following your suggestion:

library(bench)
df <- data.frame(a = runif(10000), b= as.character(runif(10000)))
index <- runif(10000) <= 0.5
library(data.table)
dt <- as.data.table(df)
mark(
  df[index, "a"],
  df[["a"]][index],
  dt[index, a],
  dt[["a"]][index],
  as.list(dt[index, "a"])$a)

The result is

# A tibble: 5 × 13
  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 "df[index…  78.3µs  80.8µs    11308.    97.7KB    19.1   5339     9      472ms
2 "df[[\"a\…  71.5µs  73.5µs    13192.    97.7KB    23.7   6113    11      463ms
3 "dt[index…   357µs   373µs     2629.   114.1KB     6.15  1283     3      488ms
4 "dt[[\"a\…  71.9µs  74.2µs    12873.    97.7KB    23.9   5928    11      461ms
5 "as.list(… 250.2µs   262µs     3729.   129.9KB    10.4   1795     5      481ms

so it is slightly faster (which is surprising, since this is not the recommended way in data.table's documentation and tutorials), but still 4-5 times slower than a data.frame.

> I don't think there was a regression here anyway. Looking at the time units, I doubt it will ever get much attention from the team, since we would have to subset columns in a loop tens of thousands of times for the problem to be really noticeable, and that would be a rather uncommon use case.

Is it really an uncommon case to select just one column within a tight loop? The only reason I noticed this is that `[.data.table` shows up in the profiles generated by profvis for my package (https://github.com/MLopez-Ibanez/irace/tree/datatable-clean), which does not happen for `[.data.frame`.

I understand that the primary use of data.table may be interactive and without loops. However, given the breadth of packages using data.table, I would not be surprised if people are using data.table within tight loops. It could be possible to collect profiling info from revdeps, filter it to keep only data.table functions, and see which functions are the most expensive. My bet would be that `[.data.table` is among the top ones (but there may be other slow and popular functions that I have not used).

In any case, given that there were already 1k issues reported before this one, I do not expect a fix anytime soon, but other users may find this information useful or contribute workarounds. I didn't know that dt[index, "a"] would be faster, so that is already useful to know.
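For readers hitting this in practice, a workaround consistent with the numbers above (an editor's sketch; `n_iter` and the loop body are invented for illustration) is to extract the column once outside the loop and index the plain vector inside it:

```r
library(data.table)

dt <- data.table(a = runif(1e4), b = as.character(runif(1e4)))
n_iter <- 1000L  # hypothetical iteration count

# Hoist the column out of the loop: one dispatch instead of n_iter dispatches.
col_a <- dt[["a"]]
total <- 0
for (i in seq_len(n_iter)) {
  index <- runif(1e4) <= 0.5
  total <- total + sum(col_a[index])  # no [.data.table call inside the loop
}
```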

TimTaylor commented 1 year ago

If speed is really important here, then you need to avoid method dispatch anyway. If you're only interested in the underlying column, you should be able to use .subset2() to get the column and then index it. E.g.

library(data.table)
df <- data.frame(a = runif(10000), b= as.character(runif(10000)))
index <- runif(10000) <= 0.5
dt <- as.data.table(df)
microbenchmark::microbenchmark(
  df[index, "a"],
  df[["a"]][index],
  .subset2(df, "a")[index],
  dt[index, a],
  dt[["a"]][index],
  as.list(dt[index, "a"])$a,
  .subset2(dt, "a")[index]
)
#> Unit: microseconds
#>                       expr     min       lq      mean   median       uq
#>             df[index, "a"]  60.989  81.5590  94.47888 100.6775 105.0240
#>           df[["a"]][index]  55.149  65.9050  85.58422  93.3225  96.5160
#>   .subset2(df, "a")[index]  49.975  81.0015  80.79162  85.2605  89.0560
#>               dt[index, a] 251.248 275.0495 311.04374 304.2420 312.0940
#>           dt[["a"]][index]  55.672  68.5170  85.95876  94.0135  97.7035
#>  as.list(dt[index, "a"])$a 172.928 217.2780 326.48526 226.8850 237.7520
#>   .subset2(dt, "a")[index]  48.631  77.8495  79.47604  85.7895  88.7160
#>       max neval
#>   120.672   100
#>   114.245   100
#>   103.070   100
#>   880.575   100
#>   112.374   100
#>  7082.201   100
#>    98.199   100

Created on 2023-06-16 with reprex v2.0.2

MLopez-Ibanez commented 1 year ago

Maybe one could gain 10 microseconds per call, but the confidence intervals appear to overlap, and it is easier to write dt[["a"]][index] than .subset2(dt, "a")[index].

The issue remains that the "natural" approach of replacing df[index, "a"] with dt[index, "a"], which is the intended use according to the documentation, produces a slowdown of about 200 microseconds per call on average (roughly 4x slower).

jangorecki commented 1 year ago

There is a PR of mine which already speeds up DT[index]; possibly it will work with column selection as well. You can try it out.

jangorecki commented 1 year ago

Not sure yet what is implemented so far, but these are related: for column selection https://github.com/Rdatatable/data.table/pull/4488, and for row selection https://github.com/Rdatatable/data.table/issues/3736 and https://github.com/Rdatatable/data.table/issues/4485

jangorecki commented 12 months ago

This is rather a problem of optimizing tight loops. If there are many iterations, then some care is needed. I documented that (among other things to be careful about in tight loops) in

https://github.com/Rdatatable/data.table/blob/21da019f63947c28f9cb93fdf8f275bd52461dd5/man/frollapply.Rd#L218-L225
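As a hedged illustration of the kind of tight-loop advice referenced above (an editor's sketch, not the documented example itself): `set()` is data.table's documented low-overhead alternative to `dt[i, col := value]` for assignment inside loops, since it skips the per-call dispatch and query-optimization cost of `[`.

```r
library(data.table)

dt <- data.table(x = numeric(1e4))

# set() assigns by reference without going through [.data.table,
# so the loop pays no dispatch or query-parsing cost per iteration.
for (i in seq_len(nrow(dt))) {
  set(dt, i = i, j = "x", value = i * 2)
}
```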

tdhock commented 12 months ago

> Does data.table have a benchmarking testsuite to track regressions?

Is this a regression? Has this code ever run faster in the past?

tdhock commented 12 months ago

I got this result (the attached plot image is not reproduced here) with this code:

library(data.table)
atime.list <- atime::atime(
  N=10^seq(3,8),
  setup={
    set.seed(1L)
    df <- data.frame(
      a = runif(N),
      b = as.character(runif(N))
    )
    index <- runif(N) <= 0.5
    dt <- as.data.table(df)
  },
  "df["=df[index, "a"],
  "df[["=df[["a"]][index],
  "dt["=dt[index, "a"],
  "dt[["=dt[["a"]][index],
  verbose=2)
plot(atime.list)

jangorecki commented 12 months ago

AFAIK it is not a regression. We could surely call it a regression if we compared to data.table v1.0.0: over the years, the interface for processing and optimizing user input has only grown, so the overhead added up. The question "does it matter?" is a matter of balance between common usage patterns and the absolute time added, and a use case like this one definitely does not get high priority.

It has been said multiple times in different places that, whenever possible, it is better to vectorize your work on a DT than to iteratively run many queries against it, precisely to avoid the overhead of `[`. This is the cost of the user friendliness that DT is packed with. There are some ideas, like with=FALSE, to reduce this overhead, which IMO are a step in the right direction.

The thing is that the timing differences of atomic operations are so tiny that this does not get higher priority. What would help is a complete use case where a user suffers at least multiple seconds from the overhead; but then we would probably be able to suggest a better way that does not suffer from it. To really resolve it at the root we would need to move `[` to C entirely, which IMO is not possible without making it more modular (#852).
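As a sketch of the vectorization advice (an editor's illustration; the column names are invented), a single grouped query amortizes the `[` overhead that a per-group loop pays on every iteration:

```r
library(data.table)

dt <- data.table(g = sample(letters, 1e5, replace = TRUE), v = runif(1e5))

# One [.data.table call per group: pays the query overhead repeatedly.
res_loop <- sapply(unique(dt$g), function(gr) dt[g == gr, mean(v)])

# A single grouped query: the overhead is paid once for all groups.
res_by <- dt[, .(m = mean(v)), by = g]
```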

MichaelChirico commented 12 months ago

#4687 is the FR for continuous benchmarking