Open MLopez-Ibanez opened 1 year ago
> But you aren't even benchmarking the same queries on `data.frame` and `data.table`, since `x[index, a]` shouldn't eval in the first place on a `data.frame`.
The first call to `bench::mark()` is the one benchmarking `data.frame`; the second call evaluates `data.table`. But OK, just in case of doubt:
```r
library(bench)
df <- data.frame(a = runif(10000), b = as.character(runif(10000)))
index <- runif(10000) <= 0.5
library(data.table)
dt <- as.data.table(df)
mark(
  df[index, "a"],
  df[["a"]][index],
  dt[index, a],
  dt[["a"]][index])
```
Result:
```
# A tibble: 4 × 13
  expression  min     median  `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>  <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 "df[index…   78.3µs  81.1µs    11541.    98.3KB    21.7   5319    10      461ms
2 "df[[\"a\…   71.3µs  73.6µs    13177.    98.3KB    23.6   6132    11      465ms
3 "dt[index…  354.3µs 370.4µs     2597.   114.7KB     6.15  1267     3      488ms
4 "dt[[\"a\…   71.2µs  74µs      13165.    98.3KB    23.7   6115    11      464ms
```
Does `data.table` have a benchmarking testsuite to track regressions? Is development frozen? There has been no activity since February, and there are more than 1k open issues and 131 PRs.
> Does data.table have a benchmarking testsuite to track regressions?

No, but this is one of the goals for the next 2 years; we plan to hire someone to work on that full time.
I don't think there was a regression here anyway. Looking at the time units, I doubt it will ever get much attention from the team, as we would have to subset columns in a loop tens of thousands of times for the problem to be really noticeable, and that would be a rather uncommon use case.
> But you aren't even benchmarking the same queries on `data.frame` and `data.table`, since `x[index, a]` shouldn't eval in the first place on a `data.frame`.
>
> The first call to `bench()` is the one benchmarking `data.frame`; the second call evaluates `data.table`. But OK, just in case of doubt:
I'm not implying that the benchmarks are wrong, I'm just saying that `a != "a"`. If you were to benchmark `dt[index, "a"]`, it would still be orders of magnitude slower than the same call on a `data.frame`, but it would also be faster than what you currently do, namely comparing different calls.
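To make the `a` vs `"a"` distinction concrete, here is a minimal sketch (toy data; assumes `data.table` is installed) showing that the two forms are different queries with different return types:

```r
library(data.table)

dt <- data.table(a = 1:4, b = letters[1:4])
index <- c(TRUE, FALSE, TRUE, FALSE)

# Unquoted symbol: j is evaluated within the table's scope and
# returns the plain column vector, subset by index.
v <- dt[index, a]

# Quoted string: j selects a column, so the result is a
# one-column data.table, not a vector.
d <- dt[index, "a"]

stopifnot(identical(v, c(1L, 3L)))
stopifnot(is.data.table(d), identical(d$a, v))
```

(On a `data.frame`, `df[index, a]` would instead look for a variable `a` in the calling environment, which is why it "shouldn't eval in the first place".)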
OK, following your suggestion:
```r
library(bench)
df <- data.frame(a = runif(10000), b = as.character(runif(10000)))
index <- runif(10000) <= 0.5
library(data.table)
dt <- as.data.table(df)
mark(
  df[index, "a"],
  df[["a"]][index],
  dt[index, a],
  dt[["a"]][index],
  as.list(dt[index, "a"])$a)
```
The result is:

```
# A tibble: 5 × 13
  expression  min     median  `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>  <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 "df[index…   78.3µs  80.8µs    11308.    97.7KB    19.1   5339     9      472ms
2 "df[[\"a\…   71.5µs  73.5µs    13192.    97.7KB    23.7   6113    11      463ms
3 "dt[index…  357µs    373µs      2629.   114.1KB     6.15  1283     3      488ms
4 "dt[[\"a\…   71.9µs  74.2µs    12873.    97.7KB    23.9   5928    11      461ms
5 "as.list(…  250.2µs  262µs      3729.   129.9KB    10.4   1795     5      481ms
```
so slightly faster (which is surprising, since this is not the recommended way in `data.table`'s documentation and tutorials), but still 4-5 times slower than a `data.frame`.
> I don't think there was a regression here anyway. Looking at the time units, I doubt it will ever get good attention from the team as we would have to subset columns in a loop tens of thousands of times for the problem to be really noticeable, and that would be a rather uncommon use case.
Is it really an uncommon case to select just one column within a tight loop? The only reason I noticed this is because `[.data.table` shows up in the profiles generated by `profvis` for my package (https://github.com/MLopez-Ibanez/irace/tree/datatable-clean), which does not happen for `[.data.frame`.
I understand that the primary use of `data.table` may be interactive and without loops. However, given the breadth of packages using `data.table`, I would not be surprised if people are using `data.table` within tight loops. It could be possible to collect profiling info in revdeps, filter it to keep only `data.table` functions, and see which functions are the most expensive ones. My bet would be that `[.data.table` is among the top ones (but there may be other slow and popular functions that I have not used).
In any case, given that there were already 1K issues reported before this one, I do not expect a fix anytime soon, but other users may find the information useful or contribute workarounds. I didn't know that `dt[index, "a"]` would be faster, so that is already useful to know.
If speed is really important here, then you need to avoid method dispatch anyway. If you're only interested in the underlying column, then you should be able to use `.subset2()` to get the column and then index it. E.g.
```r
library(data.table)
df <- data.frame(a = runif(10000), b = as.character(runif(10000)))
index <- runif(10000) <= 0.5
dt <- as.data.table(df)
microbenchmark::microbenchmark(
  df[index, "a"],
  df[["a"]][index],
  .subset2(df, "a")[index],
  dt[index, a],
  dt[["a"]][index],
  as.list(dt[index, "a"])$a,
  .subset2(dt, "a")[index]
)
```
```
#> Unit: microseconds
#>                       expr     min       lq      mean   median       uq      max neval
#>             df[index, "a"]  60.989  81.5590  94.47888 100.6775 105.0240  120.672   100
#>           df[["a"]][index]  55.149  65.9050  85.58422  93.3225  96.5160  114.245   100
#>   .subset2(df, "a")[index]  49.975  81.0015  80.79162  85.2605  89.0560  103.070   100
#>               dt[index, a] 251.248 275.0495 311.04374 304.2420 312.0940  880.575   100
#>           dt[["a"]][index]  55.672  68.5170  85.95876  94.0135  97.7035  112.374   100
#>  as.list(dt[index, "a"])$a 172.928 217.2780 326.48526 226.8850 237.7520 7082.201   100
#>   .subset2(dt, "a")[index]  48.631  77.8495  79.47604  85.7895  88.7160   98.199   100
```
Created on 2023-06-16 with reprex v2.0.2
Maybe one could gain 10 microseconds per call, but the confidence intervals appear to overlap, and it is easier to write `dt[["a"]][index]` than `.subset2(dt, "a")[index]`.
The issue remains that the "natural" approach of replacing `df[index, "a"]` with `dt[index, "a"]`, which is the intended use according to the documentation, will produce a slowdown of 200 microseconds per call on average (about 4× slower).
There is a PR of mine which already speeds up `DT[index]`; possibly it will work with column selection as well. You can try it out.
Not sure yet what is implemented so far, but those are related... for column selection https://github.com/Rdatatable/data.table/pull/4488, and for row selection https://github.com/Rdatatable/data.table/issues/3736 and https://github.com/Rdatatable/data.table/issues/4485
This is rather a problem of optimizing tight loops. If there are many iterations, then some care is needed. I documented that (amongst other things to be careful about in tight loops) in
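As an illustration of the tight-loop point, here is a small base-R sketch (a hypothetical toy loop; the same idea applies to `data.table`) of hoisting the column extraction out of the loop, so the `[` dispatch cost is paid once rather than on every iteration:

```r
df <- data.frame(a = runif(1000))

# Per-iteration dispatch: `[.data.frame` runs on every pass.
total_slow <- 0
for (i in seq_len(nrow(df))) {
  total_slow <- total_slow + df[i, "a"]
}

# Hoisted: extract the column vector once, then index the plain vector.
a <- df[["a"]]
total_fast <- 0
for (i in seq_along(a)) {
  total_fast <- total_fast + a[i]
}

stopifnot(isTRUE(all.equal(total_slow, total_fast)))
```

The two loops compute the same sum; only the second avoids repeated method dispatch inside the loop body.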
> Does data.table have a benchmarking testsuite to track regressions?

Is this a regression? Has this code ever worked faster in the past?
I got this result with this code:
```r
library(data.table)
atime.list <- atime::atime(
  N = 10^seq(3, 8),
  setup = {
    set.seed(1L)
    df <- data.frame(
      a = runif(N),
      b = as.character(runif(N))
    )
    index <- runif(N) <= 0.5
    dt <- as.data.table(df)
  },
  "df[" = df[index, "a"],
  "df[[" = df[["a"]][index],
  "dt[" = dt[index, "a"],
  "dt[[" = dt[["a"]][index],
  verbose = 2)
plot(atime.list)
```
AFAIK it is not a regression. We could surely call it a regression if we compared to data.table v1.0.0: over the years, the interface for processing and optimizing user input has only grown, so the overhead has been adding up. The question "does it matter?" is a matter of balance between common usage patterns and the absolute time added, and a use case like this definitely does not get high priority.
It has been said multiple times in different places that it is better to vectorize your work on a DT rather than iteratively run many queries against it, whenever possible, to avoid the overhead of `[`. This is the cost of the user friendliness that DT is packed with. There are some ideas, like `with=FALSE`, to avoid such overhead, which IMO are a step in the right direction. The thing is that the timing differences of atomic operations are so tiny that this does not get higher priority.
What would help is a complete use case where a user suffers at least multiple seconds from this. But then we would probably be able to present a better way that does not suffer from the overhead.
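A sketch of what "vectorize rather than loop over queries" looks like in practice (a hypothetical toy table with a grouping column `g` and a value column `v`; assumes `data.table` is installed):

```r
library(data.table)

dt <- data.table(g = rep(c("x", "y"), each = 5), v = 1:10)

# Anti-pattern: one `[` call per group, paying the query overhead each time.
res_loop <- sapply(unique(dt$g), function(grp) sum(dt[g == grp, v]))

# Preferred: a single grouped query inside one `[` call.
res_by <- dt[, .(s = sum(v)), by = g]

stopifnot(isTRUE(all.equal(unname(res_loop), res_by$s)))
```

Both produce the per-group sums, but the grouped query pays the `[` overhead once instead of once per group.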
To really resolve it at the root, we need to move `[` to C entirely, which IMO is not possible without making it more modular: #852.