Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

join operation almost 2 times slower #3928

Open vk111 opened 4 years ago

vk111 commented 4 years ago

Hi, I was testing data.table_1.10.4-3 + R version 3.4.0 (2017-04-21) vs data.table_1.12.2 + R version 3.6.1 (2019-07-05),

and noticed that the join operation is almost 2 times slower in the new version of data.table (or is it R?). I think it mostly depends on the version of data.table.

rm(list=ls())

library(data.table)
library(tictoc)

aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))

#aa[,  a1 := as.character(a)]
#bb[,  a1 := as.character(a)]

setindex(aa, a)
setindex(bb, a)

tic()
for(i in c(1:1000)) {
 # aa[bb, b := i.b, on=.(a, a1)] # test1
  aa[bb, b := i.b, on=.(a)] # test2
}
toc()

# test1
# 3.6.1: 5.5 sec with index
# 3.6.1: 6.87 sec without index
# 3.4.0: 3.02 sec (no index)

# test2
# 3.6.1: 3.82 sec with index
# 3.6.1: 4.44 sec without index
# 3.4.0: 2.48 sec (no index)

MichaelChirico commented 4 years ago

Have run the following:

# run_join_time.sh
echo 'hash,time' > 'join_timings.csv'

for HASH in $(git rev-list --ancestry-path 1.10.0..1.12.4 | awk 'NR == 1 || NR % 10 == 0');
do git checkout $HASH && R CMD INSTALL . && Rscript time_join.R $HASH;
done
# time_join.R
library(data.table)
library(microbenchmark)

hash = commandArgs(trailingOnly = TRUE)

aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))

fwrite(data.table(
  hash = hash,
  time = microbenchmark(times = 1000L, aa[bb, b := i.b, on=.(a)])$time
), 'join_timings.csv', append = TRUE)

Then did the following to create this plot

[Plot: boxplots of join execution time (ms, log scale) per commit from 1.10.0 to 1.12.4, with release tags marked by red dashed lines]

library(data.table)
times = fread('join_timings.csv')

times[ , date := system(sprintf("git show --no-patch --no-notes --pretty='%%cd' %s", .BY$hash), intern = TRUE), by = hash]
times[ , date := as.POSIXct(date, format = '%a %b %d %T %Y %z', tz = 'UTC')]
times[ , date := as.IDate(date)]
setorder(times, date)

tags = system('
for TAG in $(git tag);
  do git show --no-patch --no-notes --pretty="$TAG\t%cd" $TAG;
done
', intern = TRUE)

tags = setDT(tstrsplit(tags, '\t'))
setnames(tags, c('tag', 'date'))
tags[ , date := as.POSIXct(date, format = '%a %b %d %T %Y %z', tz = 'UTC')]
tags[ , date := as.IDate(date)]

times[ , {
  par(mar = c(7.1, 4.1, 2.1, 2.1))
  y = boxplot(I(time/1e6) ~ date, log = 'y', notch = TRUE, pch = '.',
              las = 2L, xlab = '', ylab = 'Execution time (ms)', cex.axis = .8)
  x = unique(date)
  idx = findInterval(tags$date, x)
  abline(v = idx[idx>0L], lty = 2L, col = 'red')
  text(idx[idx>0L], .7*10^(par('usr')[4L]), tags$tag[idx>0L], 
       col = 'red', srt = 90)
  NULL
}]

After starting I realized I could streamline this a bit by using

echo 'hash,time,date' > 'join_timings.csv'

for HASH in $(git rev-list --ancestry-path 1.10.0..1.12.4 | awk 'NR == 1 || NR % 10 == 0');
do git checkout $HASH && R CMD INSTALL . --preclean && Rscript time_join.R $HASH $(git show --no-patch --no-notes --pretty='%cd' $HASH);
done
library(data.table)
library(microbenchmark)

args = commandArgs(trailingOnly = TRUE)
hash = args[1L]
# the unquoted %cd date arrives as several arguments; rejoin before parsing
date = as.POSIXct(paste(args[-1L], collapse = ' '), format = '%a %b %d %T %Y %z', tz = 'UTC')

aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))

fwrite(data.table(
  hash = hash,
  date = date,
  time = microbenchmark(times = 1000L, aa[bb, b := i.b, on=.(a)])$time
), 'join_timings.csv', append = TRUE)

to add the commit time directly to the benchmark

join_timings.txt

vk111 commented 4 years ago

Hi MichaelChirico, I have read it twice, but what exactly does the timeline from 2016 to 2019 mean? Is it the data.table version?

MichaelChirico commented 4 years ago

Not just versions but specific commits. Let me try to add some demarcation of versions; a good idea I forgot to include.

jangorecki commented 4 years ago

@MichaelChirico your graph looks super cool; what was missing was an interpretation of it :) Guessing, I would say that the new parallel forder is slower for an input of length 100 than it was before. Which IMO is fine, because these are still microseconds. It is of course worth reducing the overhead, but the priority should be reducing resource use in the most common usage scenarios. Looping 1000 times over a length-100 input is definitely not one of those.
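One way to test that guess is to time the same loop single- vs multi-threaded (a minimal sketch; running 1 thread vs an assumed 4, matching the reporter's core count):

library(data.table)

aa = data.table(a = 1:100, b = 0)
bb = data.table(a = 1:100, b = 1)

for (th in c(1L, 4L)) {
  setDTthreads(th)  # restrict data.table's thread pool
  cat(th, "thread(s):",
      system.time(for (i in 1:1000) aa[bb, b := i.b, on = .(a)])[["elapsed"]],
      "sec elapsed\n")
}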

MichaelChirico commented 4 years ago

That was my guess too after looking at the 1.12.0 NEWS. Will try re-running single-threaded.

vk111 commented 4 years ago

@jangorecki, if you do the simple math it's thousands of microseconds, i.e. milliseconds. The inspiration for this test was my current project, where I join tables with 40 000 rows 7-8 times. Before it took 18 seconds; after, it takes 40 seconds. My test also includes a character key. I installed 3.6 + 1.12 and then rolled back to 3.4 + 1.10. I think this is not a normal situation; that's why this issue is here.
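For concreteness, the arithmetic behind "milliseconds, not microseconds", using the test1 timings from the first post:

old = 3.02 / 1000  # sec per join, R 3.4.0 + data.table 1.10.4-3, no index
new = 6.87 / 1000  # sec per join, R 3.6.1 + data.table 1.12.2, no index
(new - old) * 1000 # ~3.85 ms of extra overhead per join, compounding over repeated joins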

mattdowle commented 4 years ago

Yes the graph is really awesome; we should do more of this and automate it.

But the original report compares R 3.4 to R 3.6, which has R 3.5 in the middle. We know that R 3.5 impacted data.table via ALTREP's change of REAL(), INTEGER() from macros to (slightly heavier) functions.

@vk111 Can you please rerun your tests with v1.12.4, now on CRAN. Any better? We have been taking the R API outside loops to cope with R 3.5+. As a next step, it would be great to know how you find v1.12.4. Can you also include more information about your machine, e.g. operating system, RAM and number of CPUs?
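For reference, these details can be collected from within R itself (a minimal sketch):

sessionInfo()               # R version and operating system
parallel::detectCores()     # number of logical CPUs
data.table::getDTthreads()  # threads data.table will actually use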

vk111 commented 4 years ago

@matt, will do

vk111 commented 4 years ago

Hi @mattdowle ,

I've updated to 1.12.4 (+ R 3.6.1); basically the numbers are the same for the unit test and my function. As a benchmark I will consider here only the joins on character columns without indexing. Unit test (as mentioned in the first post): 3.6.1: 6.87 sec without index; 3.4.0: 3.02 sec (no index).

My function from the project remains the same: 20 vs 40 seconds. It consists of only ~14 data.table joins of two tables, 30 000 rows (dt1) and 366 000 rows (dt2), 15 columns each; the key is 2 character columns plus 1 numeric, and 7 fields are assigned (from dt1 to dt2) during the join.

My PC: Win 10 64-bit, AMD A8 2.2 GHz, 8 GB RAM

cmd: wmic cpu get NumberOfCores, NumberOfLogicalProcessors /Format:List
NumberOfCores=4
NumberOfLogicalProcessors=4

Thank you for your analysis

vk111 commented 4 years ago

Hi @mattdowle, so are you considering this?

MichaelChirico commented 4 years ago

According to a git bisect I just re-ran, this commit is "guilty" of the slowdown; hopefully that can help Matt pin down the root cause. That it's related to forder makes me relatively confident we've pinned it down correctly.

https://github.com/Rdatatable/data.table/commit/e59ba149f21bb35098ed0b527505d5ed6c5cb0db

More precisely:

# run_join_time.sh
R CMD INSTALL . && Rscript time_join.R
# time_join.R
library(data.table)
library(microbenchmark)

aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))

summary(microbenchmark(times = 1000L, aa[bb, b := i.b, on=.(a)])$time/1e6)

With these two scripts, from master/HEAD:

git checkout 1.12.0
git bisect start
git bisect bad
git bisect good 1.11.8

Then at each iteration, run sh run_join_time.sh and check the result. The bad commits were taking roughly 0.9 seconds per iteration on my machine, the good commits about 0.7.
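The per-commit check could also be automated with git bisect run and a small script along these lines (a hypothetical bisect_check.R; the 0.8 ms cutoff is an assumption halfway between the good and bad timings above):

# run as: git bisect run sh -c 'R CMD INSTALL . && Rscript bisect_check.R'
library(data.table)
library(microbenchmark)

aa = data.table(a = 1:100, b = 0)
bb = data.table(a = 1:100, b = 1)

med = median(microbenchmark(times = 1000L, aa[bb, b := i.b, on = .(a)])$time) / 1e6
quit(save = "no", status = if (med < 0.8) 0L else 1L)  # exit 0 = good commit, 1 = bad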

vk111 commented 4 years ago

Hi Michael, thank you, but performance has dropped dramatically, especially for a character key; I provided the unit test in the very first post. Anyway, I think we need to wait for confirmation from Matt on whether it's fixed in the new version or not.

jangorecki commented 4 years ago

@vk111 if this issue is not closed, then you can safely assume the problem is not yet fixed. Whenever we fix something, we close the related issue.

ColeMiller1 commented 4 years ago

@vk111 you may be able to mitigate this with some minor tweaks, using setkey instead of setindex. This cuts your test time by more than half on my computer:

library(data.table)
library(tictoc)

aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))

aa[,  a1 := as.character(a)]
bb[,  a1 := as.character(a)]

tic()
for(i in c(1:1000)) {
  aa[bb, b := i.b, on=.(a, a1)] # test1 without key
}
toc()
## 4 sec elapsed

tic()
setkey(aa, a, a1)
setkey(bb, a, a1)

for(i in c(1:1000)) {
  aa[bb, b := i.b] # test2 with key
}
toc()
## 1.46 sec elapsed

[Image: microbenchmark boxplot comparing the five join variants timed below]

library(microbenchmark)
library(data.table)
library(tictoc)

aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))

aa[,  a1 := as.character(a)]
bb[,  a1 := as.character(a)]

aa_nokey = copy(aa)
bb_nokey = copy(bb)

tic()
aa_int_key = setkey(copy(aa), a)
bb_int_key = setkey(copy(bb), a)
toc()

tic()
aa_int_char = setkey(copy(aa), a, a1)
bb_int_char = setkey(copy(bb), a, a1)
toc()

int_nokey = function(dt1, dt2){
  dt1[dt2, b := i.b, on=.(a)]
}

int_key = function(dt1, dt2){
  dt1[dt2, b := i.b]
}

int_char_nokey = function(dt1, dt2){
  dt1[dt2, b := i.b, on=.(a, a1)]
}

int_char_key = function(dt1, dt2){
  dt1[dt2, b := i.b]
}

int_char_key_on = function(dt1, dt2){
  dt1[dt2, b := i.b, on=.(a, a1)]
}

boxplot(
  microbenchmark(
    a_no_key = int_nokey(aa, bb),
    a_key = int_key(aa_int_key, bb_int_key),
    a_a1_no_key = int_char_nokey(aa, bb),
    a_a1_key = int_char_key(aa_int_char, bb_int_char),
    a_a1_key_on = int_char_key_on(aa_int_char, bb_int_char)
  )
)

vk111 commented 4 years ago

Hi Cole, yes, I know about setkey. I gave a short description of my project above: there is no case of 1000 joins, rather 10-20 joins, and the thing is that my character strings are very dynamic, so I need to call setkey before every join; roughly 20 setkey calls and 20 joins, so the gain in time is less than 2x. I am living with 3.4.0 for a while as a general workaround for this situation, and it's probably not a big deal to fix it in the new version, especially since a guilty commit has been found. By the way, comparing apples to apples, I think that 3.4 (1.10) would still be faster than 3.6 (1.12) if I used the same unit test with setkey for both versions. Happy holidays.
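For what it's worth, an apples-to-apples sketch would pay the setkey cost inside the loop, since in the real workflow the keys change before every join (a minimal sketch, same toy tables as above):

library(data.table)
library(tictoc)

aa = data.table(a = 1:100, b = 0)[, a1 := as.character(a)]
bb = data.table(a = 1:100, b = 1)[, a1 := as.character(a)]

tic()
for (i in 1:1000) {
  setkey(aa, a, a1)  # re-key every iteration, as the dynamic strings require
  setkey(bb, a, a1)
  aa[bb, b := i.b]
}
toc()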

jangorecki commented 4 years ago

There is a pending PR #4370 which is related. It can make a join with less overhead than [ normally imposes. It reduces the timing of your example by a factor of 2. The bigger the data, the smaller the improvement; but on the other hand, the more iterations, the bigger the improvement. Note that there is no way to update a column during the join: you basically have to add columns during the join and then use something like fcoalesce, but that should be performant (see the sketch after the timings below). It should get some extra speed-up once #4386 is also merged.

library(data.table)
aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))
setindex(aa, a)
setindex(bb, a)
p = proc.time()[[3L]]
for (i in c(1:1000)) {
  aa[bb, b := i.b, on=.(a)]
}
proc.time()[[3L]]-p
#[1] 1.1

mergelist #4370

library(data.table)
aa = data.table(a = seq(1,100), b = rep(0, 100))
bb = data.table(a = seq(1,100), b = rep(1, 100))
setindex(aa, a)
setindex(bb, a)
p = proc.time()[[3L]]
for (i in c(1:1000)) {
  mergelist(list(aa, bb), on="a")
}
proc.time()[[3L]]-p
#[1] 0.696
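
A rough sketch of the add-then-coalesce pattern described above (mergelist comes from the still-pending PR #4370 so its interface may change, and the b2 column name is just for illustration); this emulates the update-join aa[bb, b := i.b]:

library(data.table)  # a development build that includes PR #4370's mergelist
aa = data.table(a = 1:100, b = 0)
bb = data.table(a = 1:100, b2 = 1)
res = mergelist(list(aa, bb), on = "a")     # join brings b2 in alongside b
res[, b := fcoalesce(b2, b)][, b2 := NULL]  # unmatched rows have NA b2, so b is kept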

vk111 commented 4 years ago

Hi Jan, OK, thanks.

"The bigger the data, the smaller the improvement, but on the other hand, the more iterations, the better improvement" - the real issue is related to big data and a few iterations, as I wrote above.

jangorecki commented 4 years ago

I checked whether #4440 (merged yesterday) resolves this issue, but it looks like it doesn't. The timings I am getting here are quite different from the timings reported by @vk111, so if you are able to test the recent development version and see whether it makes any difference for your workflow, that would be helpful. As you are on Windows, there is another pending PR that could eventually help: #4558.


This is the common script I run in each environment:

library(data.table)
setDTthreads(2) ## vk111 has 4 cores
aa = data.table(a = seq(1,100), b = rep(0, 100))[,  a1 := as.character(a)]
bb = data.table(a = seq(1,100), b = rep(1, 100))[,  a1 := as.character(a)]
setindex(aa, a); setindex(bb, a)
cat("test1\n")
system.time({
for(i in c(1:1000)) {
  aa[bb, b := i.b, on=.(a, a1)] # test1
}})
cat("test2\n")
system.time({
for(i in c(1:1000)) {
  aa[bb, b := i.b, on=.(a)] # test2
}})

And these are environments:

docker run -it --rm r-base:3.4.0
install.packages("https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.10.4-3.tar.gz", repos=NULL)
test1
   user  system elapsed 
  0.949   0.007   0.957 
test2
   user  system elapsed 
  0.864   0.000   0.864 
docker run -it --rm r-base:3.6.1
install.packages("https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.12.2.tar.gz", repos=NULL)
test1
   user  system elapsed 
  2.598   0.047   1.350 
test2
   user  system elapsed 
  2.092   0.016   1.062 
docker run -it --rm r-base:3.6.1
install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table")
test1
   user  system elapsed 
  1.174   0.045   1.219 
test2
   user  system elapsed 
  0.981   0.024   1.004 

vk111 commented 4 years ago

Hi, I am on Windows

jangorecki commented 4 years ago

@vk111 Yes, exactly what I was asking about. If you could check those 3 configurations on your Windows machine and see whether there is any progress for you.

vk111 commented 4 years ago

But as I understood, the PR for Windows is not merged yet.

jangorecki commented 4 years ago

#4440 is already merged and helps on every OS. #4558 is still pending merge.

@vk111 #4558 has now been merged as well.

vk111 commented 4 years ago

Hi Jan, I took your advice recently in #4419, which was related to my issue, so I no longer have this exact piece of code with the character joins. But anyway, thank you for this improvement; I am sure it will help a lot of people in the future. If you believe it gives better performance, then I guess you should just merge it. Thank you.

jangorecki commented 4 years ago

@vk111 The mentioned PRs are already merged. The question now is whether this issue got resolved, or at least improved. Those merged PRs were not attempting to resolve this issue, but they are related, so they might have helped.

jangorecki commented 4 years ago

It happens that the initially reported minimal example exposed the issue poorly. Based on @vk111's https://github.com/Rdatatable/data.table/issues/3928#issuecomment-538681316 I constructed an example which clearly shows there is an issue here.

install.packages("https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.10.4-3.tar.gz", repos=NULL)
install.packages("https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.12.2.tar.gz", repos=NULL)
install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table")
library(data.table)
setDTthreads(2L)
set.seed(108)
n2 = 1:366000
n1 = sample(n2, 30000)
d1=data.table(id1=paste0("id",sample(n1)), id2=paste0("id",sample(n1)), v1=n1)
d2=data.table(id1=paste0("id",sample(n2)), id2=paste0("id",sample(n2)), v2=n2)
system.time(d2[d1, v1 := i.v1, on=c("id1","id2")])
# 3.4.0: 1.10.4-3
   user  system elapsed 
  0.144   0.000   0.144 
# 3.4.0: 1.12.2
   user  system elapsed 
  0.611   0.003   0.451 
# 3.4.0: 1.12.9
   user  system elapsed 
  0.611   0.000   0.452 
# 4.0.0: 1.10.4-3
   user  system elapsed 
  0.203   0.008   0.211 
# 4.0.0: 1.12.9
   user  system elapsed 
  0.812   0.004   0.644 

verbose shows that most of the time is spent in "Calculated ad hoc index in 0.611s elapsed (0.769s cpu)", which is the call to forderv from bmerge.
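The breakdown can be reproduced by re-running the same join with verbose = TRUE (using the d1/d2 built above):

d2[d1, v1 := i.v1, on = c("id1","id2"), verbose = TRUE]
# the output includes the "Calculated ad hoc index ..." timing line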

vk111 commented 4 years ago

Yes, exactly. But it seems there is no improvement in these tests?

jangorecki commented 4 years ago

No improvement; this is still an issue. After some digging I found the bottleneck is this call: https://github.com/Rdatatable/data.table/blob/ba32f3cba38ec270587e395f6e6c26a80be36be6/src/forder.c#L282. Going further, this loop is responsible for 60% of it: https://github.com/Rdatatable/data.table/blob/ba32f3cba38ec270587e395f6e6c26a80be36be6/src/forder.c#L242-L245, and these lines for 30%: https://github.com/Rdatatable/data.table/blob/ba32f3cba38ec270587e395f6e6c26a80be36be6/src/forder.c#L252-L258

I tried checking whether using R internals (CHAR() as a macro rather than a function) would help here, but it does not.


An even more minimal example, timing forderv only, on R 4.0 only:

install.packages("https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.10.4-3.tar.gz", repos=NULL)
library(data.table)
setDTthreads(1L)
set.seed(108)
n2 = 1:366000
d2=data.table(id1=paste0("id",sample(n2)), id2=paste0("id",sample(n2)), v2=n2)
system.time(data.table:::forderv(d2, c("id1","id2")))
install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table")
library(data.table)
setDTthreads(1L)
set.seed(108)
n2 = 1:366000
d2=data.table(id1=paste0("id",sample(n2)), id2=paste0("id",sample(n2)), v2=n2)
system.time(data.table:::forderv(d2, c("id1","id2")))
# 4.0.2: 1.10.4-3
   user  system elapsed 
  0.198   0.000   0.198 
# 4.0.2: 1.12.9
   user  system elapsed 
  0.393   0.000   0.392 

mattdowle commented 4 years ago

I tried checking if using R internals (CHAR() a macro rather than function) would help here, but it does not help.

That's because R doesn't do that anymore. With USE_RINTERNALS, INTEGER, REAL, and I assume CHAR too, are still inline functions. In other words, the benefit (i.e. macro instead of function) of USE_RINTERNALS was removed in recent versions of R to permit ALTREP. I don't believe it is really whether it is a macro, function, or inline function, per se, that makes the difference; it's what R is doing inside that unit of work. In the old days, USE_RINTERNALS made it a macro that was a direct memory lookup (an array dereference just taking account of the vector's header), which is why it was faster: it was a very lightweight unit of work. These days, R has at least one branch to check for ALTREP and, even with branch prediction, that extra work adds up and/or prevents prefetch. In other words, R now has a little bit more code inside these inline functions. I've no benchmark evidence, just theory based on looking at it.

jangorecki commented 3 years ago

The root cause might be the removal of the following branch in cradix_r: https://github.com/Rdatatable/data.table/blob/76bb569fd7736b5f89471a35357e6a971ae1d424/src/forder.c#L780-L783; now, for an input of length 2, there is no short-circuit. It is not obvious how to reinstate that branch because the function has changed a lot in terms of how it operates.
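
For intuition only, the removed short-circuit amounted to something like the following R-level sketch (the real code is byte-wise radix in C, so this is the concept, not the implementation): for an input of length 2, a single comparison settles the order and no radix recursion is needed.

# conceptual R illustration of the removed length-2 short-circuit
cradix_len2 = function(x) {
  stopifnot(length(x) == 2L)
  if (x[1L] > x[2L]) x[2:1] else x  # one comparison instead of full radix passes
}
cradix_len2(c("id2", "id1"))
# [1] "id1" "id2"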