Open matthiasgomolka opened 3 years ago
Interestingly, this issue does not occur on my Linux machine (see below). I'll investigate further.
library(data.table)
library(dplyr)
#>
#> Attache Paket: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(microbenchmark)
dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))
microbenchmark(
data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = 1L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 34.59378 34.59378 34.59378 34.59378 34.59378 34.59378 1
#> dplyr 34.98623 34.98623 34.98623 34.98623 34.98623 34.98623 1
setkey(dfa_char, id)
setkey(dfb_char, id)
microbenchmark(
data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = 1L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 3.570949 3.570949 3.570949 3.570949 3.570949 3.570949 1
#> dplyr 14.952365 14.952365 14.952365 14.952365 14.952365 14.952365 1
dfa_int <- data.table(id = seq_len(10000))
dfb_int <- data.table(id = seq_len(100000), val = rnorm(100000))
microbenchmark(
data.table = merge.data.table(dfa_int, dfb_int, by = "id", sort = FALSE),
dplyr = inner_join(dfa_int, dfb_int, by = "id"),
times = 1L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 2.410635 2.410635 2.410635 2.410635 2.410635 2.410635 1
#> dplyr 8.584089 8.584089 8.584089 8.584089 8.584089 8.584089 1
Created on 2020-10-02 by the reprex package (v0.3.0)
Could you use times=2
? It may be caused by https://github.com/Rdatatable/data.table/issues/2912 which on Windows can be even more severe.
Also note that merge.data.table
just wraps around [
so for benchmarking the latter should be preferred.
dfa_char[dfb_char, on = "id", nomatch = NULL]
Reinstalling all packages did not have an effect on Windows. Is this related to the old R version (3.6.1)?
Reinstalling won't make difference.
Did you try times=2L
, does the timings differ?
Also note that there is an open issue that your issue would probably be a duplicate of https://github.com/Rdatatable/data.table/issues/3928
Sorry, somehow missed your first comment before. Here are the benchmarks with times = 2
:
library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.3
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(microbenchmark)
#> Warning: package 'microbenchmark' was built under R version 3.6.2
n = 2L # run benchmarks twice
dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))
microbenchmark(
data.table = dfa_char[dfb_char, on = "id", nomatch = NULL],
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = n
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 1698.4893 1698.4893 1758.1653 1758.1653 1817.8414 1817.8414 2
#> dplyr 41.1121 41.1121 44.3698 44.3698 47.6275 47.6275 2
Created on 2020-10-02 by the reprex package (v0.3.0)
I reinstalled the packages because I once had an issue with packages compiled for another R version...
I have Windows 10 / data.table 1.13.0 / R 4.0.2. dplyr
is somewhat faster for character fields on my machine but not the large difference that @Plebejer is experiencing.
library(data.table)
library(dplyr)
library(microbenchmark)
n = 10L # run benchmarks twice
dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))
microbenchmark(
data.table = dfa_char[dfb_char, on = "id", nomatch = NULL],
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = n
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 55.1987 59.1335 64.47342 60.95335 63.0049 86.6976 10
#> dplyr 36.0711 36.8935 43.36965 42.81990 48.2768 54.2212 10
I also tried it with varying threads between 1-8. There were no large effects on timings.
According to my test:
Using setkey
on both dfa
and dfb
changes the dynamic and makes data.table
about 3x faster than dplyr
on macOS.
library(data.table)
library(dplyr)
library(microbenchmark)
n = 2L
dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))
microbenchmark(
data.table = dfa_char[dfb_char, on = "id", nomatch = NULL],
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = n
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table 36.99652 36.99652 42.68058 42.68058 48.36465 48.36465 2
# dplyr 16.11978 16.11978 29.98812 29.98812 43.85645 43.85645 2
setkey(dfa_char, id)
setkey(dfb_char, id)
microbenchmark(
data.table = dfa_char[dfb_char, on = "id", nomatch = NULL],
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = n
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table 5.618789 5.618789 6.102372 6.102372 6.585954 6.585954 2
# dplyr 15.181681 15.181681 17.706647 17.706647 20.231614 20.231614 2
sessionInfo
R version 4.0.2 Patched (2020-09-21 r79235)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.4-7 dplyr_1.0.0 data.table_1.13.1
[4] pacman_0.5.1
loaded via a namespace (and not attached):
[1] crayon_1.3.4 R6_2.4.1 lifecycle_0.2.0 magrittr_1.5
[5] pillar_1.4.6 rlang_0.4.7 vctrs_0.3.2 generics_0.0.2
[9] ellipsis_0.3.1 glue_1.4.1 purrr_0.3.4 compiler_4.0.2
[13] pkgconfig_2.0.3 tidyselect_1.1.0 tibble_3.0.3
It is because we don't have to find order of a character column anymore. Issue I linked describes the problem. I might have found a root cause of it already and shared that to Matt over slack. If we will not have it resolved for the release that is coming now, then I hope the one after will have this fix.
I described root cause in linked issue. While I think this issue is a duplicate of the other one, it is more safe to keep it open to ensure it will be resolved as well.
Also interesting finding by @shrektan is that R must have improved character handling on windows in R 4.0.
I noticed that merging data.tables on a
character
column is surprisingly slow (compared todplyr
). See the reprex below.Am I missing something obvious or is there room for improvement? Don't get me wrong: I love
data.table
and use it daily, but I was very surprised that this kind of merge is so comparatively slow.Minimal reproducible example
Created on 2020-10-02 by the reprex package (v0.3.0)
Output of
sessionInfo()
PS: Not sure if this is already adressed by #4538.