egillax opened 1 year ago
This script is helpful; it reproduces the issue quite clearly. The issue seems to be squarely in the join code: a flame chart shows that significantly more time is spent both building and searching the hash table. The size of the key shouldn't really matter here, since these operations work on the hash, which is always 32-bit.
More concretely, the `arrow::compute::SwissTable::search_block` function is called an order of magnitude more times for numeric keys. Based on my (admittedly rough) understanding, I believe this points to a less even distribution across hash buckets when working with numeric data.
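As a sketch of why the bucket distribution could be worse for doubles (a hypothesis illustrated with plain R, not something verified against the SwissTable hashing code in C++): integer-valued doubles carry far less byte-level entropy than the same values as `int32`, because IEEE-754 stores them with a nearly constant exponent and a mantissa that is mostly zeros, so a hash with weak mixing has less to work with:

```R
# Hex-dump the underlying bytes, most-significant byte first
# (assumes a little-endian machine, hence the rev()).
bytes <- function(x) paste(rev(writeBin(x, raw())), collapse = " ")

sapply(as.numeric(1:3), bytes)  # doubles: only the top 2 of 8 bytes vary here
#> "3f f0 00 00 00 00 00 00" "40 00 00 00 00 00 00 00" "40 08 00 00 00 00 00 00"

sapply(1:3, bytes)              # int32: the variation sits in the low byte
#> "00 00 00 01" "00 00 00 02" "00 00 00 03"
```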
This fix is going to require someone motivated to really dig in and understand how the hash table works.
### Describe the bug, including details regarding any error messages, version, and platform.
When the join columns are `numeric`, performance scales very badly with size. I noticed a huge speed difference between joining on `numeric` columns vs `int32`/`int64` columns when, by mistake, my columns were `numeric`. This is not the case for `dplyr`. I made the following figure by creating arrow tables of increasing sizes and joining each with another table a third of its size (see bottom for code):
As can be seen, when the join columns are `numeric` it scales much worse than the other types. I'm not even sure it makes sense to join on numerics... but I think this could be a bug, or at least something worth investigating.
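One note for anyone reproducing this: the slow case is easy to hit by accident, because a plain R `numeric` column is stored as an Arrow `double`. A quick way to check what type a join key actually ended up as (the toy table below is just for illustration) is to print the table's schema:

```R
# 'id' looks like integers but is numeric in R, so Arrow stores it as double
df <- data.frame(id = as.numeric(1:10), value = runif(10))
arrow::as_arrow_table(df)$schema
#> Schema
#> id: double
#> value: double
# ^ 'double' is the slow join-key type observed above
```

Casting the key with `as.integer()` (or `bit64::as.integer64()` for large ids) before converting avoids the slow path.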
### Code to reproduce

```R
sizes <- c(1e3, 1e4, 1e5, 1e6)
results <- data.frame()

for (size in sizes) {
  size1 <- size
  size2 <- floor(size / 3)

  # int32 id column
  dfInt1 <- data.frame(id = sample(1:size1, replace = FALSE), value = runif(size1))
  dfInt2 <- data.frame(id = sample(1:size2, replace = FALSE), value2 = runif(size2))
  arrowInt1 <- arrow::as_arrow_table(dfInt1)
  arrowInt2 <- arrow::as_arrow_table(dfInt2)

  # int64 id column
  type <- bit64::as.integer64
  dfInt64_1 <- data.frame(id = type(sample(1:size1, replace = FALSE)), value = runif(size1))
  dfInt64_2 <- data.frame(id = type(sample(1:size2, replace = FALSE)), value2 = runif(size2))
  arrowInt64_1 <- arrow::as_arrow_table(dfInt64_1)
  arrowInt64_2 <- arrow::as_arrow_table(dfInt64_2)

  # numeric (double) id column
  type <- as.numeric
  dfNum1 <- data.frame(id = type(sample(1:size1, replace = FALSE)), value = runif(size1))
  dfNum2 <- data.frame(id = type(sample(1:size2, replace = FALSE)), value2 = runif(size2))
  arrowNum1 <- arrow::as_arrow_table(dfNum1)
  arrowNum2 <- arrow::as_arrow_table(dfNum2)

  # run each join once before benchmarking (warm-up)
  arrowResultsInt <- arrowInt1 |> dplyr::inner_join(arrowInt2, by = 'id') |> dplyr::compute()
  arrowResultsInt64 <- arrowInt64_1 |> dplyr::inner_join(arrowInt64_2, by = 'id') |> dplyr::compute()
  arrowResultsNum <- arrowNum1 |> dplyr::inner_join(arrowNum2, by = 'id') |> dplyr::compute()

  res <- microbenchmark::microbenchmark(
    arrowInt32 = arrowInt1 |> dplyr::inner_join(arrowInt2, by = 'id') |> dplyr::compute(),
    arrowInt64 = arrowInt64_1 |> dplyr::inner_join(arrowInt64_2, by = 'id') |> dplyr::compute(),
    arrowNum = arrowNum1 |> dplyr::inner_join(arrowNum2, by = 'id') |> dplyr::compute(),
    dplyrNum = dfNum1 |> dplyr::inner_join(dfNum2, by = 'id'),
    dplyrInt = dfInt1 |> dplyr::inner_join(dfInt2, by = 'id'),
    check = NULL,
    times = 10
  )
  res <- summary(res)
  res$size <- size
  results <- rbind(res, results)
}

ggplot2::ggplot(data = results, ggplot2::aes(x = size, y = mean, group = expr, color = expr)) +
  ggplot2::geom_line() +
  ggplot2::ylab('mean time (ms)')
```

### Component(s)
R