apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] join with numeric columns has poor performance #35334

Open egillax opened 1 year ago

egillax commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

When the join columns are numeric (double), join performance scales really badly with table size. I noticed a huge speed difference between joining on numeric columns and joining on int32/int64 columns when, by mistake, my columns were numeric. This is not the case for dplyr.

I made the following figure by creating arrow tables of increasing sizes and joined them with another table a third of the size (see bottom for code):

[Figure: mean join time vs. table size for each key type (arrowInt32, arrowInt64, arrowNum, dplyrNum, dplyrInt)]

As can be seen, when the join columns are numeric the scaling is much worse than for the other types. I'm not even sure it makes sense to join on numeric keys at all, but I think this could be a bug, or at least something worth investigating.

Code to reproduce:

```r
sizes <- c(1e3, 1e4, 1e5, 1e6)
results <- data.frame()

for (size in sizes) {
  size1 <- size
  size2 <- floor(size / 3)

  # int32 id column
  dfInt1 <- data.frame(id = sample(1:size1, replace = F), value = runif(size1))
  dfInt2 <- data.frame(id = sample(1:size2, replace = F), value2 = runif(size2))
  arrowInt1 <- arrow::as_arrow_table(dfInt1)
  arrowInt2 <- arrow::as_arrow_table(dfInt2)

  # int64 id column
  type <- bit64::as.integer64
  dfInt64_1 <- data.frame(id = type(sample(1:size1, replace = F)), value = runif(size1))
  dfInt64_2 <- data.frame(id = type(sample(1:size2, replace = F)), value2 = runif(size2))
  arrowInt64_1 <- arrow::as_arrow_table(dfInt64_1)
  arrowInt64_2 <- arrow::as_arrow_table(dfInt64_2)

  # numeric (double) id column
  type <- as.numeric
  dfNum1 <- data.frame(id = type(sample(1:size1, replace = F)), value = runif(size1))
  dfNum2 <- data.frame(id = type(sample(1:size2, replace = F)), value2 = runif(size2))
  arrowNum1 <- arrow::as_arrow_table(dfNum1)
  arrowNum2 <- arrow::as_arrow_table(dfNum2)

  # warm-up runs
  arrowResultsInt <- arrowInt1 |> dplyr::inner_join(arrowInt2, by = 'id') |> dplyr::compute()
  arrowResultsInt64 <- arrowInt64_1 |> dplyr::inner_join(arrowInt64_2, by = 'id') |> dplyr::compute()
  arrowResultsNum <- arrowNum1 |> dplyr::inner_join(arrowNum2, by = 'id') |> dplyr::compute()

  res <- microbenchmark::microbenchmark(
    arrowInt32 = arrowInt1 |> dplyr::inner_join(arrowInt2, by = 'id') |> dplyr::compute(),
    arrowInt64 = arrowInt64_1 |> dplyr::inner_join(arrowInt64_2, by = 'id') |> dplyr::compute(),
    arrowNum = arrowNum1 |> dplyr::inner_join(arrowNum2, by = 'id') |> dplyr::compute(),
    dplyrNum = dfNum1 |> dplyr::inner_join(dfNum2, by = 'id'),
    dplyrInt = dfInt1 |> dplyr::inner_join(dfInt2, by = 'id'),
    check = NULL,
    times = 10
  )
  res <- summary(res)
  res$size <- size
  results <- rbind(res, results)
}

ggplot2::ggplot(data = results, ggplot2::aes(x = size, y = mean, group = expr, color = expr)) +
  ggplot2::geom_line() +
  ggplot2::ylab('mean time (ms)')
```

Component(s)

R

westonpace commented 1 year ago

This script is helpful; it reproduces the issue quite clearly. It seems the issue is definitely in the join code. A flame chart shows that significantly more time is spent both building and searching the hash table. The size of the key shouldn't really matter in this case, since these operations are working on the hash, which is always 32-bit.

More concretely, the arrow::compute::SwissTable::search_block function is called an order of magnitude more times when working with numeric data. Based on my (admittedly rough) understanding I believe this suggests a less even distribution across hash buckets when working with numeric data.

This fix is going to require someone motivated to really dig in and understand how the hash table works.