apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[R][C++] Upcasting from int32 to int64 when joining two tables #35333

Open egillax opened 1 year ago

egillax commented 1 year ago

Describe the usage question you have. Please include as many useful details as possible.

When joining data frames on columns that have different types (int32 vs int64, or int32 vs numeric), dplyr will upcast, but arrow raises an error.

As a user I would prefer the same behaviour as dplyr. Is this intended behaviour?

Small reprex:

# int32 id column
df1 <- data.frame(id = sample(1:1000, replace = FALSE),
                  value = runif(1000))

# id column with a different type
type <- bit64::as.integer64  # or as.numeric
df2 <- data.frame(id = type(sample(1:300, replace = FALSE)),
                  value2 = runif(300))

results <- df1 |> dplyr::inner_join(df2, by="id")

# dplyr upcasts int32 to int64, or int32 to numeric
str(results)

arrowTable <- arrow::as_arrow_table(df1)
arrowTable2 <- arrow::as_arrow_table(df2)

# arrow errors
resultsArrow <- arrowTable |> dplyr::inner_join(arrowTable2, by="id") |> dplyr::compute()

Component(s)

R

thisisnic commented 1 year ago

Thanks for reporting this @egillax. This is one of the known differences between dplyr and arrow, and all I can think of to recommend is ensuring that the table schemas match (e.g. calling arrowTable2 <- arrow::as_arrow_table(df2, schema = schema(id = int32(), value2 = double()))) before performing the join.

That said, we do upcast some values in other areas of the codebase, so it could be a reasonable change to request. I'll ping some people who can say more about whether this would be feasible as a feature request.
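[Editor's note: spelled out against the reprex above, that schema workaround might look like the sketch below. float64() is used here as the arrow type constructor for the value2 column; this is an illustration, not code from the thread.]

# Convert df2 with a schema that matches df1's types,
# so both join keys are int32 before the join runs
library(arrow)

arrowTable2 <- as_arrow_table(
  df2,
  schema = schema(id = int32(), value2 = float64())
)

resultsArrow <- arrowTable |>
  dplyr::inner_join(arrowTable2, by = "id") |>
  dplyr::compute()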

westonpace commented 1 year ago

In general, we try to avoid planning and optimization in Acero. Building a full planner or optimizer is simply too much work for the capacity we have. My hope is that someday a Substrait planner/optimizer will emerge, but we are probably years away from that.

Now, one could argue that this, like our implicit conversion in expressions, is a small enough addition that it won't send us down the slippery slope into planning.

The implicit conversion that we already handle, which @thisisnic mentions, happens when we bind expressions to a schema. The rules there are actually encoded into the functions themselves (i.e. each function gets to decide which implicit casts are appropriate). As a result, it isn't something that is easily reusable in its current form.
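[Editor's note: a small sketch (not from the thread) of the expression-level implicit casting described above: mixing an int32 column with a double literal inside mutate() works without an explicit cast, because the arithmetic kernel accepts that implicit conversion.]

library(arrow)
library(dplyr)

# 1:3 is an R integer vector, so x becomes an int32 column
tbl <- as_arrow_table(data.frame(x = 1:3))

# x + 0.5 succeeds: x is implicitly cast to double by the add kernel
tbl |>
  mutate(y = x + 0.5) |>
  compute()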

So I guess this is kind of a non-answer to the "can we do it?" question. I would say that it's doable, and if someone wanted to take this work on I would be willing to get it merged in. However, it is not a priority for me personally at the moment.

Is there any way to do the explicit cast in dplyr? E.g., instead of making sure the schemas match, could you read in the data with the true schema and then cast id to int64 (or int32) right before the join?

thisisnic commented 1 year ago

Thanks for responding there @westonpace!

Yes, an alternative to my code suggestion above could be to use mutate() to cast id to an int64, e.g. finishing the code block there with:

resultsArrow <- arrowTable |>
  dplyr::mutate(id = cast(id, arrow::int64())) |>
  dplyr::inner_join(arrowTable2, by = "id") |>
  dplyr::compute()
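[Editor's note: for the int32 vs numeric variant of the reprex (type <- as.numeric), the same pattern should work by casting the key to double instead; float64() is arrow's constructor for that type. A sketch, assuming the objects from the reprex:]

resultsArrow <- arrowTable |>
  dplyr::mutate(id = cast(id, arrow::float64())) |>
  dplyr::inner_join(arrowTable2, by = "id") |>
  dplyr::compute()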