apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.44k stars 3.51k forks source link

[R] to_arrow() loses group_by() #40640

Open xtimbeau opened 7 months ago

xtimbeau commented 7 months ago

Describe the bug, including details regarding any error messages, version, and platform.

When passing a duckdb dataset to arrow with to_arrow() grouping information is lost.

cars |> arrow::to_duckdb() |> dplyr::group_by(speed) |> arrow::to_arrow()
# output is ungrouped
#RecordBatchReader (query)
#speed: double
#dist: double

cars |> arrow::to_duckdb() |> arrow::to_arrow() |> dplyr::group_by(speed) 
# output is grouped
#RecordBatchReader (query)
#speed: double
#dist: double
#
#* Grouped by speed
#See $.data for the source Arrow object

Component(s)

R

thisisnic commented 7 months ago

Thanks @xtimbeau, can confirm this is reproducible on Ubuntu on Arrow 15.0.1, looks like it's something we might need to pull out of the DuckDB object and apply to the Arrow object, though I haven't checked to see if this is a regression or something we never implemented.

Would you be interested in submitting a PR at all?

xtimbeau commented 7 months ago

you mean correct the bug in the code? If it involves some R I can accept the challenge… but C I prefer to pass!

Le ven. 22 mars 2024 à 14:56, Nic Crane @.***> a écrit :

Thanks @xtimbeau https://github.com/xtimbeau, can confirm this is reproducible on Ubuntu on Arrow 15.0.1, looks like it's something we might need to pull out of the DuckDB object and apply to the Arrow object, though I haven't checked to see if this is a regression or something we never implemented.

Would you be interested in submitting a PR at all?

— Reply to this email directly, view it on GitHub https://github.com/apache/arrow/issues/40640#issuecomment-2015164232, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANA2KENUG5MZX6JO73CA743YZQ2BVAVCNFSM6AAAAABE35U5YWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJVGE3DIMRTGI . You are receiving this because you were mentioned.Message ID: @.***>

brianmsm commented 3 months ago

This seems to be the expected behavior of dbplyrwhen group_by is applied without any additional operations:

https://github.com/apache/arrow/blob/1815a679e47a47f2f5f4cd003ffbb18b6db2f952/r/R/duckdb.R#L175

library(dplyr, warn.conflicts = FALSE)

con_duck <- DBI::dbConnect(duckdb::duckdb(), ":memory:")
copy_to(con_duck, mtcars)

tbl(con_duck, "mtcars") %>% 
  group_by(cyl) %>% 
  dbplyr::remote_query()
#> <SQL> SELECT *
#> FROM mtcars

con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con_sqlite, mtcars)

tbl(con_sqlite, "mtcars") %>% 
  group_by(cyl) %>% 
  dbplyr::remote_query()
#> <SQL> SELECT *
#> FROM `mtcars`

Created on 2024-06-26 with reprex v2.1.0

I imagine it is desired behavior to not translate the group_by to SQL if there are no additional operations... however, since to_arrow() and to_duckdb() can be used as intermediate work operations, I should include them.... Perhaps, a solution analogous to the one developed in to_duckdb() could be proposed, but I'm not sure

https://github.com/apache/arrow/blob/68db6621c07624babec05caccbf47c079a5ed20e/r/R/duckdb.R#L65-L68