Open DavZim opened 1 year ago
Related SO question with context by OP: https://stackoverflow.com/questions/74570604/rlanghash-cannot-differentiate-between-arrow-queries
Thanks for linking this here as well. I am not sure if this issue will solve the SO question as well, but I will happily test to see if it does. 😄
Thanks for posting! I think this is a situation where the source of the dplyr query is another dplyr query:
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)
ds_file <- file.path(tempdir(), "mtcars")
write_dataset(mtcars |> select(mpg, cyl), ds_file)
ds <- open_dataset(ds_file)
# filter is printed | EXPECTED
q <- ds |> filter(mpg > 25)
class(q$.data)
#> [1] "FileSystemDataset" "Dataset" "ArrowObject"
#> [4] "R6"
q <- ds |>
  filter(mpg > 25) |>
  summarise(mpg = mean(mpg))
class(q$.data)
#> [1] "arrow_dplyr_query"
print(q$.data)
#> FileSystemDataset (query)
#> mpg: double
#>
#> * Aggregations:
#> mpg: mean(mpg)
#> * Filter: (mpg > 25)
#> See $.data for the source Arrow object
print(q)
#> FileSystemDataset (query)
#> mpg: double
#>
#> See $.data for the source Arrow object
Created on 2022-11-25 with reprex v2.0.2
We probably just need to make sure to print() the .data recursively if inherits(.data, "arrow_dplyr_query"); otherwise some of the steps may be hidden, as you discovered!
The print method is here: https://github.com/apache/arrow/blob/2078af7c710d688c14313b9486b99c981550a7b7/r/R/dplyr.R#L116-L168
Hi @paleolimbot, is there any way I can help with this bug?
If you'd like to have a go at opening a PR, I believe it's something like adding
if (inherits(x$.data, "arrow_dplyr_query")) {
  print(x$.data, ...)
  cat("------\n")
}
here:
https://github.com/apache/arrow/blob/2078af7c710d688c14313b9486b99c981550a7b7/r/R/dplyr.R#L118
...plus adding a test to tests/testthat/test-dplyr.R, something like expect_snapshot(print(some_table %>% filter(mpg > 25) %>% summarise(mpg = mean(mpg)))).
Getting set up with an Arrow development environment can take some effort (although we're trying to make it easier). If it's useful, the command I often use to build Arrow is https://gist.github.com/paleolimbot/be77218201bdbd20353c084c74254824#file-dockerfile-L5 .
This will almost certainly make the 12.0.0 release in a few months if you don't get to it!
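For anyone following along without an Arrow development build, the recursive-print idea can be tried on a toy S3 class. This is only a sketch: toy_query, new_toy_query() and the step field are made-up names for illustration and are not part of arrow; the real change would go into the print method for arrow_dplyr_query linked above.

```r
# Hypothetical stand-in for arrow_dplyr_query: a source (.data) plus one step.
new_toy_query <- function(.data, step) {
  structure(list(.data = .data, step = step), class = "toy_query")
}

print.toy_query <- function(x, ...) {
  # If the source is itself a query, print it first so no step stays hidden.
  if (inherits(x$.data, "toy_query")) {
    print(x$.data, ...)
    cat("------\n")
  }
  cat("* ", x$step, "\n", sep = "")
  invisible(x)
}

# A query whose source is another query, as in the filter + summarise case.
inner <- new_toy_query(data.frame(mpg = c(21, 30)), "Filter: (mpg > 25)")
outer <- new_toy_query(inner, "Aggregations: mpg: mean(mpg)")
print(outer)
#> * Filter: (mpg > 25)
#> ------
#> * Aggregations: mpg: mean(mpg)
```

The inner query is printed before the outer one, so the steps appear in the order they were applied.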
I'll try to look into it later this week or next. While I'm at it, I think it would be good to show the file in the printout as well. That is, if we have two files with the same data structure but different data and different names, any caching that keys on the printout of the query (which I think is how it works) might produce a cache collision. Taking the example above, I would suggest something like this as the output:
library(arrow)
library(dplyr)
ds_file <- file.path(tempdir(), "mtcars")
write_dataset(mtcars |> select(mpg, cyl), ds_file)
ds <- open_dataset(ds_file)
ds |> filter(mpg > 25)
#> FileSystemDataset (query) File: /tmp/RtmptUJio4/mtcars #<===== added this
#> mpg: double
#> cyl: double
#>
#> * Filter: (mpg > 25)
#> See $.data for the source Arrow object
Do you think this is worth it, and should it be included in this PR/issue as well, or should I open a new issue?
I think those are separate issues: one is the print() method for a FileSystemDataset; the other is the print() method for an arrow_dplyr_query. I think that information should print on print(the_query_object$.data) rather than print(the_query_object). Also note that FileSystemDatasets can be made up of hundreds or thousands of files, so it's more complicated than printing out just one file name.
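To illustrate the multi-file point, here is a minimal sketch that writes mtcars partitioned by cyl and then lists the files backing the resulting dataset through its $files field. The directory name mtcars_by_cyl is made up for the example, and the exact file names depend on the arrow version.

```r
library(arrow, warn.conflicts = FALSE)

# Partitioning by cyl writes one subdirectory per distinct cyl value.
ds_dir <- file.path(tempdir(), "mtcars_by_cyl")
write_dataset(mtcars, ds_dir, partitioning = "cyl")
ds <- open_dataset(ds_dir)

# $files returns every file backing the dataset, not a single path.
length(ds$files)
basename(dirname(ds$files))
```

Even this small dataset is backed by several files, so a print method that shows the source would have to summarise them (for example, a count or a common root directory) rather than print one name.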
Describe the bug, including details regarding any error messages, version, and platform.
When I print an arrow dataset-query that involves filtering then summarising, the filtering operation is not shown.
For example:
I would expect to see both the filter and the summarise step of the query.
I use R 4.1.1 with arrow version 10.0.0
Component(s)
R