apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[R] Filter operations not shown when called before summarise #14732

Open DavZim opened 1 year ago

DavZim commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

When I print an Arrow dataset query that filters and then summarises, the filter operation is not shown.

For example:

library(arrow)
library(dplyr)
ds_file <- file.path(tempdir(), "mtcars")

write_dataset(mtcars |> select(mpg, cyl), ds_file)
ds <- open_dataset(ds_file)

# filter is printed | EXPECTED
ds |> filter(mpg > 25)
#> FileSystemDataset (query)
#> mpg: double
#> cyl: double
#> 
#> * Filter: (mpg > 25)                                #<======
#> See $.data for the source Arrow object

# filter is NOT printed | NOT EXPECTED!
ds |> 
  filter(mpg > 25) |> 
  summarise(mpg = mean(mpg))
#> FileSystemDataset (query)
#> mpg: double
#>                                                               #<==== Missing?!
#> See $.data for the source Arrow object

# first filter is NOT printed | NOT EXPECTED!
# second filter is printed | EXPECTED
ds |> 
  filter(mpg > 25) |> 
  summarise(mpg = mean(mpg)) |> 
  filter(mpg  > 0)
#> FileSystemDataset (query)
#> mpg: double
#> 
#> * Filter: (mpg > 0)                                  #<==== Missing mpg > 25 ?!
#> See $.data for the source Arrow object

I would expect the printout to show both the filter and the summarise step of the query.

I use R 4.1.1 with arrow version 10.0.0

Arrow package version: 10.0.0

Capabilities:

dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3        FALSE
gcs       FALSE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip      FALSE
brotli    FALSE
zstd      FALSE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2       FALSE
jemalloc  FALSE
mimalloc   TRUE

To reinstall with more optional capabilities enabled, see
   https://arrow.apache.org/docs/r/articles/install.html

Memory:

Allocator mimalloc
Current    0 bytes
Max        0 bytes

Runtime:

SIMD Level          avx2
Detected SIMD Level avx2

Build:

C++ Library Version  10.0.0
C++ Compiler            GNU
C++ Compiler Version  7.5.0

Component(s)

R

assignUser commented 1 year ago

Related SO question with context by OP: https://stackoverflow.com/questions/74570604/rlanghash-cannot-differentiate-between-arrow-queries

DavZim commented 1 year ago

Thanks for linking this here as well. I am not sure whether fixing this issue will also resolve the SO question, but I will happily test to see if it does. 😄

paleolimbot commented 1 year ago

Thanks for posting! I think this is a situation where the source of the dplyr query is another dplyr query:

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)
ds_file <- file.path(tempdir(), "mtcars")

write_dataset(mtcars |> select(mpg, cyl), ds_file)
ds <- open_dataset(ds_file)

# filter is printed | EXPECTED
q <- ds |> filter(mpg > 25)
class(q$.data)
#> [1] "FileSystemDataset" "Dataset"           "ArrowObject"      
#> [4] "R6"

q <- ds |> 
  filter(mpg > 25) |> 
  summarise(mpg = mean(mpg))
class(q$.data)
#> [1] "arrow_dplyr_query"

print(q$.data)
#> FileSystemDataset (query)
#> mpg: double
#> 
#> * Aggregations:
#> mpg: mean(mpg)
#> * Filter: (mpg > 25)
#> See $.data for the source Arrow object
print(q)
#> FileSystemDataset (query)
#> mpg: double
#> 
#> See $.data for the source Arrow object

Created on 2022-11-25 with reprex v2.0.2

We probably just need to make sure to print() the .data recursively if inherits(.data, "arrow_dplyr_query") (or else some of the steps may be hidden, as you discovered!).

The print method is here: https://github.com/apache/arrow/blob/2078af7c710d688c14313b9486b99c981550a7b7/r/R/dplyr.R#L116-L168
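To illustrate the nesting described above, here is a minimal sketch in base R that does not use arrow at all. The class name "mock_query" and the helpers new_mock_query() and collect_steps() are made up for this illustration; only the idea of following .data while it is itself a query comes from the thread.

```r
# Mock stand-in for a nested query: each layer records its own steps and
# points at its source through .data. "mock_query" is a hypothetical class,
# not the real arrow_dplyr_query.
new_mock_query <- function(.data, steps) {
  structure(list(.data = .data, steps = steps), class = "mock_query")
}

# Walk the .data chain and collect the steps of every layer, innermost first.
collect_steps <- function(x) {
  if (!inherits(x, "mock_query")) {
    return(character())  # reached the underlying source object
  }
  c(collect_steps(x$.data), x$steps)
}

src <- "FileSystemDataset"  # placeholder for the real source object

# Mirrors the third example in the report: the inner layer holds the first
# filter and the aggregation, the outer layer holds only the second filter.
inner <- new_mock_query(src, c("Aggregations: mpg: mean(mpg)", "Filter: (mpg > 25)"))
outer <- new_mock_query(inner, "Filter: (mpg > 0)")

# Looking only at the outer layer loses the inner steps; the recursive walk
# recovers them.
steps <- collect_steps(outer)
print(steps)
```

A print method that recurses in the same way would show the inner filter and aggregation before the outer filter.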

DavZim commented 1 year ago

Hi @paleolimbot, is there any way I can help with this bug?

paleolimbot commented 1 year ago

If you'd like to give a go at opening a PR, I believe it's something like adding

if (inherits(x$.data, "arrow_dplyr_query")) {
  print(x$.data, ...)
  cat("------\n")
}

Here:

https://github.com/apache/arrow/blob/2078af7c710d688c14313b9486b99c981550a7b7/r/R/dplyr.R#L118

...plus adding a test to tests/testthat/test-dplyr.R...something like expect_snapshot(print(some_table %>% filter(mpg > 25) %>% summarise(mpg = mean(mpg)))).
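As a rough sketch of what such a check could look at, the snippet below captures the printed query and searches it for the filter expression, without snapshot files. It assumes an in-memory table built with arrow_table() as a stand-in for some_table; the variable names printed and shows_filter are made up for this example. On the affected version reported above, the flag would presumably be FALSE, and it should become TRUE once the nested query is printed.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Assumption: an in-memory Arrow Table stands in for some_table.
tab <- arrow_table(mtcars)

# Same shape of query as in the report: filter, then summarise.
q <- tab |>
  filter(mpg > 25) |>
  summarise(mpg = mean(mpg))

# Capture what print() writes and check whether the filter appears anywhere.
printed <- capture.output(print(q))
shows_filter <- any(grepl("mpg > 25", printed, fixed = TRUE))
shows_filter
```

Inside a testthat test, the last line would become something like expect_true(shows_filter), or the whole check could be replaced by the expect_snapshot() call suggested above.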

Getting set up with an Arrow development environment can take some effort (although we're trying to make it easier). If it's useful, the command I often use to build Arrow is https://gist.github.com/paleolimbot/be77218201bdbd20353c084c74254824#file-dockerfile-L5 .

This will almost certainly be fixed in time for the 12.0.0 release in a few months, even if you don't get to it!

DavZim commented 1 year ago

I'll try to look into it later this week or next. While I'm at it, I think it would be good to show the file in the printout as well. That is, if two files have the same data structure but different data and different names, any caching of the query (which works on the printout of the query, I think) might produce a cache collision. For example, building on the example above, I would suggest output like this:

library(arrow)
library(dplyr)
ds_file <- file.path(tempdir(), "mtcars")

write_dataset(mtcars |> select(mpg, cyl), ds_file)
ds <- open_dataset(ds_file)

ds |> filter(mpg > 25)
#> FileSystemDataset (query) File: /tmp/RtmptUJio4/mtcars                    #<===== added this
#> mpg: double
#> cyl: double
#> 
#> * Filter: (mpg > 25)
#> See $.data for the source Arrow object

Do you think this is worth it, and should it be included in this PR/issue as well, or should I open a new issue?

paleolimbot commented 1 year ago

I think those are separate issues...one is the print() method for a FileSystemDataset; the other is a print() method for an arrow_dplyr_query. I think that information should print on print(the_query_object$.data) (rather than print(the_query_object)). Also note that FileSystemDatasets can be made up of hundreds/thousands of files, so it's more complicated than printing out just one file name.
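To see why a single file name is not enough, here is a small sketch that writes the same data as a partitioned dataset and lists the backing files through the $files field of the FileSystemDataset. The directory name "mtcars_by_cyl" and the choice to partition by cyl are assumptions made for this example only.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

ds_file <- file.path(tempdir(), "mtcars_by_cyl")

# Partitioning by cyl writes one directory per distinct cyl value,
# so the dataset is backed by several files instead of one.
write_dataset(mtcars |> select(mpg, cyl), ds_file, partitioning = "cyl")
ds <- open_dataset(ds_file)

q <- ds |> filter(mpg > 25)

# For a single-step query, $.data is the FileSystemDataset itself,
# and its $files field lists every backing file.
source_files <- q$.data$files
length(source_files)
print(source_files)
```

With real datasets this vector can run to thousands of entries, which is why printing it in full on every query would not be practical.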