Open thisisnic opened 1 year ago
A much simpler reprex:
library(arrow)
library(dplyr)
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, am), tf)
open_dataset(tf) |>
filter(cyl == 6) %>%
to_duckdb() %>%
mutate(mean_hp = mean(hp)) %>%
to_arrow() %>%
filter(hp < mean_hp) %>%
nrow()
I think this is a peculiarity of our dim.arrow_dplyr_query() implementation, which uses Scanner$CountRows(). For example, a regular collect() works even though dim() doesn't:
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, am), tf)
# fine?
open_dataset(tf) |>
filter(cyl == 6) %>%
to_duckdb() %>%
mutate(mean_hp = mean(hp)) %>%
to_arrow() %>%
filter(hp < mean_hp) %>%
collect()
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # A tibble: 4 × 12
#> mpg cyl disp hp drat wt qsec vs gear carb am mean_hp
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 21.4 6 258 110 3.08 3.22 19.4 1 3 1 0 122.
#> 2 18.1 6 225 105 2.76 3.46 20.2 1 3 1 0 122.
#> 3 21 6 160 110 3.9 2.62 16.5 0 4 4 1 122.
#> 4 21 6 160 110 3.9 2.88 17.0 0 4 4 1 122.
# fine
open_dataset(tf) |>
filter(cyl == 6) %>%
to_duckdb() %>%
mutate(mean_hp = mean(hp)) %>%
to_arrow() %>%
filter(hp < mean_hp) %>%
collect()
#> # A tibble: 4 × 12
#> mpg cyl disp hp drat wt qsec vs gear carb am mean_hp
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 21.4 6 258 110 3.08 3.22 19.4 1 3 1 0 122.
#> 2 18.1 6 225 105 2.76 3.46 20.2 1 3 1 0 122.
#> 3 21 6 160 110 3.9 2.62 16.5 0 4 4 1 122.
#> 4 21 6 160 110 3.9 2.88 17.0 0 4 4 1 122.
# error
open_dataset(tf) |>
filter(cyl == 6) %>%
to_duckdb() %>%
mutate(mean_hp = mean(hp)) %>%
to_arrow() %>%
filter(hp < mean_hp) %>%
dim()
#> Error: NotImplemented: Call to R (SafeRecordBatchReader::ReadNext()) from a non-R thread from an unsupported context
The error traceback:
Error: NotImplemented: Call to R (SafeRecordBatchReader::ReadNext()) from a non-R thread from an unsupported context
dataset___Scanner__CountRows(self) at dataset-scan.R#85
Scanner$create(x)$CountRows() at dplyr.R#186
dim.arrow_dplyr_query(x)
dim(x)
nrow(.)
The workaround would be to use count() and pull(n). This works because executing an exec plan is one of the "supported contexts" for SafeCallIntoR() (calling a Scanner method is not, and probably shouldn't be since as far as I know the Scanner methods are all implementable using an exec plan).
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, am), tf)
open_dataset(tf) |>
filter(cyl == 6) %>%
to_duckdb() %>%
mutate(mean_hp = mean(hp)) %>%
to_arrow() %>%
filter(hp < mean_hp) %>%
count() |>
pull(n)
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> Warning: Default behavior of `pull()` on Arrow data is changing. Current behavior of returning an R vector is deprecated, and in a future release, it will return an Arrow `ChunkedArray`. To control this:
#> ℹ Specify `as_vector = TRUE` (the current default) or `FALSE` (what it will change to) in `pull()`
#> ℹ Or, set `options(arrow.pull_as_vector)` globally
#> This warning is displayed once every 8 hours.
#> [1] 4
A more permanent solution would be to reimplement dim() for a dplyr query using an exec plan.
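Until that happens, the count() workaround can be wrapped in a small user-level helper. This is only a sketch: count_rows_via_plan() is a hypothetical name, not part of arrow, and it is not the internal fix proposed above. It assumes the query object supports ungroup(), count() and collect(), which holds for plain data frames and, as far as I know, for arrow dplyr queries. Collecting before pull() means pull() runs on an ordinary tibble, so the pull() deprecation warning shown earlier should not be triggered.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not part of arrow): count rows by running an
# aggregation through the query engine instead of calling nrow()/dim().
count_rows_via_plan <- function(query) {
  query |>
    ungroup() |>   # make sure count() returns a single row
    count() |>     # produces a one-row result with a column named "n"
    collect() |>   # materialise the one-row result as a data frame
    pull("n")      # extract the count as a plain R vector
}

# Works on an ordinary data frame, which makes the helper easy to check:
count_rows_via_plan(mtcars)
#> [1] 32
```

In the failing pipeline above, the final nrow() or dim() call would be replaced with count_rows_via_plan(); the column count is unaffected by the error and can still be taken from the names of the query.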
Describe the bug, including details regarding any error messages, version, and platform.
I'm testing out some code for a workshop on a fresh install (all packages just downloaded, Arrow built from source from 13.0.0 release candidate 3 branch) and get the following error:
sessionInfo() output:
Component(s)
R