OHDSI / Andromeda

AsynchroNous Disk-based Representation of MassivE DAta: An R package aimed at replacing ff for storing large data objects.
https://ohdsi.github.io/Andromeda/

arrow version: upcoming changes to pull() behavior #47

Open schuemie opened 1 year ago

schuemie commented 1 year ago

When using pull(), we get this warning:

Warning message:
Default behavior of `pull()` on Arrow data is changing. Current behavior of returning an R vector is deprecated, and in a future release, it will return an Arrow `ChunkedArray`. To control this:
i Specify `as_vector = TRUE` (the current default) or `FALSE` (what it will change to) in `pull()`
i Or, set `options(arrow.pull_as_vector)` globally
This warning is displayed once every 8 hours. 

This warning is annoying, and the advertised new behavior will break many HADES packages when it becomes the default in some future release of arrow.

I don't have a good solution here. Is there an alternative to pull()? Should we set options(arrow.pull_as_vector = TRUE) in Andromeda's .onLoad() function? (Would CRAN allow that?)

ablack3 commented 1 year ago

I'm not sure whether CRAN allows that. I think it's generally not recommended to set options for the user, but we could check the option and print a message if it is not set.

An alternative to pull() would be to use select() and then collect():

library(Andromeda)

a <- andromeda(cars = cars)
a$cars %>% pull(speed)
#>  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
#> [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
a$cars %>% select(speed) %>% collect() %>% {.[["speed"]]}
#>  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
#> [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25

Created on 2023-03-17 with reprex v2.0.2

This example uses the current release, but it should work with the new release as well.

ablack3 commented 1 year ago

I commented on the issue: https://github.com/apache/arrow/issues/32705

I'd propose printing a message or warning in Andromeda's .onLoad() if the arrow.pull_as_vector option is not set.
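
A minimal sketch of that check, assuming a helper named checkPullOption() (a hypothetical name, not part of Andromeda) that is called from the package's load hook. Note that R CMD check tends to prefer startup messages in .onAttach() rather than .onLoad(), so the hook used here is also just an assumption:

```r
# Hypothetical helper, not Andromeda's actual code.
# Returns (invisibly) TRUE if the option is set, FALSE otherwise.
checkPullOption <- function() {
  # getOption() returns NULL when the option has never been set
  isSet <- !is.null(getOption("arrow.pull_as_vector"))
  if (!isSet) {
    message(
      "Option 'arrow.pull_as_vector' is not set. ",
      "Consider options(arrow.pull_as_vector = TRUE) so that pull() ",
      "keeps returning R vectors."
    )
  }
  invisible(isSet)
}

# Sketch of the load hook that would call the check
.onLoad <- function(libname, pkgname) {
  checkPullOption()
}
```

This only informs the user and never changes their options, which sidesteps the question of whether CRAN would accept a package that sets options on load.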

Alternatively I could provide a function that would have the same behavior as pull (returns vector) and we could switch to that.
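
A rough sketch of such a function, built on select() and collect() so that it does not depend on pull()'s default. The name pullVector() and its signature are hypothetical; nothing like it exists in Andromeda at the time of writing:

```r
library(dplyr)

# Hypothetical helper: returns one column of a (possibly lazy) table
# as a plain R vector. `column` is the column name as a string.
pullVector <- function(tbl, column) {
  # select() keeps only the requested column before the data is
  # materialized, so collect() brings just that column into memory
  collected <- tbl %>%
    select(all_of(column)) %>%
    collect()
  collected[[column]]
}

# Usage on an ordinary data frame; an Andromeda table such as a$cars
# would be passed in the same way
pullVector(cars, "speed")
```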

Do you think it would be possible or even advantageous to use chunked arrays instead of R vectors? One benefit we have in Andromeda is that

ablack3 commented 1 year ago

Another option is to add withr::local_options() at the beginning of functions that use pull on Andromeda tables.
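
For illustration, a sketch of what that could look like. The function getSpeeds() is a made-up example, not an existing HADES function; the option name is the one from the arrow warning above:

```r
library(dplyr)

# Hypothetical function that pulls a column from an Andromeda table
getSpeeds <- function(andromeda) {
  # Set the option for the duration of this call only; withr restores
  # the previous value (or unsets it) when the function exits
  withr::local_options(list(arrow.pull_as_vector = TRUE))
  andromeda$cars %>% pull(speed)
}
```

The user's global options are left untouched, but every function that calls pull() would need this line.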