OHDSI / Andromeda

AsynchroNous Disk-based Representation of MassivE DAta: An R package aimed at replacing ff for storing large data objects.
https://ohdsi.github.io/Andromeda/

Don't copy arrow object when calling batchApply() on a dplyr query #53

Open schuemie opened 1 year ago

schuemie commented 1 year ago

A common pattern in HADES is to run some dplyr query on an Andromeda table, and then call batchApply() on the resulting query object (e.g. by Cyclops). However, in this scenario the current Andromeda implementation always first copies the result of the query into a new Andromeda object before batching. My guess is that this is because arrow::ScannerBuilder$create() does not accept an arrow_dplyr_query object.

But I did find the arrow::as_record_batch_reader() function works fine with arrow_dplyr_query objects:

a <- andromeda(cars = cars)
dplyrQuery <- dplyr::filter(a$cars, speed > 10)
reader <- arrow::as_record_batch_reader(dplyrQuery)
head(as.data.frame(reader$read_next_batch()), 5)
# speed dist
# 1    11   17
# 2    11   28
# 3    12   14
# 4    12   20
# 5    12   24

The only downside is that you can't set the batch size, although batching definitely does take place.

I would propose using this to avoid having to copy the query result into a temporary Andromeda object, which can consume a lot of resources, and which also runs into the Windows issue where we can't delete the temporary Andromeda object.
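To make the proposal concrete, here is a minimal sketch of how batching over an arrow_dplyr_query could look without the intermediate copy. The function name batchApplyQuery is hypothetical (not part of Andromeda's API); it only assumes arrow::as_record_batch_reader() and the reader's read_next_batch() method, as demonstrated above:

```r
library(arrow)

# Hypothetical sketch: apply a function to each batch of a query result,
# streaming batches via arrow::as_record_batch_reader() instead of first
# materializing the result in a new Andromeda object.
batchApplyQuery <- function(query, fun, ...) {
  # as_record_batch_reader() accepts arrow_dplyr_query objects (and also
  # Tables, Datasets, and data frames), so no copy is needed:
  reader <- arrow::as_record_batch_reader(query)
  repeat {
    batch <- reader$read_next_batch()
    if (is.null(batch)) break  # NULL signals the end of the stream
    fun(as.data.frame(batch), ...)
  }
  invisible(NULL)
}

# Usage, mirroring the example above:
# a <- Andromeda::andromeda(cars = cars)
# dplyrQuery <- dplyr::filter(a$cars, speed > 10)
# batchApplyQuery(dplyrQuery, function(df) message(nrow(df), " rows"))
```

Note that, as mentioned, the batch size is chosen by arrow here rather than by the caller, which is the one behavioral difference from the current copy-based implementation.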