thisisnic opened this issue 1 year ago
A few things I'm observing: `gc()` doesn't fix it.

I think this is a duckdb rather than an arrow bug, actually.
@paleolimbot - mind taking a look at this? I'm struggling to pin down the source of this behaviour.
I know we've seen memory issues with PyArrow code like this, and I suspect the R package uses a similar code path:
```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

table_ds = ds.dataset([path_to_parquet_file], filesystem=fs.LocalFileSystem())
conn.from_arrow(table_ds).limit(some_n).arrow()
```
`from_arrow` creates a Scanner on the Arrow side, and the leak-like behaviour only happens when the `limit()` clause is used. This is because `limit()` triggers cancellation of the Scanner and the Scanner isn't cancel-aware. I think this is related to https://github.com/apache/arrow/issues/20338.
Before filing an issue with DuckDB, I can take a look to see whether it's plausible that the same thing is happening here.
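For anyone who wants to poke at this, here is a minimal sketch (mine, not taken from the reports above) of the R-side path I suspect is involved. It assumes the `tmpdata.parquet` file created in the reproduction below; `head()` is what introduces the `LIMIT`, which should be the clause that cancels the Scanner early:

```r
# Sketch of the suspected path: to_duckdb() registers the Arrow dataset with
# duckdb, and an early-terminating LIMIT should cancel the Arrow Scanner
# before it finishes scanning the file.
library(arrow)
library(dplyr)

ds <- open_dataset("tmpdata.parquet")

# Full scan for comparison: the Scanner runs to completion.
full_count <- ds %>% to_duckdb() %>% count() %>% collect()

# LIMIT via head(): duckdb stops pulling record batches early, so the Scanner
# is cancelled partway through, which is the case where memory appears to stick.
first_rows <- ds %>% to_duckdb() %>% head(10) %>% collect()
```

Watching the process RSS after each of those two calls (rather than only what `gc()` reports) might help confirm whether the cancelled-Scanner path is what retains the memory.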
Hi all. I'm still finding the same issue using arrow and duckdb. Not sure if you have been able to look into this any further, but it would be interesting to know if you have found a strategy to avoid it.
I created a dataset to work with the same code @thisisnic used as an example. To create it:
```r
dataset <- data.frame(
  id = 1:200000000,
  year = sample(2017:2020, 200000000, replace = TRUE),
  trip_distance = rnorm(200000000, mean = 5, sd = 2)
)

tmpfile <- "tmpdata.parquet"
arrow::write_parquet(dataset, tmpfile)
```
And then to run the example:
```r
library(arrow)
library(dplyr)

nyc_taxi <- open_dataset("tmpdata.parquet")

nyc_taxi <- nyc_taxi %>%
  filter(year == 2019) %>%
  to_duckdb() %>%
  mutate(mean_distance = mean(trip_distance)) %>%
  to_arrow() %>%
  filter(trip_distance < mean_distance) %>%
  count() %>%
  collect()
```
After running the code, the memory used goes to around 4 GB on my machine. After I run `gc()`, it goes down to around 2.8 GB.
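As a side note, a quick way to tell whether the retained memory is held by R itself or by Arrow's allocator (a sketch, assuming the query above has just been run; Arrow allocations live off the R heap, so `gc()` never sees them):

```r
library(arrow)

# Memory tracked by R's own garbage collector, in MB.
r_heap_mb <- sum(gc(full = TRUE)[, 2])

# Memory still held by Arrow's default memory pool, in bytes. If this stays
# large after gc(), the retained allocations are on the Arrow/C++ side rather
# than in R objects.
arrow_bytes <- default_memory_pool()$bytes_allocated

c(r_heap_mb = r_heap_mb, arrow_mb = arrow_bytes / 1024^2)
```

Neither number covers memory held by duckdb itself or by the allocator's free lists, so the process RSS can still sit above both combined.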
Thanks for adding a simple reproduction, @dhersz. I haven't had time to move this forward. I still think there's a bug somewhere in here, but this ticket needs a volunteer to find and fix it.
I think the best workaround at the moment would be to use duckdb, maybe with the duckdbfs package, which provides an arrow-inspired `open_dataset()` method:
```r
library(duckdbfs)
library(dplyr)

# Move `tmpdata.parquet` into a folder, then call open_dataset() on that
# top-level folder (`/path/to/toplevel` below).
nyc_taxi <- open_dataset("/path/to/toplevel")

nyc_taxi %>%
  filter(year == 2019) %>%
  mutate(mean_distance = mean(trip_distance)) %>%
  filter(trip_distance < mean_distance) %>%
  count() %>%
  collect()
```
On my system, R's memory usage peaks at about 1 GiB while processing and appears to release it all afterwards (it shrinks back down to what it was before I ran the query).
Describe the bug, including details regarding any error messages, version, and platform.
I'll flesh this out with a proper reprex later; the dataset in question is the data from the Voltron Data S3 bucket. I'm seeing 22 GB of RAM usage here despite the returned result being only 1 row.
Component(s)
R