apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/

[R] Passing a large dataset to duckdb and back results in memory being used and not freed #37495

Open · thisisnic opened this issue 1 year ago

thisisnic commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

I'll flesh this out with a proper reprex later; the dataset in question is the NYC taxi data from the Voltron Data S3 bucket.

library(arrow)
library(dplyr)

nyc_taxi <- open_dataset(here::here("data/nyc-taxi"))

nyc_taxi %>%
  filter(year == 2019) %>%
  to_duckdb() %>%                                   # hand the arrow query off to duckdb
  mutate(mean_distance = mean(trip_distance)) %>%   # window mean over the 2019 rows
  to_arrow() %>%                                    # stream results back as an Arrow reader
  filter(trip_distance < mean_distance) %>%
  count() %>%
  collect()

I'm seeing 22 GB of RAM usage here despite the returned result being only 1 row.
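For anyone trying to quantify this, here is a minimal sketch of how the before/after memory could be measured from within R. It is an assumption on my part that the ps package is available; rss_gb() is just a hypothetical helper around ps::ps_memory_info():

library(arrow)
library(dplyr)

# Resident set size of the current R process, in GiB (hypothetical helper)
rss_gb <- function() unname(ps::ps_memory_info()["rss"]) / 1024^3

before <- rss_gb()

nyc_taxi <- open_dataset(here::here("data/nyc-taxi"))
result <- nyc_taxi %>%
  filter(year == 2019) %>%
  to_duckdb() %>%
  mutate(mean_distance = mean(trip_distance)) %>%
  to_arrow() %>%
  filter(trip_distance < mean_distance) %>%
  count() %>%
  collect()

invisible(gc())
after <- rss_gb()
c(before_gb = before, after_gb = after)  # the gap is memory not returned to the OS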

Component(s)

R

thisisnic commented 1 year ago

A few things I'm observing:

thisisnic commented 1 year ago

I think this is actually a duckdb bug rather than an arrow bug.

thisisnic commented 1 year ago

@paleolimbot - mind taking a look at this? I'm struggling to pin down the source of this behaviour.

amoeba commented 1 year ago

I know we've seen memory issues with PyArrow code like the following, and I suspect the R package uses a similar code path:

import pyarrow.dataset as ds
from pyarrow import fs

table_ds = ds.dataset([path_to_parquet_file], filesystem=fs.LocalFileSystem())
conn.from_arrow(table_ds).limit(some_n).arrow()

from_arrow creates a Scanner on the Arrow side, and the leak-like behavior only happens when the limit() clause is used. This is because limit() triggers cancellation of the Scanner, and the Scanner isn't cancel-aware. I think this is related to https://github.com/apache/arrow/issues/20338.
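To probe the same hypothesis from R, something like the sketch below might work. It is only a sketch under my own assumptions: it reuses the dataset path from the original report, treats head() as the analogue of Python's limit(), and uses arrow::default_memory_pool()$bytes_allocated as a rough indicator of how much memory Arrow still holds once the result has been discarded:

library(arrow)
library(dplyr)
library(duckdb)

nyc_taxi <- open_dataset(here::here("data/nyc-taxi"))  # dataset from the original report

arrow_bytes_after <- function(limit_rows = NULL) {
  q <- nyc_taxi %>%
    to_duckdb() %>%
    filter(year == 2019)
  if (!is.null(limit_rows)) {
    q <- head(q, limit_rows)  # plays the role of limit() in the Python example
  }
  res <- q %>%
    to_arrow() %>%
    count() %>%
    collect()
  rm(res)
  invisible(gc())
  default_memory_pool()$bytes_allocated  # bytes still allocated in Arrow's memory pool
}

arrow_bytes_after()       # full scan, reader consumed to the end
arrow_bytes_after(1000L)  # row limit, which should stop the scan early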

Before filing an issue with DuckDB, I can take a look to see whether it's plausible that the same thing is happening here.

dhersz commented 1 week ago

Hi all. I'm still running into the same issue when using arrow and duckdb together. I'm not sure whether you have been able to look into this any further, but it would be interesting to know if you have found a strategy to avoid it.

I created a dataset so that I could run the same code @thisisnic used as an example. To create it:

# 200 million rows of synthetic data with the columns used in the example above
dataset <- data.frame(
  id = 1:200000000,
  year = sample(2017:2020, 200000000, replace = TRUE),
  trip_distance = rnorm(200000000, mean = 5, sd = 2)
)

tmpfile <- "tmpdata.parquet"
arrow::write_parquet(dataset, tmpfile)

And then to run the example:

nyc_taxi <- open_dataset("tmpdata.parquet")

nyc_taxi <- nyc_taxi %>%
  filter(year == 2019) %>%
  to_duckdb() %>%
  mutate(mean_distance = mean(trip_distance)) %>%
  to_arrow() %>%
  filter(trip_distance < mean_distance) %>%
  count() %>%
  collect()

After running the code, memory usage goes up to around 4 GB on my machine. After I run gc(), it goes down to around 2.8 GB.
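One thing that might help narrow this down (a sketch on my part, assuming it is run in the same session right after the reproduction above): checking whether the memory that survives gc() is still held by Arrow's memory pool, or has already been freed by Arrow and is simply being retained by the allocator rather than returned to the OS, which jemalloc and mimalloc builds can do:

pool <- arrow::default_memory_pool()
pool$backend_name     # "jemalloc", "mimalloc" or "system", depending on the build
pool$bytes_allocated  # bytes Arrow currently holds; near zero means Arrow has released its buffers
pool$max_memory       # high-water mark for the pool during this session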

amoeba commented 1 week ago

Thanks for adding a simple reproduction, @dhersz. I haven't had time to move this forward. I still think there's a bug somewhere in here, but this ticket needs a volunteer to find and fix it.

I think the best workaround at the moment would be to use duckdb directly, maybe with the duckdbfs package, which provides an arrow-inspired open_dataset() function:

library(duckdbfs)
library(dplyr)

# move `tmpdata.parquet` into a folder, then call open_dataset() on that
# top-level folder (`/path/to/toplevel` below)
nyc_taxi <- open_dataset("/path/to/toplevel")

nyc_taxi %>%
  filter(year == 2019) %>%
  mutate(mean_distance = mean(trip_distance)) %>%
  filter(trip_distance < mean_distance) %>%
  count() %>%
  collect()

On my system, R's memory usage peaks at about 1 GiB while processing, and it appears to release all of it afterwards (memory shrinks back down to what it was before I ran the query).
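Another option that might be worth trying, if it is acceptable to bypass Arrow's scanner entirely, is to query the Parquet file through duckdb and dbplyr directly. This is only a sketch under that assumption; I have not benchmarked it against the duckdbfs approach above:

library(DBI)
library(duckdb)
library(dplyr)
library(dbplyr)

con <- dbConnect(duckdb())

# DuckDB's native Parquet reader; no Arrow Dataset or Scanner is involved
nyc_taxi <- tbl(con, sql("SELECT * FROM read_parquet('tmpdata.parquet')"))

nyc_taxi %>%
  filter(year == 2019) %>%
  mutate(mean_distance = mean(trip_distance)) %>%
  filter(trip_distance < mean_distance) %>%
  count() %>%
  collect()

dbDisconnect(con, shutdown = TRUE)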