apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] tidyr::unnest function for arrow dataset object containing many parquet objects #40255

Open michaelgaunt404 opened 8 months ago

michaelgaunt404 commented 8 months ago

Describe the usage question you have. Please include as many useful details as possible.

Is there a tidyr::unnest equivalent for Arrow datasets with multiple Parquet files?

I need to handle close to a hundred terabytes of Parquet files. Each file has an attribute with nested tables, and within these tables, there's another attribute containing OpenStreetMap IDs that require filtering. I need to cross-reference these IDs with attributes from another index. If it were a flat file or a long "tidy" data frame, it wouldn't be an issue, but the nested structure is complicating matters with the Arrow dataset object.

Currently, I use an iterative approach: I load individual Parquet files into memory, filter them, and save the results (in practice, I do this in parallel across the available cores on my computer). However, I've come across Arrow datasets, and the ability to define operations lazily before loading the data could greatly improve speed.

See the images below for reference on the data I'm working with.

image

image

Component(s)

R

eitsupi commented 8 months ago

I think it is related to #24956 and #34762.

I have not tried it, but polars' unnest might work well. https://rpolars.github.io/man/LazyFrame_unnest.html

DuckDB also has an unnest function. https://duckdb.org/docs/sql/query_syntax/unnest.html
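
For the DuckDB route, a minimal sketch from R might look like the following. It assumes the DBI and duckdb packages are installed. The table `trips` and the columns `trip_id`, `segments`, `osm_id`, and `road` are made up for illustration and are not taken from the data in this issue.

```r
# Toy illustration only: table and column names are hypothetical.
library(DBI)

con <- dbConnect(duckdb::duckdb())  # in-memory DuckDB database

# Each row holds a list of structs, mimicking a nested table column.
dbExecute(con, "
  CREATE TABLE trips AS
  SELECT * FROM (VALUES
    (1, [{'osm_id': 101, 'road': 'a'}, {'osm_id': 102, 'road': 'b'}]),
    (2, [{'osm_id': 103, 'road': 'c'}])
  ) AS t(trip_id, segments)
")

# UNNEST emits one row per list element; the struct fields are then
# pulled out with dot notation and filtered like ordinary columns.
flat <- dbGetQuery(con, "
  SELECT trip_id, seg.osm_id AS osm_id, seg.road AS road
  FROM (SELECT trip_id, UNNEST(segments) AS seg FROM trips) AS u
  WHERE seg.osm_id IN (101, 103)
  ORDER BY trip_id
")

dbDisconnect(con, shutdown = TRUE)
print(flat)
```

For real files, the toy table could presumably be replaced with DuckDB's `read_parquet()` pointed at a glob of Parquet files, so that only the filtered rows are pulled into R. Whether this performs acceptably at the scale described above would need to be tested.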