michaelgaunt404 opened 8 months ago
I think this is related to #24956 and #34762.
I have not tried it, but polars' unnest might work well:
https://rpolars.github.io/man/LazyFrame_unnest.html
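An untested sketch of what that might look like with the polars R package; `segments` is a placeholder for the actual nested column name, and a list-of-structs column may need `$explode()` before `$unnest()`:

```r
library(polars)

# Lazily scan one file, flatten the nested column, then collect.
# `segments` is a hypothetical column name -- substitute the real one.
result <- pl$scan_parquet("path/to/one_file.parquet")$
  unnest("segments")$
  collect()
```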
DuckDB also has an unnest function: https://duckdb.org/docs/sql/query_syntax/unnest.html
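A hedged sketch of the DuckDB route from R; the `segments` and `osm_id` names and the `osm_index` lookup table are assumptions, not the actual schema:

```r
library(DBI)

con <- DBI::dbConnect(duckdb::duckdb())

# `index_df` is a hypothetical in-memory data frame of OSM IDs to keep.
duckdb::duckdb_register(con, "osm_index", index_df)

# Unnest the nested column across all Parquet files, then filter on the
# OSM ID field. `segments` and `osm_id` are placeholder names.
res <- DBI::dbGetQuery(con, "
  WITH exploded AS (
    SELECT UNNEST(segments, recursive := true)
    FROM read_parquet('path/to/parquet/**/*.parquet')
  )
  SELECT *
  FROM exploded
  WHERE osm_id IN (SELECT osm_id FROM osm_index)
")

DBI::dbDisconnect(con, shutdown = TRUE)
```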
Describe the usage question you have. Please include as many useful details as possible.
Is there a tidyr::unnest equivalent for Arrow datasets with multiple Parquet files?
I need to handle close to a hundred terabytes of Parquet files. Each file has an attribute with nested tables, and within these tables, there's another attribute containing OpenStreetMap IDs that require filtering. I need to cross-reference these IDs with attributes from another index. If it were a flat file or a long "tidy" data frame, it wouldn't be an issue, but the nested structure is complicating matters with the Arrow dataset object.
Currently, I use an iterative approach: I load individual Parquet files into memory, filter them, and save them back out (in practice I do this in parallel across the available cores on my computer). However, I've come across Arrow datasets, and the ability to lazily define operations before loading the data into memory could greatly improve speed.
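Roughly, the per-file approach I use today looks like the sketch below; the column names `segments` / `osm_id` and the lookup table `osm_index` are placeholders for my real data:

```r
library(arrow)
library(dplyr)
library(tidyr)
library(future)
library(furrr)

plan(multisession)  # one worker per available core

files <- list.files("path/to/parquet", pattern = "\\.parquet$", full.names = TRUE)
dir.create("filtered", showWarnings = FALSE)

# Read each file, unnest the nested column, keep only rows whose OSM ID
# appears in the index, and write the filtered result back out.
future_walk(files, function(f) {
  read_parquet(f) |>
    unnest(cols = segments) |>
    filter(osm_id %in% osm_index$osm_id) |>
    write_parquet(file.path("filtered", basename(f)))
})
```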
See the images below for reference on the data I'm working with.
Component(s)
R