[R] tidyr::unnest function for arrow dataset object containing many parquet objects

Describe the usage question you have. Please include as many useful details as possible.

Is there a tidyr::unnest equivalent for Arrow datasets with multiple Parquet files?

I need to handle close to a hundred terabytes of Parquet files. Each file has an attribute with nested tables, and within these tables, there's another attribute containing OpenStreetMap IDs that require filtering. I need to cross-reference these IDs with attributes from another index. If it were a flat file or a long "tidy" data frame, it wouldn't be an issue, but the nested structure is complicating matters with the Arrow dataset object.

Currently, I use an iterative approach: I load individual Parquet files into memory, filter them, and save the results, running this in parallel across the available cores on my computer. However, I've come across Arrow datasets, and the ability to define operations lazily before loading the data could greatly improve speed.

See the images below for reference on the data I'm working with.

apache / arrow

[R] tidyr::unnest function for arrow dataset object containing many parquet objects #40255

Component(s)