Fixed-width files are a common data provisioning format for (very) large administrative data files. We have been converting provisioned fwf files to .parquet and then leveraging arrow::open_dataset() with good success. However, we still run into RAM limits at the read-in step and are keen to try new approaches to this in-memory bottleneck (ideally without chunking files, etc.).
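A simple example workflow looks like this (a minimal sketch: the file path, column widths, and column names are illustrative):

```r
library(readr)
library(arrow)

# Read the fixed-width file into memory. On very large files this
# read-in step is where RAM becomes the bottleneck.
df <- read_fwf(
  "provisioned-data.fwf",
  col_positions = fwf_widths(c(8, 2, 10), c("id", "region", "amount"))
)

# Convert once to Parquet, then work with it lazily via Arrow Datasets.
write_parquet(df, "provisioned-data.parquet")

ds <- open_dataset("provisioned-data.parquet")
```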
With an {arrow} fixed-width reader, we could perhaps leverage arrow::open_dataset(as_data_frame = FALSE) directly on a large fwf file and then convert to partitioned .parquet files with arrow::write_dataset()?
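A sketch of what that might look like, assuming a hypothetical format = "fwf" option and fixed-width reader options that do not exist in {arrow} today:

```r
library(arrow)

# HYPOTHETICAL: {arrow} has no fixed-width reader today. The
# format = "fwf" option and the width/name arguments below are
# invented to illustrate the proposed workflow.
ds <- open_dataset(
  "provisioned-data.fwf",
  format = "fwf",                            # hypothetical format
  col_widths = c(8, 2, 10),                  # hypothetical option
  col_names  = c("id", "region", "amount")   # hypothetical option
)

# This part exists today: stream a Dataset into partitioned Parquet
# without materializing it as a data frame.
write_dataset(ds, "parquet-dir", partitioning = "region")
```

The second step already works for Dataset objects via write_dataset(), so the fixed-width scan itself is the missing piece.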
It would also be great to have this functionality exposed in Python. Currently one can use the pandas fixed-width reader and convert to pyarrow, but that comes with many caveats.
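For reference, a minimal sketch of the current pandas route (the file path, widths, and names are illustrative):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# pandas reads the entire fixed-width file into memory, so the same
# RAM bottleneck applies at this step.
df = pd.read_fwf(
    "provisioned-data.fwf",
    widths=[8, 2, 10],
    names=["id", "region", "amount"],
)

# Convert to Arrow and write Parquet.
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "provisioned-data.parquet")
```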
Reporter: Stephanie Hazlitt / @stephhazlitt
Related issues:
Note: This issue was originally created as ARROW-11587. Please see the migration documentation for further details.