Fixed-width files are a common data provisioning format for (very) large administrative data files. We have been converting provisioned fwf files to .parquet and then leveraging arrow::open_dataset() with good success. However, we still run into RAM limits at the read-in step and are keen to try new approaches to this in-memory bottleneck (ideally without chunking files, etc.).
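A simple example workflow looks like this (a minimal sketch: the file path, column widths, and column names are illustrative):

```r
library(readr)
library(arrow)

# Read the fixed-width file into memory. On very large files this
# read-in step is where RAM becomes the bottleneck.
df <- read_fwf(
  "provisioned-data.fwf",
  col_positions = fwf_widths(c(8, 2, 10), c("id", "region", "amount"))
)

# Convert once to Parquet, then work with it lazily via Arrow Datasets.
write_parquet(df, "provisioned-data.parquet")

ds <- open_dataset("provisioned-data.parquet")
```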
With an {arrow} fixed-width reader, we could perhaps leverage arrow::open_dataset(as_data_frame = FALSE) directly on a large fwf file and then convert to partitioned .parquet files with arrow::write_dataset()?
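A sketch of what that might look like, assuming a hypothetical format = "fwf" option and fixed-width reader options that do not exist in {arrow} today:

```r
library(arrow)

# HYPOTHETICAL: {arrow} has no fixed-width reader today. The
# format = "fwf" option and the width/name arguments below are
# invented to illustrate the proposed workflow.
ds <- open_dataset(
  "provisioned-data.fwf",
  format = "fwf",                            # hypothetical format
  col_widths = c(8, 2, 10),                  # hypothetical option
  col_names  = c("id", "region", "amount")   # hypothetical option
)

# This part exists today: stream a Dataset into partitioned Parquet
# without materializing it as a data frame.
write_dataset(ds, "parquet-dir", partitioning = "region")
```

The second step already works for Dataset objects via write_dataset(), so the fixed-width scan itself is the missing piece.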
It would also be great to have this functionality exposed in Python. Currently one can use the pandas fixed-width reader and convert to pyarrow, but that comes with many caveats.
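For reference, a minimal sketch of the current pandas route (the file path, widths, and names are illustrative):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# pandas reads the entire fixed-width file into memory, so the same
# RAM bottleneck applies at this step.
df = pd.read_fwf(
    "provisioned-data.fwf",
    widths=[8, 2, 10],
    names=["id", "region", "amount"],
)

# Convert to Arrow and write Parquet.
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "provisioned-data.parquet")
```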
Reporter: Stephanie Hazlitt / @stephhazlitt
Related issues:
Note: This issue was originally created as ARROW-11587. Please see the migration documentation for further details.