DOI-USGS / national-flow-observations

This repository pulls national flow data from NWIS

Use faster file format for temporary files #16

Closed: wdwatkins closed this issue 2 years ago

wdwatkins commented 3 years ago

Right now we are writing and reading at least several hundred (maybe over 1000) compressed RDS files for the data pull partitions. These files never go over the wire, so compression probably isn't worthwhile since it slows down read/write. We could switch to something like fst or maybe qs (a newer one, used in targets).
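A minimal sketch of how the trade-off described above could be checked, assuming a synthetic data frame as a stand-in for one partition. The column names, row count, and file names here are hypothetical and are not taken from the pipeline. It times one write and one read for compressed RDS and uncompressed RDS, and for qs only if that package happens to be installed.

```r
# Hypothetical stand-in for one data-pull partition (not real NWIS data).
set.seed(1)
n_rows <- 1e5
partition <- data.frame(
  site_no = sprintf("%08d", sample(1e6, n_rows, replace = TRUE)),
  date = as.Date("2021-04-22") - sample(3650, n_rows, replace = TRUE),
  flow = round(runif(n_rows, 0, 5000), 1),
  stringsAsFactors = FALSE
)

# Scratch directory so nothing in the real tmp folder is touched.
tmp_dir <- tempfile("partition_timing_")
dir.create(tmp_dir)

# Time one write and one read, and record the size on disk.
time_round_trip <- function(label, path, write_fun, read_fun) {
  write_secs <- system.time(write_fun(partition, path))[["elapsed"]]
  read_secs <- system.time(result <- read_fun(path))[["elapsed"]]
  stopifnot(identical(dim(result), dim(partition)))
  data.frame(
    format = label,
    write_secs = write_secs,
    read_secs = read_secs,
    size_mb = file.size(path) / 1e6
  )
}

timings <- rbind(
  # saveRDS() compresses with gzip by default.
  time_round_trip(
    "rds_compressed", file.path(tmp_dir, "partition_gz.rds"),
    function(x, f) saveRDS(x, f), readRDS
  ),
  time_round_trip(
    "rds_uncompressed", file.path(tmp_dir, "partition_raw.rds"),
    function(x, f) saveRDS(x, f, compress = FALSE), readRDS
  )
)

# qs is optional here; skip it quietly when the package is not installed.
if (requireNamespace("qs", quietly = TRUE)) {
  timings <- rbind(
    timings,
    time_round_trip(
      "qs", file.path(tmp_dir, "partition.qs"),
      function(x, f) qs::qsave(x, f), qs::qread
    )
  )
}

print(timings)
```

Timings from a single run on synthetic data are noisy, so this is only a rough guide. Repeating the measurement on a few real partitions would give a better basis for choosing a format.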

wdwatkins commented 3 years ago

Writing out the single large dv and uv files is also very slow. These do get uploaded to S3, so size on disk is relevant there, but there might be faster options.

padilla410 commented 2 years ago

I'm going to go with qs. It looks like fst might be faster, but from one blog:

Between both the {fst} package and readRDS() function in base R is quick serialization with the {qs} package. The average read time is almost twice the average read time from the {fst} package. However, the difference is a matter of seconds. Plus, the {qs} package allows the user to retain important data frame information, like variable/value labels and whether the data frame is a tibble.

...and the qs documentation:

saveRDS and readRDS are the standard for serialization of R data, but these functions are not optimized for speed. On the other hand, fst is extremely fast, but only works on data.frame’s and certain column types.

The only thing giving me pause at this point is that qs is currently at v0.25.3, which may not be as stable as we would like long term; however, it is supported by targets.

padilla410 commented 2 years ago

@wdwatkins I have modified the pipeline to write the individual downloaded partitions as qs files in the 10_nwis_pull/tmp directory (e.g., dv_210422_001.qs instead of dv_210422_001.rds).

Do you have an opinion about the format for nwis_dv_data.rds? Should this stay RDS or should it also be modified to qs?

wdwatkins commented 2 years ago

I suppose it's probably best to leave it as RDS for now? We could consider feather to allow use in Python, though I'm not sure how that compares in terms of read/write time (or how much that matters in general). But interoperability is definitely a priority for the final output. Something like qs is probably better kept internal.