Writing out the single large dv and uv files is also very slow. These do get uploaded to S3, so size on disk is relevant there, but there might be faster options.
I'm going to go with `qs`. It looks like `fst` might be faster, but from one blog:

> Between both the {fst} package and readRDS() function in base R is quick serialization with the {qs} package. The average read time is almost twice the average read time from the {fst} package. However, the difference is a matter of seconds. Plus, the {qs} package allows the user to retain important data frame information, like variable/value labels and whether the data frame is a tibble.
...and the `qs` documentation:

> `saveRDS` and `readRDS` are the standard for serialization of R data, but these functions are not optimized for speed. On the other hand, `fst` is extremely fast, but only works on data.frame's and certain column types.
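The kind of read/write comparison those quotes describe is easy to reproduce locally. A minimal sketch, assuming a data.frame-shaped partition (the test data and file names are made up; the read/write functions are the packages' actual APIs):

```r
library(microbenchmark)
library(fst)
library(qs)

# Fake partition roughly shaped like a dv pull result
df <- data.frame(
  site_no = sprintf("%08d", sample(1e7, 1e6, replace = TRUE)),
  date    = as.Date("2021-04-22") - sample(0:3650, 1e6, replace = TRUE),
  value   = rnorm(1e6),
  cd      = sample(c("A", "P"), 1e6, replace = TRUE)
)

# Round-trip (write + read) timings for each format
microbenchmark(
  rds = { saveRDS(df, "part.rds"); readRDS("part.rds") },
  fst = { write_fst(df, "part.fst"); read_fst("part.fst") },
  qs  = { qsave(df, "part.qs"); qread("part.qs") },
  times = 10
)
```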
The only thing giving me pause at this point is the fact that `qs` is currently v0.25.3, which may not be as stable as we would like long-term; however, it is supported by `targets`.
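For reference, in `targets` that support amounts to a one-argument change when declaring a target. A minimal sketch; the target and function names below are hypothetical, not from this pipeline:

```r
library(targets)

# In _targets.R: store the target with qs instead of the default RDS.
# fetch_dv_partition() and partition_info are hypothetical placeholders.
tar_target(
  nwis_dv_partition,
  fetch_dv_partition(partition_info),
  format = "qs"  # targets serializes this target with qs::qsave()/qs::qread()
)
```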
@wdwatkins I have modified the pipeline to use `qs` to download the individual partitions to the `10_nwis_pull/tmp` directory (e.g., `dv_210422_001.qs` instead of `dv_210422_001.rds`).
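In practice the swap is a drop-in replacement in the partition writer/reader; a sketch with a stand-in data frame (not the actual pipeline code):

```r
library(qs)

partition_data <- data.frame(site_no = "01010000", value = 1.5)  # stand-in for a real pull
dir.create("10_nwis_pull/tmp", recursive = TRUE, showWarnings = FALSE)

# Before: saveRDS(partition_data, "10_nwis_pull/tmp/dv_210422_001.rds")
# After: same call shape, different function and extension
qsave(partition_data, "10_nwis_pull/tmp/dv_210422_001.qs")

# Reading a partition back when assembling the combined file
partition_data <- qread("10_nwis_pull/tmp/dv_210422_001.qs")
```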
Do you have an opinion about the format for `nwis_dv_data.rds`? Should this stay RDS, or should it also be modified to `qs`?
I suppose it's probably best to leave it as RDS for now? Could consider `feather` to allow us to read it in Python; not sure how that compares in terms of read/write time (or how much that matters in general). But interoperability is definitely a priority for the final output. Something like `qs` is probably better kept internal.
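If we go that route, the `arrow` package covers the R side and the same file opens directly in Python. A sketch; the `.feather` file name is just the RDS name swapped over, not something that exists yet:

```r
library(arrow)

df <- data.frame(site_no = "01010000", value = 1.5)  # stand-in for the combined dv data

# R side: write and read Feather
write_feather(df, "nwis_dv_data.feather")
df2 <- read_feather("nwis_dv_data.feather")

# Python side would then be, e.g.:
#   import pandas as pd
#   df = pd.read_feather("nwis_dv_data.feather")
```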
Right now we are writing and reading at least several hundred (maybe over 1000) compressed RDS files for the data pull partitions. These files never go over the wire, so compression probably isn't worthwhile since it slows down read/write. We could switch to something like `fst` or maybe `qs` (a newer one, used in `targets`).
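Worth noting that even before switching formats, base R lets us drop compression for these local-only files, and `qs` exposes a speed-oriented preset. A sketch (file names made up):

```r
df <- data.frame(x = rnorm(1e6))  # stand-in partition

# Base R: skip gzip compression entirely for scratch files that never leave disk
saveRDS(df, "scratch.rds", compress = FALSE)

# qs: the "fast" preset trades compression ratio for write/read speed
qs::qsave(df, "scratch.qs", preset = "fast")
```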