nmdefries opened 1 year ago
We're currently using a special AWS `readRDS` function, but `fread` should be able to read directly from the S3 bucket, since `fread` can read from URLs and the S3 bucket is publicly accessible.
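For example, a minimal sketch of reading straight from the public bucket (the URL and object key here are hypothetical):

```r
library(data.table)

# fread accepts http(s) URLs; it downloads the object to a temp file and
# reads from there. Reading .csv.gz additionally requires the R.utils package.
dt <- fread("https://example-bucket.s3.amazonaws.com/data/hospitalizations.csv.gz")
```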
My preference is to store the data as CSV, and read/write using `data.table`, for speed and the potential for additional speedup. One consideration is that RDS stores the column types while CSV does not, so when switching to CSV we will need to worry about, e.g., setting date timezones, or strings-as-factors causing weird behavior.
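A minimal sketch of pinning column types at read time, so CSV round-trips match what RDS preserved automatically (the file and column names are hypothetical):

```r
library(data.table)

# Explicitly fix the column types that RDS used to carry for us.
dt <- fread(
  "hospitalizations.csv",
  colClasses = c(geo_value = "character", value = "double", date = "Date"),
  stringsAsFactors = FALSE,  # keep strings as character, not factor
  tz = "UTC"                 # parse datetime columns as UTC, not local time
)
```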
`data.table` is definitely faster at writing, but only equally fast at reading from the S3 bucket. The cause seems to be that compressed RDS files are smaller than gzipped CSVs. Since reading from the S3 bucket involves downloading an object to a temporary file, downloading the larger gzipped CSVs is slow. The increased download time eats into the speedup of using `fread` over `readRDS`.
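Assuming the "special AWS readRDS function" is `aws.s3::s3readRDS` (an assumption on my part), a rough sketch of how the two read paths could be timed against each other (bucket and object names hypothetical):

```r
library(data.table)
library(aws.s3)
library(microbenchmark)

# Compare reading a gzipped CSV over HTTPS vs. reading the RDS via aws.s3.
microbenchmark(
  csv = fread("https://example-bucket.s3.amazonaws.com/data.csv.gz"),
  rds = s3readRDS(object = "data.rds", bucket = "example-bucket"),
  times = 5
)
```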
A potential improvement to this would be to download objects from S3 to permanent files on disk (we can do this manually when we detect a change in the bucket state, or use `aws.s3::s3sync` to keep files up to date) so that most users don't need to re-download objects and can just read from disk. Do files persist on the dashboard hosts?
Download time would also be less significant if we had more, smaller files, so that we only have to load small subsets of the data.
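For instance, with a hypothetical layout partitioned by signal, geo type, and month, a dashboard view would only download the one small object it needs:

```r
library(data.table)

# Hypothetical partitioned layout: one object per signal / geo type / month.
base <- "https://example-bucket.s3.amazonaws.com"
dt <- fread(paste0(base, "/hospitalizations/state/2023-10.csv.gz"))
```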
I had originally wanted to store the data in uncompressed CSVs, with the thought that a future optimization could append data to the end of an existing file without reading in the whole thing (if deduplication weren't needed; see the sketch below). However, uncompressed CSVs are too slow to read directly from the S3 bucket into the dashboard to be feasible, since the temp download step has to transfer the full uncompressed file. They could work if we kept an on-disk cache of the S3 bucket.
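The append idea, sketched with hypothetical column names:

```r
library(data.table)

# Append new rows to the end of an existing uncompressed CSV without
# rewriting the file. Only safe when no deduplication against existing
# rows is needed.
new_rows <- data.table(date = as.IDate("2023-10-01"), geo_value = "ca", value = 123)
fwrite(new_rows, "hospitalizations.csv", append = TRUE)
```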
Running into memory issues, even when only processing hospitalizations.
Using RDS to read/write/store data is slow and not portable. We may want to rewrite the data pipeline or dashboard in Python in the future, so using a format that Python can easily read would be preferable.
Looking at a comparison of different formats and packages, storing as a CSV and reading/writing using `data.table` via `fread` and `fwrite` seems pretty good. `feather` could also be an option (it is supported in both R and Python), but it's not clear if standard dataframe and `dplyr` procedures work with it. `data.table` data can be seamlessly processed either with `dplyr` or with faster `data.table` syntax.
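As a rough illustration (file and column names hypothetical), the same `data.table` can be queried with either syntax, and `fwrite` compresses based on the file extension:

```r
library(data.table)
library(dplyr)

dt <- fread("hospitalizations.csv")

# dplyr verbs work on a data.table directly (it is also a data.frame)...
dt %>% filter(geo_value == "ca") %>% summarise(mean(value))

# ...or use data.table syntax for speed.
dt[geo_value == "ca", mean(value)]

fwrite(dt, "hospitalizations.csv.gz")  # gzips because of the .gz extension
```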