nmdefries opened 1 year ago
We're currently using a special AWS `readRDS` function, but `fread` should be able to read directly from the S3 bucket, since `fread` can read from URLs and the S3 bucket is publicly accessible.
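For example, a minimal sketch of reading straight from the public bucket (the URL and object key here are hypothetical):

```r
library(data.table)

# fread accepts http(s) URLs; it downloads the object to a temp file and
# reads from there. Reading .csv.gz additionally requires the R.utils package.
dt <- fread("https://example-bucket.s3.amazonaws.com/data/hospitalizations.csv.gz")
```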
My preference is to store the data as CSV, and read/write using `data.table`, for speed and the potential for additional speedup. One consideration is that RDS stores the column types while CSV does not, so when switching to CSV we will need to worry about, e.g., setting date timezones, or strings-as-factors causing weird behavior.
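A minimal sketch of pinning column types at read time, so CSV round-trips match what RDS preserved automatically (the file and column names are hypothetical):

```r
library(data.table)

# Explicitly fix the column types that RDS used to carry for us.
dt <- fread(
  "hospitalizations.csv",
  colClasses = c(geo_value = "character", value = "double", date = "Date"),
  stringsAsFactors = FALSE,  # keep strings as character, not factor
  tz = "UTC"                 # parse datetime columns as UTC, not local time
)
```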
`data.table` is definitely faster at writing, but only equally fast at reading from the S3 bucket. The cause seems to be that compressed RDS files are smaller than gzipped CSVs. Since reading from the S3 bucket involves downloading an object to a temporary file, downloading the larger gzipped CSVs is slow. The increased download time eats into the speedup of using `fread` over `readRDS`.
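Assuming the "special AWS readRDS function" is `aws.s3::s3readRDS` (an assumption on my part), a rough sketch of how the two read paths could be timed against each other (bucket and object names hypothetical):

```r
library(data.table)
library(aws.s3)
library(microbenchmark)

# Compare reading a gzipped CSV over HTTPS vs. reading the RDS via aws.s3.
microbenchmark(
  csv = fread("https://example-bucket.s3.amazonaws.com/data.csv.gz"),
  rds = s3readRDS(object = "data.rds", bucket = "example-bucket"),
  times = 5
)
```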
A potential improvement to this would be to download objects from S3 to permanent files on disk (we can do this manually when we detect a change in the bucket state, or use `aws.s3::s3sync` to keep files up to date) so that most users don't need to re-download objects and can just read from disk. Do files persist on the dashboard hosts?
Download time would also be less significant if we had more, smaller files, so that we only have to load small subsets of the data.
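For instance, with a hypothetical layout partitioned by signal, geo type, and month, a dashboard view would only download the one small object it needs:

```r
library(data.table)

# Hypothetical partitioned layout: one object per signal / geo type / month.
base <- "https://example-bucket.s3.amazonaws.com"
dt <- fread(paste0(base, "/hospitalizations/state/2023-10.csv.gz"))
```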
I had originally wanted to store the data in uncompressed CSVs, with the thought that a future optimization could append data to the end of an existing file without reading in the whole thing (if deduplication weren't needed; see the sketch below). However, uncompressed CSVs are too slow to read directly from the S3 bucket into the dashboard to be feasible, since the temp download step has to transfer the full uncompressed file. They could work if we kept an on-disk cache of the S3 bucket.
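The append idea, sketched with hypothetical column names:

```r
library(data.table)

# Append new rows to the end of an existing uncompressed CSV without
# rewriting the file. Only safe when no deduplication against existing
# rows is needed.
new_rows <- data.table(date = as.IDate("2023-10-01"), geo_value = "ca", value = 123)
fwrite(new_rows, "hospitalizations.csv", append = TRUE)
```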
Running into memory issues, even when only processing hospitalizations.
Using RDS to read/write/store data is slow and not portable. We may want to rewrite the data pipeline or dashboard in Python in the future, so using a format that Python can easily read would be preferable.
Looking at a comparison of different formats and packages, storing as a CSV and reading/writing using `data.table` via `fread` and `fwrite` seems pretty good. `feather` could also be an option (it is supported in both R and Python), but it's not clear if standard dataframe and `dplyr` procedures work with it. `data.table` data can be seamlessly processed either with `dplyr` or with faster `data.table` syntax.
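As a rough illustration (file and column names hypothetical), the same `data.table` can be queried with either syntax, and `fwrite` compresses based on the file extension:

```r
library(data.table)
library(dplyr)

dt <- fread("hospitalizations.csv")

# dplyr verbs work on a data.table directly (it is also a data.frame)...
dt %>% filter(geo_value == "ca") %>% summarise(mean(value))

# ...or use data.table syntax for speed.
dt[geo_value == "ca", mean(value)]

fwrite(dt, "hospitalizations.csv.gz")  # gzips because of the .gz extension
```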