cmu-delphi / covidcast-indicators

Back end for producing indicators and loading them into the COVIDcast API.
https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html
MIT License

Add `nchs-mortality` raw data backups and backup export utility #2065

Closed nmdefries closed 3 weeks ago

nmdefries commented 1 month ago

Description

Add nchs-mortality raw data backups and backup export utility

Changelog

Associated Issue(s)

Context and writeup

nmdefries commented 1 month ago

I guess the test is failing (on linting, with `delphi_nchs_mortality/pull.py:11:0: E0611: No name 'create_backup_csv' in module 'delphi_utils' (no-name-in-module)`) because the new function is being added to `delphi_utils` in the same PR, so the released `delphi_utils` that CI lints against doesn't have it yet.
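If the two changes need to land together, one possible stopgap (an assumption on my part, not necessarily what this CI setup requires) is a targeted lint suppression at the import site, removed once the new utility ships in a released `delphi_utils`:

```python
# delphi_nchs_mortality/pull.py
# Temporarily silence the false positive while create_backup_csv only
# exists in the in-repo delphi_utils, not the released package.
from delphi_utils import create_backup_csv  # pylint: disable=no-name-in-module
```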

Also, tests for the new `create_backup_csv` function still need to be added, but this PR shows the idea of how it should work; see the sketch below. Adding backups to other indicators should go faster after this.
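For readers skimming the thread, a minimal sketch of what such a backup-export utility could look like. This is a hypothetical illustration only; the actual `create_backup_csv` added to `delphi_utils` may differ in signature and behavior:

```python
"""Hypothetical sketch of a raw-data backup utility, not the real API."""

from datetime import datetime
from pathlib import Path

import pandas as pd


def create_backup_csv(df: pd.DataFrame, backup_dir: str, custom_run: bool,
                      logger=None):
    """Stash a copy of the raw source data as <YYYYMMDD>.csv.gz in backup_dir."""
    if custom_run:
        return  # skip stashing on ad-hoc/custom runs

    issue = datetime.today().strftime("%Y%m%d")
    backup_file = Path(backup_dir) / f"{issue}.csv.gz"
    # Gzip compression keeps the daily stash small on disk.
    df.to_csv(backup_file, index=False, compression="gzip")
    if logger:
        logger.info(f"Backed up raw data to {backup_file}")
```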

nmdefries commented 1 month ago

Thanks for your quick feedback @minhkhul!

> Add some logging to note on which indicator stashing is done

Agreed. Related to this, @korlaxxalrok suggested including metadata in each day's backup, or unique IDs we can use to track the provenance of downstream data. Designing that is likely too complex, and thus too slow, for getting V1 of the data backups out, but it could be very useful in the future.

> Adjust the params.json.template in nchs_mortality as well.

I don't have strong feelings about this, but given the default that the `custom_run` param takes in the code, we don't necessarily need to add it to `params.json`.
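For reference, reading the flag with a default is what makes omitting it safe. A sketch, assuming a `common` section in params (the actual key layout may differ):

```python
# Hypothetical params handling: fall back to False when the key is absent,
# so production configs don't need to list custom_run explicitly.
custom_run = params.get("common", {}).get("custom_run", False)
```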

> suggestion: When I wrote and ran a similar script to stash the nssp source, the small VM it ran on ran out of disk space at one point. To save disk space, apart from zipping the output, I also added a check for whether the dataset had changed at all relative to the latest csv.gz on disk, and only saved a new version after confirming there was a difference. That's helpful on a weekly signal like nssp. I think it'd be nice to add here, but it's not needed.

Hm, so we've found that saving data like this causes storage issues. Since you refer to a "VM", I wonder if the limit you hit was the VM's (O(1 GB)) rather than the host machine's (O(100 GB)). How big is that entire collection of backups?

RE "only sav[ing] the latest new version of the dataset after confirming there's a difference" with the last backup, do we think this is safe/robust enough to do? One initial concern is that this is starting to sound like "archive differ V2". Of course, it's simpler than the current one, but any extra code increases the risk of introducing bugs. To know how to balance the risk, we'd want an estimate of how big the data backups would be.

minhkhul commented 1 month ago

Yep, I very much agree about the potential for an "archive differ V2" problem. Let's scratch that for now.

minhkhul commented 1 month ago

Also, I've been running this locally every day since yesterday, at the same time as the normal nchs run, and keeping the backup files, so we can take our time with this PR.