cmu-delphi / covidcast-indicators

Back end for producing indicators and loading them into the COVIDcast API.
https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html
MIT License
12 stars 17 forks source link

Indicator runners should output files with issue date #1907

Open melange396 opened 1 year ago

melange396 commented 1 year ago

Indicator runners should output CSV files with issue date, wherever possible.

Most indicators (if not all of the currently active indicators) output CSV files without an "issue date" saved/encoded anywhere in or around them... They assume the issue date is "today", and that the files will be ingested into the database the same day (our acquisition process also assumes an issue date of "today" (by default) upon reading these files). This can lead to inaccurate "issue" columns when the data finally makes it to the database, if the acquisition job(s) are broken, backed up by a long queue, or otherwise delayed.

If we export with an explicit issue date, it does not matter when the files are consumed, the "issue" should still be accurate. In fact, this can make it so re-importing the same CSV files multiple times is an idempotent operation. It will help us when there are problems with our systems in real-time (as listed above), plus it will simplify things if we need to import CSV files at some later date (such as adding new data files on top of a restored database snapshot). This can also be useful when the external data source specifies an issue date explicitly.

There is a provision in our acquisition process to use an issue date that is taken from the directory structure. The "nowcast" indicator seems to be able to produce this directory structure, but AFAICT this indicator is not being run successfully anywhere at present.

melange396 commented 1 year ago

see my novelesque slack message about applying this to the hhs indicator, whose source data includes an issue date that can be carried through to the output files.

melange396 commented 4 months ago

Note that this change is necessary but probably not sufficient for making our process resilient to assuming an "issue" date of now/today; parts of the acquisition process, including the input file directory structure and the post-process file handling behaviors will need to change, along with similar/analogous changes to the things in Cronicle that move these files around and trigger acquisition runs.

minhkhul commented 2 months ago

There are 2 ways I believe we can go about this.

Approach 1: New PATTERN_ISSUE_DAILY

Right now we use these naming patterns to get the right time_value, signal, geo_type. Issue value is just the immediate date when data was processed. So let's add a new issue-date-included filename format and use it. This includes adding on to acquisition code ways to process csv file names based on this new filename format and add on to export_to_csv to allow exporting csv files with this new name pattern. Then move indicators one by one to using this new format.

Steps:

Approach 2: Use batch issue upload as is

Instead of adding a new name pattern, just use the batch issue upload mechanism we've been using for patching so far and apply it to normal daily runs.

Steps:

@melange396 Which way do you think we should move forward with?

minhkhul commented 2 months ago

I'm working on approach 1. Seems more comprehensive and less shuffling current directory setups in prod, which will disrupt current runs.

minhkhul commented 1 month ago

Block by archive differ review. The archive differ saves the latest issue of a time_value. For example, assume data in this time_value/geo/signal combinations 20200506_state_deaths_7dav_cumulative_num.csvhas two versions, one posted on issue 20241009 and one posted on 20241010. Right now, our archive differ would correctly identify that there are potentially two versions of the same data, and process the data accordingly. If we bake the issue date into csv file name, like 20241009_20200506_state_deaths_7dav_cumulative_num.csv the current version of archive differ, as implemented here, won't be able to identify that the two csv 20241009_20200506_state_deaths_7dav_cumulative_num.csv and 20241010_20200506_state_deaths_7dav_cumulative_num.csv are referring to the same time_value, and would just save both to the s3 bucket, which defeats the purpose of archive differ. If we do some directory manipulation like the patching output structure, it will effect how export_dir is structured. With the way we run archive differ now, the pipeline would just break. So either way, it takes some rewriting of archive differ. And since the archive differ is going through some reviews, let's wait till that's done.