melange396 commented 1 year ago

Indicator runners should output CSV files with issue date, wherever possible.

Most indicators (if not all of the currently active ones) output CSV files without an "issue date" saved or encoded anywhere in or around them. They assume the issue date is "today" and that the files will be ingested into the database the same day (our acquisition process also assumes an issue date of "today" by default when reading these files). This can lead to inaccurate "issue" columns when the data finally reaches the database if the acquisition jobs are broken, backed up by a long queue, or otherwise delayed.

If we export with an explicit issue date, it does not matter when the files are consumed; the "issue" will still be accurate. In fact, this can make re-importing the same CSV files multiple times an idempotent operation. It will help us when there are real-time problems with our systems (as listed above), and it will simplify things if we need to import CSV files at some later date (such as adding new data files on top of a restored database snapshot). It is also useful when the external data source specifies an issue date explicitly.
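
To illustrate the idempotency argument, here is a minimal sketch. It is not the real acquisition code: the `ingest` helper, the dict standing in for the database, and the row fields are all hypothetical. It only assumes that a stored value is uniquely identified by its signal, geo, time_value, and issue.

```python
from datetime import date

def ingest(db, rows, issue):
    """Upsert rows into a dict keyed by (signal, geo_type, geo_id, time_value, issue).

    Hypothetical stand-in for the database: one entry per distinct key.
    """
    for row in rows:
        key = (row["signal"], row["geo_type"], row["geo_id"], row["time_value"], issue)
        db[key] = row["value"]
    return db

# One made-up row, as it might appear in an exported CSV file.
rows = [{"signal": "example_signal", "geo_type": "state", "geo_id": "pa",
         "time_value": date(2024, 9, 1), "value": 1.5}]

# Explicit issue date carried with the file: re-importing changes nothing.
explicit = {}
ingest(explicit, rows, issue=date(2024, 9, 6))
ingest(explicit, rows, issue=date(2024, 9, 6))  # same file, imported again later

# Issue date defaulting to "today": a delayed re-import creates a spurious second issue.
implicit = {}
ingest(implicit, rows, issue=date(2024, 9, 6))  # ingested the day it was produced
ingest(implicit, rows, issue=date(2024, 9, 9))  # same file, ingested three days later
```

With the explicit issue the store ends up with a single entry; with the implicit one it ends up with two entries for the same underlying data.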

There is a provision in our acquisition process to use an issue date that is taken from the directory structure. The "nowcast" indicator seems to be able to produce this directory structure, but AFAICT this indicator is not being run successfully anywhere at present.

melange396 commented 1 year ago

See my novelesque Slack message about applying this to the hhs indicator, whose source data includes an issue date that can be carried through to the output files.

melange396 commented 4 months ago

Note that this change is necessary but probably not sufficient to stop our process from depending on an assumed "issue" date of now/today. Parts of the acquisition process, including the input file directory structure and the post-processing file handling, will need to change, along with analogous changes to the things in Cronicle that move these files around and trigger acquisition runs.

minhkhul commented 2 months ago

There are 2 ways I believe we can go about this.

Approach 1: New PATTERN_ISSUE_DAILY

Right now we use these naming patterns to get the right time_value, signal, and geo_type from each file name. The issue value is just the date on which the data was processed. So let's add a new filename format that includes the issue date and use it. This means extending the acquisition code to parse CSV file names in the new format, and extending export_to_csv to write CSV files with the new name pattern. Then we move indicators to the new format one by one.

Steps:

Acquisition code refactor:
- Add this new pattern to class CsvImporter:
```
#.../source/issueyyyymmdd_timevalueyyyymmdd_geotype_signal_name.csv
PATTERN_ISSUE_DAILY=r'^.*/([^/]*)/(\d{8})_(\d{8})_(\w+?)_(\w+)\.csv$'
```
  This pattern just adds the issue date to the front of the existing PATTERN_DAILY:
```
#.../source/timevalueyyyymmdd_geotype_signal_name.csv
PATTERN_DAILY=r'^.*/([^/]*)/(\d{8})_(\w+?)_(\w+)\.csv$'
```
- Similarly, add the weekly format equivalent, so this is applicable to weekly time type indicators as well.
- Adjust the current patterns to avoid overlap. A file name like 20240907_20240901_geo_signal_name.csv would be matched by both PATTERN_DAILY and PATTERN_ISSUE_DAILY, because the \w+? geo_type group in PATTERN_DAILY also accepts digits. To avoid this, change PATTERN_DAILY so that geo_type can only contain letters, e.g. [a-zA-Z]+ instead of the current \w+? group. All current geo_type values and their file names will continue to work with this change.
- Extend find_csv_files so it processes the new pattern accordingly.
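
The overlap and the proposed fix can be sketched as follows. The two original patterns are copied from above; the `_STRICT` variants and the `classify` helper are hypothetical names for illustration and are not part of CsvImporter.

```python
import re

# Patterns as written in the proposal above.
PATTERN_DAILY = r'^.*/([^/]*)/(\d{8})_(\w+?)_(\w+)\.csv$'
PATTERN_ISSUE_DAILY = r'^.*/([^/]*)/(\d{8})_(\d{8})_(\w+?)_(\w+)\.csv$'

# Hypothetical tightened variants: geo_type may only contain letters.
PATTERN_DAILY_STRICT = r'^.*/([^/]*)/(\d{8})_([a-zA-Z]+)_(\w+)\.csv$'
PATTERN_ISSUE_DAILY_STRICT = r'^.*/([^/]*)/(\d{8})_(\d{8})_([a-zA-Z]+)_(\w+)\.csv$'

def classify(path):
    """Return (kind, groups) where kind is 'issue_daily', 'daily', or None."""
    match = re.match(PATTERN_ISSUE_DAILY_STRICT, path)
    if match:
        return "issue_daily", match.groups()
    match = re.match(PATTERN_DAILY_STRICT, path)
    if match:
        return "daily", match.groups()
    return None, ()

old_style = "receiving/chng/20230710_county_smoothed_adj_outpatient_cli.csv"
new_style = "receiving/chng/20240907_20240901_county_smoothed_adj_outpatient_cli.csv"

# With the current PATTERN_DAILY, the new-style name also matches,
# and the time_value is misread as the geo_type.
overlap = re.match(PATTERN_DAILY, new_style)
misread_geo_type = overlap.group(3)  # "20240901"

# With the letters-only geo_type, each name matches exactly one pattern.
old_kind, old_groups = classify(old_style)
new_kind, new_groups = classify(new_style)
```

Trying the issue pattern first is a design choice in this sketch only; once geo_type is restricted to letters the two strict patterns cannot both match the same name, so the order no longer matters.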
After the acquisition code refactor, refactor create_export_csv so it can output CSV files in the new format. create_export_csv will be used by all current indicators once https://github.com/cmu-delphi/covidcast-indicators/pull/2032 is merged.
To put everything together, for one indicator/pipeline at a time:
- Refactor the indicator code so an issue date can be derived from the versioned source file (if applicable) and fed into create_export_csv.
- Set up an acquisition Cronicle job to run at night, separate from the current acquisition run.
- Verify the results by running some test scenarios on staging.
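
On the export side, the new file name could be built along these lines. This is only a sketch: `issue_daily_filename` is a hypothetical helper, not an existing function, and it does not reflect the actual signature of create_export_csv.

```python
import re
from datetime import date

def issue_daily_filename(issue, time_value, geo_type, signal):
    """Build an issueyyyymmdd_timevalueyyyymmdd_geotype_signal_name.csv file name.

    Hypothetical helper. It rejects geo_type values that are not purely
    alphabetic so the name stays compatible with a letters-only geo_type pattern.
    """
    if not re.fullmatch(r"[a-zA-Z]+", geo_type):
        raise ValueError(f"geo_type must contain only letters, got {geo_type!r}")
    return f"{issue:%Y%m%d}_{time_value:%Y%m%d}_{geo_type}_{signal}.csv"

name = issue_daily_filename(
    issue=date(2024, 9, 7),        # e.g. derived from the versioned source file
    time_value=date(2024, 9, 1),
    geo_type="county",
    signal="smoothed_adj_outpatient_cli",
)
```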

Approach 2: Use batch issue upload as is

Instead of adding a new name pattern, just use the batch issue upload mechanism we've been using for patching so far and apply it to normal daily runs.

Steps:

Refactor one indicator runner at a time so it can output normal daily runs in batch issue format. Instead of the current setup:
```
changehc/
  receiving/
    20230710_county_smoothed_adj_outpatient_cli.csv
    ....
```
the output under receiving, with "export_dir": "./receiving", would look like this:
```
changehc/
  receiving/
    issue_20240906/
      chng/
        20230710_county_smoothed_adj_outpatient_cli.csv
        ...
```
The issue date here would be taken from the versioned source files.
Adjust the acquisition run accordingly.
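
For reference, reading the issue date back out of that directory layout could look roughly like this. The regex and the `parse_batch_issue_path` helper are assumptions made for illustration; the pattern the acquisition code actually uses for batch issue uploads may differ.

```python
import re
from datetime import datetime

# Hypothetical pattern for .../issue_yyyymmdd/source/timevalueyyyymmdd_geotype_signal_name.csv
BATCH_ISSUE_PATTERN = re.compile(
    r'^.*/issue_(\d{8})/([^/]+)/(\d{8})_([a-zA-Z]+)_(\w+)\.csv$'
)

def parse_batch_issue_path(path):
    """Return a dict of path components, or None if the path is not in batch issue layout."""
    match = BATCH_ISSUE_PATTERN.match(path)
    if match is None:
        return None
    issue, source, time_value, geo_type, signal = match.groups()
    return {
        "issue": datetime.strptime(issue, "%Y%m%d").date(),
        "source": source,
        "time_value": datetime.strptime(time_value, "%Y%m%d").date(),
        "geo_type": geo_type,
        "signal": signal,
    }

batch = parse_batch_issue_path(
    "receiving/issue_20240906/chng/20230710_county_smoothed_adj_outpatient_cli.csv"
)
flat = parse_batch_issue_path(
    "receiving/chng/20230710_county_smoothed_adj_outpatient_cli.csv"
)
```

Here `batch` carries the issue date from the directory name, while `flat` is None because the current layout has no issue directory to read from.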

@melange396 Which approach do you think we should move forward with?

minhkhul commented 2 months ago

I'm working on approach 1. It seems more comprehensive and requires less shuffling of the current directory setups in prod, which would disrupt current runs.

minhkhul commented 1 month ago

Blocked by the archive differ review. The archive differ saves the latest issue of a time_value. For example, assume the data for the time_value/geo/signal combination 20200506_state_deaths_7dav_cumulative_num.csv has two versions, one posted on issue 20241009 and one posted on issue 20241010. Right now, our archive differ would correctly identify that there are potentially two versions of the same data and process them accordingly.

If we bake the issue date into the CSV file name, like 20241009_20200506_state_deaths_7dav_cumulative_num.csv, the current version of the archive differ, as implemented here, won't be able to identify that the two files 20241009_20200506_state_deaths_7dav_cumulative_num.csv and 20241010_20200506_state_deaths_7dav_cumulative_num.csv refer to the same time_value, and would just save both to the S3 bucket, which defeats the purpose of the archive differ. If we instead do some directory manipulation like the patching output structure, it will affect how export_dir is structured, and with the way we run the archive differ now, the pipeline would just break.

So either way, it takes some rewriting of the archive differ. And since the archive differ is going through some reviews, let's wait until that's done.
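
One way the rewrite could recognize that two issue-prefixed files cover the same time_value is to strip the issue prefix before comparing. The sketch below only illustrates that idea. It is not how the archive differ is currently implemented, and `latest_per_time_value` is a hypothetical helper.

```python
import re

# Hypothetical: issue prefix, then the usual timevalue_geotype_signal_name.csv name.
ISSUE_PREFIXED = re.compile(r'^(\d{8})_(\d{8}_[a-zA-Z]+_\w+\.csv)$')

def latest_per_time_value(filenames):
    """Map each un-prefixed file name to the issue-prefixed file with the newest issue."""
    latest = {}
    for name in filenames:
        match = ISSUE_PREFIXED.match(name)
        if match is None:
            continue  # not in the issue-prefixed format; ignored in this sketch
        issue, key = match.groups()
        # yyyymmdd strings sort chronologically, so string comparison is enough here.
        if key not in latest or issue > latest[key][0]:
            latest[key] = (issue, name)
    return {key: name for key, (issue, name) in latest.items()}

files = [
    "20241009_20200506_state_deaths_7dav_cumulative_num.csv",
    "20241010_20200506_state_deaths_7dav_cumulative_num.csv",
]
result = latest_per_time_value(files)
```

Both files collapse onto the key 20200506_state_deaths_7dav_cumulative_num.csv, and only the 20241010 issue is kept as the latest version.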

cmu-delphi / covidcast-indicators

Indicator runners should output files with issue date #1907
