Open melange396 opened 1 year ago
See my novelesque Slack message about applying this to the hhs indicator, whose source data includes an issue date that can be carried through to the output files.
Note that this change is necessary but probably not sufficient to make our process resilient to the assumption that the "issue" date is now/today. Parts of the acquisition process, including the input file directory structure and the post-processing file handling behaviors, will need to change, along with analogous changes to the things in Cronicle that move these files around and trigger acquisition runs.
There are 2 ways I believe we can go about this.
Approach 1: Right now we use these naming patterns to get the right time_value, signal, and geo_type. The issue value is just the date on which the data was processed. So let's add a new filename format that includes the issue date, and use it.
This means extending the acquisition code to process CSV file names in the new format, and extending export_to_csv to allow exporting CSV files with the new name pattern. Then we move indicators to the new format one by one.
CsvImporter:
#.../source/issueyyyymmdd_timevalueyyyymmdd_geotype_signal_name.csv
PATTERN_ISSUE_DAILY=r'^.*/([^/]*)/(\d{8})_(\d{8})_(\w+?)_(\w+)\.csv$'
This pattern basically just adds the issue date to the front of the existing PATTERN_DAILY:
#.../source/timevalueyyyymmdd_geotype_signal_name.csv
PATTERN_DAILY=r'^.*/([^/]*)/(\d{8})_(\w+?)_(\w+)\.csv$'
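To check how these two patterns interact, here is a minimal sketch that runs both against a plain daily file name and an issue-prefixed one. PATTERN_DAILY_STRICT, matching_patterns, and the example paths are hypothetical names made up for illustration; they are not part of the acquisition code, and the letters-only geo_type group is just one possible way to keep the patterns disjoint.

```python
import re

# The two patterns quoted above, compiled as-is.
PATTERN_DAILY = re.compile(r'^.*/([^/]*)/(\d{8})_(\w+?)_(\w+)\.csv$')
PATTERN_ISSUE_DAILY = re.compile(r'^.*/([^/]*)/(\d{8})_(\d{8})_(\w+?)_(\w+)\.csv$')

# Hypothetical stricter daily pattern: geo_type may contain letters only,
# so an 8-digit issue/time_value pair can no longer be read as a geo_type.
PATTERN_DAILY_STRICT = re.compile(r'^.*/([^/]*)/(\d{8})_([a-zA-Z]+)_(\w+)\.csv$')


def matching_patterns(path, daily=PATTERN_DAILY):
    """Return the names of the patterns that match the given path."""
    names = []
    if daily.match(path):
        names.append("daily")
    if PATTERN_ISSUE_DAILY.match(path):
        names.append("issue_daily")
    return names


# Example paths (assumed directory layout, for illustration only).
plain = "/receiving/chng/20230710_county_smoothed_adj_outpatient_cli.csv"
prefixed = "/receiving/chng/20240907_20240901_county_smoothed_adj_outpatient_cli.csv"

print(matching_patterns(plain))                           # ['daily']
print(matching_patterns(prefixed))                        # ['daily', 'issue_daily']
print(matching_patterns(prefixed, PATTERN_DAILY_STRICT))  # ['issue_daily']
```

With the original PATTERN_DAILY, the lazy `\w+?` group happily swallows the second date as a geo_type, which is why the issue-prefixed name matches both patterns.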
A file name such as 20240907_20240901_geo_signal_name.csv would be caught by both PATTERN_DAILY and PATTERN_ISSUE_DAILY. To avoid this, change the current PATTERN_DAILY so that geo_type can only be a non-digit word, using something like [a-zA-Z]+ instead of the current \w+ grouping. All current geo_type values and their associated file names will continue to work with this change.

create_export_csv will be used by all current indicators after https://github.com/cmu-delphi/covidcast-indicators/pull/2032 is merged.

Approach 2: Instead of adding a new name pattern, just use the batch issue upload mechanism we've been using for patching so far and apply it to normal daily runs. The current output looks like this:
changehc/
    receiving/
        20230710_county_smoothed_adj_outpatient_cli.csv
        ....
With "export_dir": "./receiving", the output written to receiving could look like the following:
changehc/
    receiving/
        issue_20240906/
            chng/
                20230710_county_smoothed_adj_outpatient_cli.csv
                ...
The issue date here will get grabbed from versioned source files.
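A rough sketch of what building and reading back such a path could look like is below. The helper names issue_export_path and issue_from_path, and the ISSUE_DIR pattern, are hypothetical and only mirror the issue_yyyymmdd/source/ layout shown above; the real export and acquisition code may do this differently.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Hypothetical pattern for a directory component such as "issue_20240906".
ISSUE_DIR = re.compile(r'^issue_(\d{8})$')


def issue_export_path(export_dir, issue, source, filename):
    """Build export_dir/issue_YYYYMMDD/source/filename for a given issue date."""
    return PurePosixPath(export_dir) / f"issue_{issue:%Y%m%d}" / source / filename


def issue_from_path(path):
    """Return the issue date encoded in the path, or None if there is none."""
    for part in PurePosixPath(path).parts:
        m = ISSUE_DIR.match(part)
        if m:
            s = m.group(1)
            return date(int(s[:4]), int(s[4:6]), int(s[6:]))
    return None


out = issue_export_path(
    "./receiving",
    date(2024, 9, 6),  # in practice this would come from the versioned source file
    "chng",
    "20230710_county_smoothed_adj_outpatient_cli.csv",
)
print(out)                   # receiving/issue_20240906/chng/20230710_county_smoothed_adj_outpatient_cli.csv
print(issue_from_path(out))  # 2024-09-06
```

Because the issue date lives in the directory name, the file names themselves stay in the existing daily format.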
@melange396 Which way do you think we should move forward with?
I'm working on approach 1. It seems more comprehensive and involves less shuffling of the current directory setup in prod, which would disrupt current runs.
Blocked by the archive differ review. The archive differ saves the latest issue of each time_value. For example, assume the data for the time_value/geo/signal combination 20200506_state_deaths_7dav_cumulative_num.csv has two versions, one posted on issue 20241009 and one posted on 20241010. Right now, our archive differ would correctly identify that there are potentially two versions of the same data and process them accordingly. If we bake the issue date into the CSV file name, like 20241009_20200506_state_deaths_7dav_cumulative_num.csv, the current version of the archive differ, as implemented here, won't be able to tell that the two files 20241009_20200506_state_deaths_7dav_cumulative_num.csv and 20241010_20200506_state_deaths_7dav_cumulative_num.csv refer to the same time_value, and would just save both to the S3 bucket, which defeats the purpose of the archive differ.

If we instead do some directory manipulation like the patching output structure, it will affect how export_dir is structured, and with the way we run the archive differ now, the pipeline would just break. So either way, the archive differ needs some rewriting. And since the archive differ is going through review, let's wait until that's done.
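To make the problem concrete, here is a small sketch of the kind of grouping the archive differ would need if issue dates were baked into file names: strip the issue prefix, group by the remaining daily name, and keep only the latest issue. This is not the archive differ's actual implementation; ISSUE_PREFIXED and latest_issue_per_file are hypothetical names used only for illustration.

```python
import re

# Hypothetical pattern: an 8-digit issue prefix followed by the existing
# daily file name (time_value, letters-only geo_type, signal).
ISSUE_PREFIXED = re.compile(r'^(\d{8})_(\d{8}_[a-zA-Z]+_\w+\.csv)$')


def latest_issue_per_file(filenames):
    """Map each un-prefixed daily file name to (latest issue, full file name)."""
    latest = {}
    for name in filenames:
        m = ISSUE_PREFIXED.match(name)
        if m is None:
            continue  # not issue-prefixed; a real differ would handle these separately
        issue, base = m.groups()
        # yyyymmdd strings sort chronologically, so string comparison is enough.
        if base not in latest or issue > latest[base][0]:
            latest[base] = (issue, name)
    return latest


files = [
    "20241009_20200506_state_deaths_7dav_cumulative_num.csv",
    "20241010_20200506_state_deaths_7dav_cumulative_num.csv",
]
result = latest_issue_per_file(files)
print(result)
# {'20200506_state_deaths_7dav_cumulative_num.csv':
#  ('20241010', '20241010_20200506_state_deaths_7dav_cumulative_num.csv')}
```

Without a step like this, the two files look like unrelated objects and both end up archived.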
Indicator runners should output CSV files with issue date, wherever possible.
Most indicators (if not all of the currently active ones) output CSV files without an "issue date" saved or encoded anywhere in or around them. They assume the issue date is "today" and that the files will be ingested into the database the same day; our acquisition process also assumes an issue date of "today" by default when it reads these files. This can lead to inaccurate "issue" columns when the data finally makes it to the database, if the acquisition job(s) are broken, backed up by a long queue, or otherwise delayed.
If we export with an explicit issue date, it does not matter when the files are consumed; the "issue" should still be accurate. In fact, this can make re-importing the same CSV files multiple times an idempotent operation. It will help us when there are problems with our systems in real time (as listed above), and it will simplify things if we need to import CSV files at some later date (such as adding new data files on top of a restored database snapshot). This can also be useful when the external data source specifies an issue date explicitly.
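The idempotency argument can be illustrated with a toy in-memory "database". This is only a sketch under simplified assumptions: import_rows and the row layout are hypothetical and do not reflect the real acquisition code or database schema.

```python
from datetime import date


def import_rows(db, rows, issue=None, today=None):
    """Upsert rows into a dict keyed by (signal, geo_value, time_value, issue).

    If no explicit issue is given, fall back to "today", which mimics the
    default assumption described above.
    """
    effective_issue = issue if issue is not None else (today or date.today())
    for signal, geo_value, time_value, value in rows:
        db[(signal, geo_value, time_value, effective_issue)] = value
    return db


rows = [("deaths_7dav_cumulative_num", "pa", date(2020, 5, 6), 12.0)]

# Explicit issue: importing the same file on two different days changes nothing.
explicit = {}
import_rows(explicit, rows, issue=date(2024, 10, 9), today=date(2024, 10, 9))
import_rows(explicit, rows, issue=date(2024, 10, 9), today=date(2024, 10, 12))
print(len(explicit))  # 1

# Implicit issue: a delayed re-import creates a second, spurious issue.
implicit = {}
import_rows(implicit, rows, today=date(2024, 10, 9))
import_rows(implicit, rows, today=date(2024, 10, 12))
print(len(implicit))  # 2
```

The only difference between the two runs is whether the issue travels with the data or is inferred from the wall clock at import time.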
There is a provision in our acquisition process to use an issue date that is taken from the directory structure. The "nowcast" indicator seems to be able to produce this directory structure, but AFAICT this indicator is not being run successfully anywhere at present.