Goal
Last year's production pipeline relied on manual API pulls (every Saturday and Monday) from the DCIPHER platform. It would be ideal to set up an automated ETL pipeline that saves time-stamped, vintaged wastewater datasets for flu and COVID to Azure blob storage, which would let us properly evaluate retrospective model performance across historical dates. This is particularly important because the data contain no report-date field, and the reporting lag is not consistent across jurisdictions and wastewater treatment plants.
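As a rough sketch of the vintaging idea (assuming the `azure-storage-blob` SDK, a connection string in an environment variable, and a placeholder container name), each daily pull would be written to an immutable, date-partitioned blob path so a retrospective run can reconstruct exactly what was available on a given date:

```python
# Minimal sketch: save each day's raw DCIPHER pull as an immutable,
# date-stamped "vintage". Container name, env var, and path layout
# are assumptions for illustration.
import os
from datetime import date

from azure.storage.blob import BlobServiceClient


def save_vintage(raw_bytes: bytes, disease: str, vintage: date) -> str:
    """Upload one day's raw pull to a date-partitioned blob path."""
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"]  # assumed env var
    )
    # e.g. bronze/covid/vintage=2024-10-07/dcipher_raw.json
    blob_path = f"bronze/{disease}/vintage={vintage.isoformat()}/dcipher_raw.json"
    blob = service.get_blob_client(container="wastewater", blob=blob_path)
    blob.upload_blob(raw_bytes, overwrite=False)  # never clobber an existing vintage
    return blob_path
```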
Requirements
- [ ] set up a daily cron job to pull raw data from DCIPHER (I believe this would be stored in "bronze" per @natemcintosh's and NNH's ETL frameworks; see the pull sketch after this list)
- [ ] apply a minimal set of transformations/cleaning to the data (I believe this is the bronze -> silver step; see the transform sketch after this list)
- [ ] prep the data for input into the MSR/wwinference models (silver -> gold, also covered in the transform sketch below)
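For the daily pull, here's a minimal sketch of a script a cron job could invoke (e.g. `0 6 * * *`); the DCIPHER endpoint, query parameters, and the module path for `save_vintage` (from the storage sketch above) are all assumptions:

```python
# Hypothetical daily entry point for the cron job. Endpoint URL,
# params, and module layout are placeholders, not the real DCIPHER API.
import datetime

import requests

from ww_etl.storage import save_vintage  # hypothetical home of the sketch above

DCIPHER_URL = "https://dcipher.example.invalid/api/wastewater"  # placeholder


def pull_dcipher(disease: str) -> bytes:
    """Fetch one disease's raw payload; auth and retries omitted for brevity."""
    resp = requests.get(DCIPHER_URL, params={"disease": disease}, timeout=60)
    resp.raise_for_status()
    return resp.content


if __name__ == "__main__":
    today = datetime.date.today()
    for disease in ("covid", "flu"):
        save_vintage(pull_dcipher(disease), disease, today)
```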
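And a sketch of the two transform steps, assuming pandas and placeholder column names (the real schema comes from the DCIPHER payload, and the gold-layer columns depend on what MSR/wwinference actually require):

```python
# Minimal sketch of bronze -> silver -> gold. All column names are
# assumptions for illustration, not the actual DCIPHER schema.
import pandas as pd


def bronze_to_silver(bronze: pd.DataFrame, vintage: str) -> pd.DataFrame:
    """Light cleaning: parse dates, dedup, and stamp the vintage date,
    since the raw data has no report-date field."""
    silver = bronze.copy()
    silver["sample_collect_date"] = pd.to_datetime(silver["sample_collect_date"])
    silver = silver.drop_duplicates(subset=["wwtp_id", "sample_collect_date"])
    # the vintage stands in for the missing report date (assumption)
    silver["report_date"] = pd.Timestamp(vintage)
    return silver


def silver_to_gold(silver: pd.DataFrame) -> pd.DataFrame:
    """Reshape to the inputs the MSR/wwinference models expect
    (placeholder column selection; actual requirements TBD)."""
    cols = ["wwtp_id", "jurisdiction", "sample_collect_date",
            "report_date", "pcr_target_avg_conc"]
    return silver[cols].sort_values(["wwtp_id", "sample_collect_date"])
```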
@amondal2 @kgostic @dylanhmorris @damonbayer perhaps we could kick off with a 30-minute meeting to plan out how we'd want to divide up the tasks.