CDCgov / pyrenew-hew

Models and infrastructure for forecasting COVID-19 and Flu hospitalizations using wastewater data with PyRenew
Apache License 2.0

Production Pipeline Diagram #32

Open damonbayer opened 2 weeks ago

damonbayer commented 2 weeks ago

This will always contain the most up-to-date draft of the pipeline.

```mermaid
flowchart TD
    prep_forecast["Prepare Forecast Data<br/>(Python using Polars)"]
    prep_retro["Prepare Retro Data<br/>(Python using Polars)<br/>(currently part of the Prepare Forecast Data script)"]
    report["Report Generator<br/>(does different things depending on inputs)<br/>(Quarto)"]
    joint_hub_output["Joint hub submission<br/>(csv)"]
    report_output("[Lightweight  or Detailed] [Retro or Forecast] Report<br/>(HTML)")

    combine_hub_output("Combine hub submissions<br/>(Python using Polars)")

    subgraph "For each location, in parallel"
        fit_model["Fit Model<br>(Python using PyRenew)"]
        forecast["Forecast, prior predictive, and posterior predictive <br>(Python using PyRenew)"]
        tidy["Tidy<br>(Python using forecasttools)"]
        interest_figs["Summarize quantities of interest<br/>(R using tidybayes)"]
        score["score<br/>(R using scoringutils)"]
        hub["Format for hub submission<br/>(Python using forecasttools)"]
        diagnostics["ArviZ Diagnostics<br/>(Python using ArviZ)"]

        data_fit("data for fit<br/>(json)")
        posterior_draws("Posterior MCMC draws<br/>(pickle)")
        all_mcmc_draws_ncdf("All MCMC draws with date coordinates<br/>(netCDF)")
        all_mcmc_draws_tab("All MCMC draws<br/>(parquet)")
        hub_output("Hub submission<br/>(csv)")
        retro_data("Retro data<br/>(tsv)")
        interest_figs_output("Tables and figures for quantities<br/>(parquet/tsv and svg/png)")
        diagnostics_output("Diagnostic tables and figures<br/>(parquet/tsv and svg/png)")
        scored("scored dataset<br/>(Parquet)")
    end

    prep_retro --> retro_data
    prep_forecast --> data_fit
    data_fit --> fit_model
    fit_model --> posterior_draws
    posterior_draws --> forecast
    forecast --> all_mcmc_draws_ncdf
    all_mcmc_draws_ncdf --> tidy
    tidy --> all_mcmc_draws_tab

    hub --> hub_output
    retro_data --> report    
    all_mcmc_draws_tab --> interest_figs
    all_mcmc_draws_tab --> score
    score --> scored
    all_mcmc_draws_tab --> hub
    all_mcmc_draws_ncdf --> diagnostics
    diagnostics --> diagnostics_output
    interest_figs --> interest_figs_output
    retro_data --> score

    diagnostics_output --> report
    interest_figs_output --> report
    scored --> report
    report --> report_output

    hub_output --> combine_hub_output
    combine_hub_output --> joint_hub_output
    %% Styling
    classDef script fill:;
    classDef file fill:#0099ff;

    class data_fit,posterior_draws,all_mcmc_draws_ncdf,all_mcmc_draws_tab,retro_data,interest_figs_output,diagnostics_output,scored,report_output,hub_output,joint_hub_output file
    class prep_forecast,prep_retro,fit_model,forecast,tidy,interest_figs,score,hub,diagnostics,report,combine_hub_output script
```
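For concreteness, here is a minimal sketch of what the "Combine hub submissions" step could look like with Polars. The directory layout, file names, and the assumption that per-location submissions can simply be stacked are placeholders, not the repo's actual design.

```python
# Hypothetical sketch of the "Combine hub submissions" step: stack per-location
# hub submission CSVs into one joint submission. Paths and layout are assumed.
from pathlib import Path

import polars as pl


def combine_hub_submissions(submission_dir: str, output_path: str) -> None:
    """Concatenate per-location hub submission CSVs into a single joint CSV."""
    files = sorted(Path(submission_dir).glob("*.csv"))
    joint = pl.concat([pl.read_csv(f) for f in files], how="vertical")
    joint.write_csv(output_path)


# e.g. combine_hub_submissions("output/hub", "output/joint_hub_submission.csv")
```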
damonbayer commented 2 weeks ago

Here is what I am thinking for the necessary scripts (gray rectangles) and the outputs (blue rounded rectangles). Open to any feedback on this design. Scoring is probably not really part of the production pipeline, but I've included it anyway.

Some questions I can predict and answer:

What file format will we use for the tabular data? csv or parquet.

Why is there a separate output of "all MCMC draws" as a tabular file? I have not found the R packages for working with netCDF data to be very friendly. I have not looked into using zarr yet.
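
To illustrate what that tidy step might look like (assuming the netCDF is an ArviZ `InferenceData` file and that the draws of interest live in the posterior group; the actual forecasttools implementation may differ):

```python
# Hypothetical sketch of the "Tidy" step: netCDF (ArviZ InferenceData) -> parquet.
# Group and column handling are illustrative only.
import arviz as az
import polars as pl


def tidy_draws_to_parquet(netcdf_path: str, parquet_path: str) -> None:
    idata = az.from_netcdf(netcdf_path)
    # Flatten to long format: one row per (chain, draw, coordinate) combination.
    posterior_df = idata.posterior.to_dataframe().reset_index()
    pl.from_pandas(posterior_df).write_parquet(parquet_path)
```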

Why not directly use the Posterior MCMC draws for ArviZ diagnostics? We could, but I think it is probably more user-friendly to import a netCDF, and we don't want to create an additional netCDF after fitting the model.
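
As a sketch of the diagnostics step under that design (the output file names and the specific tables and plots are placeholders):

```python
# Hypothetical sketch of the "ArviZ Diagnostics" step, reading the same netCDF
# used by the tidy step. Output names are placeholders.
import arviz as az
import matplotlib.pyplot as plt


def run_diagnostics(netcdf_path: str, out_prefix: str) -> None:
    idata = az.from_netcdf(netcdf_path)
    # Tabular diagnostics: ESS, r-hat, and MCSE for every parameter.
    az.summary(idata).to_csv(f"{out_prefix}_summary.tsv", sep="\t")
    # Figure diagnostics: trace plots saved as SVG.
    az.plot_trace(idata)
    plt.gcf().savefig(f"{out_prefix}_trace.svg")
```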

Why use R at all? Since we have to use R to use scoringutils, we might as well take advantage of the CFAEpiNow2Pipeline and ggdist packages.

@dylanhmorris @AFg6K7h4fhy2 open to comments and questions

damonbayer commented 2 weeks ago

Current version has received a LGTM from @dylanhmorris in a Teams discussion.

kaitejohnson commented 1 week ago

@damonbayer This looks really good! Just a few questions.

damonbayer commented 1 week ago

Thanks @kaitejohnson. I confess I haven't thought much about these questions until you asked them.

kaitejohnson commented 1 week ago

> This will all be done in Azure, which, I think, is not really related to `make`. Perhaps you mean something more general, or I am misunderstanding how things work.

I guess what I mean is: is there any plan to use a pipelining tool of some sort that will cache different steps? So that you can make adjustments (e.g. excluding certain data points) and rerun the pipeline with only the downstream pieces getting updated. (This is what we used targets for last year, and despite it not playing nicely with Azure, it was really convenient for automating pipeline outputs.)

Like the idea of an HTML file generated in Quarto -- I found it really helpful to have a few things to review in one place as a post-model-run, pre-send-off step to spot-check each location.

> Hadn't thought about this yet. It would be good to lean on your experience from last season.

Per usual, I took NNH's lead on this and used their thresholds for rhat, divergences, E-BFMI, etc., which are now defaults in the wastewater package's model flags: https://github.com/CDCgov/ww-inference-model/blob/9dd766b8da3cd661f7daeb5f6f6127786e4db5ec/R/model_diagnostics.R#L51. I don't think you need these exactly, but having some flags is helpful to know where to look in real time.
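
For reference, a rough Python analogue of that kind of flagging (the thresholds here are illustrative placeholders, not the NNH defaults from the linked R function):

```python
# Hypothetical sketch of per-fit diagnostic flags along the lines of the
# ww-inference-model model_diagnostics; thresholds are placeholders.
import arviz as az
import numpy as np


def flag_fit(idata, max_rhat=1.05, max_divergences=10, min_ebfmi=0.2) -> dict:
    return {
        "high_rhat": float(az.rhat(idata).to_array().max()) > max_rhat,
        "too_many_divergences": int(idata.sample_stats["diverging"].sum()) > max_divergences,
        "low_ebfmi": float(np.min(az.bfmi(idata))) < min_ebfmi,
    }
```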

damonbayer commented 1 week ago

> I guess what I mean is: is there any plan to use a pipelining tool of some sort that will cache different steps? So that you can make adjustments (e.g. excluding certain data points) and rerun the pipeline with only the downstream pieces getting updated. (This is what we used targets for last year, and despite it not playing nicely with Azure, it was really convenient for automating pipeline outputs.)

Seems like a good question for @dylanhmorris. I think adding functionality to kick off one script from another is trivial (e.g. if you have already fit the models but want to change the forecast horizon, you could kick off the pipeline starting at the forecasting step). Maybe there are more sophisticated concepts in Azure that could make this easy to implement.
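
A minimal sketch of that idea, assuming each stage is wrapped in its own function (the step names and CLI flag are made up for illustration):

```python
# Hypothetical sketch: resume the pipeline at an arbitrary step.
# Step names and dispatch are placeholders, not the repo's actual entry points.
import argparse

STEPS = ["prep", "fit", "forecast", "tidy", "postprocess", "report"]


def run_pipeline(start_at: str) -> None:
    for step in STEPS[STEPS.index(start_at):]:
        print(f"Running step: {step}")
        # Each step would dispatch to its real script or function here.


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--start-at", choices=STEPS, default="prep")
    args = parser.parse_args()
    run_pipeline(args.start_at)
```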

damonbayer commented 1 week ago

@kaitejohnson I have updated the diagram with the Quarto reports idea.

damonbayer commented 1 week ago

@dylanhmorris I have updated the diagram based on our conversation about a step to tidy the ArviZ data into a tabular format.

SamuelBrand1 commented 13 hours ago

Hey @damonbayer, I think the quantities of interest box should flow into the scoring box rather than the MCMC draws, since that box includes our observables from the retro data.

To be fair, I can imagine also scoring all parameters (e.g. when fitting on generated data).

SamuelBrand1 commented 11 hours ago

From the f2f discussion, it was pointed out that "quantities of interest" doesn't mean "generated quantities"; it means summary statistics.

damonbayer commented 11 hours ago

@SamuelBrand1 @dylanhmorris @AFg6K7h4fhy2 I have updated the diagram based on our f2f discussion. Please thumbs up this comment if it appears accurate or comment if it does not.