CDCgov / cfa-epinow2-pipeline

https://cdcgov.github.io/cfa-epinow2-pipeline/
Apache License 2.0

Config generation workflow #68

Open · amondal2 opened this issue 2 days ago

amondal2 commented 2 days ago

Currently, job configurations are generated fairly manually, particularly when jobs need to be resubmitted due to state exclusions or model prior changes. While we cannot automate the entire process, since manual review is still required, we can streamline it and separate concerns (configuration generation, storage, job provisioning, validation, etc.). To that end, I propose several components:

  1. A scheduled job that generates the "default" (i.e., the full set of geography-date-pathogen combinations with the default model params) configurations and writes them to a storage location. This could be in the form of a scheduled Azure Function.
  2. A microservice which validates configuration files against a schema; a sketch of what that validation could look like is shown after this list. [Note: a v1 of this microservice has been deployed.]
  3. A configuration update/regeneration command line interface (CLI): a tool that can be used to view and update configuration objects. It will interact with both the validation service and the storage location to update configurations following initial model runs (e.g., for state exclusions, modeling priors, etc.). We would update only the minimal set of configurations required for the amended model run.
  4. A centralized storage container which contains the up-to-date (either default or user-amended) configuration files.
  5. A GH action which reads the configurations from the storage container and kicks off the modeling pipeline. This should be agnostic to how the configs were generated; it should just read the latest files from the storage container and use those to provision jobs downstream. A crude diagram of the workflow is attached (see the schematic image).
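
To make components 1, 2, and 4 a bit more concrete, here is a minimal sketch of one default configuration object and the schema-validation step. The field names (`job_id`, `geography`, `pathogen`, `report_date`, `priors`, `exclusions`) and the use of the `jsonschema` package are illustrative assumptions, not the schema the deployed validation service actually enforces:

```python
import json

from jsonschema import ValidationError, validate

# Hypothetical schema for a single job configuration; the real schema used by
# the deployed validation microservice may differ.
CONFIG_SCHEMA = {
    "type": "object",
    "properties": {
        "job_id": {"type": "string"},
        "geography": {"type": "string"},
        "pathogen": {"type": "string", "enum": ["COVID-19", "Influenza"]},
        "report_date": {"type": "string", "format": "date"},
        "priors": {"type": "object"},
        "exclusions": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["job_id", "geography", "pathogen", "report_date"],
}

# Example "default" configuration for one geography-date-pathogen combination.
default_config = {
    "job_id": "2024-10-01-default",
    "geography": "WA",
    "pathogen": "COVID-19",
    "report_date": "2024-10-01",
    "priors": {},       # default model priors
    "exclusions": [],   # no exclusions in the default run
}

try:
    validate(instance=default_config, schema=CONFIG_SCHEMA)
    print(json.dumps(default_config, indent=2))
except ValidationError as err:
    print(f"Config failed validation: {err.message}")
```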

My main open question is how to handle versioning and timestamps. For example, if we generate a default set of configs for a model run, then update them via the CLI, do we overwrite the original configurations or create a new timestamped version?

amondal2 commented 2 days ago

@zsusswein @natemcintosh @kgostic would appreciate any feedback here!

natemcintosh commented 2 days ago

> if we generate a default set of configs for a model run, then update them via the CLI, do we overwrite the original configurations or create a new timestamped version?

I think we would ideally create a new timestamped version. Perhaps a step up from new timestamped versions would be including a run ID that can be used to more easily link the config to the model output. In the past, we've had success using UUID4 for run IDs. In a perfect world, we could use UUID7, which includes a timestamp and hence is time-sortable (best of both worlds in one: a run ID and a timestamp). That said, Python does not include a way to generate UUID7 in the uuid standard library, so we'd need a library for generating them.
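
For illustration, a minimal sketch of the two run-ID options, assuming the third-party `uuid6` package (one of several libraries that implements UUIDv7), since the standard library only covers UUID4 here:

```python
import uuid

from uuid6 import uuid7  # third-party: `pip install uuid6` (one option among several)

# UUID4: random, available in the standard library, but carries no ordering information.
run_id_v4 = uuid.uuid4()

# UUID7: the leading bits encode a Unix timestamp, so IDs sort chronologically.
run_ids = [uuid7() for _ in range(3)]

print("UUID4 run ID:", run_id_v4)
print("Time-sortable UUID7 run IDs:", sorted(str(u) for u in run_ids))
```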

Otherwise, I like the flow chart and how everything could work. Perhaps the "generate default configurations" step could be done inside a GitHub Action for easier accessibility? That said, it would be really useful to have a working example of an Azure Function that we could use as a reference for other projects.

amondal2 commented 2 days ago

Thanks @natemcintosh! That's helpful. I'll look into UUID7 generation when we start to build that component. I'm open to suggestions on the default-generation piece; Azure Functions made sense to me initially because they have built-in scheduling, which could remove one step of overhead (i.e., we could set the function to run overnight, before the initial modeling runs are provisioned).
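
For reference, a timer-triggered Azure Function in the Python v2 programming model is fairly small. This is only a sketch: the schedule, function name, and `generate_and_write_configs` helper are placeholders rather than an agreed design.

```python
import logging

import azure.functions as func

app = func.FunctionApp()


def generate_and_write_configs() -> None:
    """Placeholder for the real logic: build the full geography-date-pathogen
    config set with default model params and upload it to the storage container."""
    logging.info("Generating default configurations")


# NCRONTAB schedule: 02:00 UTC daily, i.e. overnight before the initial
# modeling runs are provisioned. The exact schedule is a placeholder.
@app.timer_trigger(schedule="0 0 2 * * *", arg_name="timer", run_on_startup=False)
def scheduled_config_generation(timer: func.TimerRequest) -> None:
    generate_and_write_configs()
```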

zsusswein commented 2 days ago

Thanks for this Agastya! Thinking out loud a bit:

> This could be in the form of a scheduled Azure Function

What do you see as the benefit of doing this in a standalone Function? We've been talking about scheduling this pipeline through GHA so far, so what would Functions buy us over that? And how would we be able to share code with the local CLI piece?

> A centralized storage container which contains the up-to-date (either default or user-amended) configuration files

What do you think this looks like? Do we delete and re-write configs for every job-task into the same dedicated container every time we kick off a job? Are we able to use some kind of unique path structure and return that path back?

> A GH action which reads the configurations from the storage container and kicks off the modeling pipeline.

This is one of the central bits I'm not clear on. How does the Action know where these configs are? Are we able to hardcode a path without eventual consistency biting us? (We've had a bad time....)

My other big question is how we handle needing to run less than everything. Do we tell the CLI to only write the 5 configs we need? I think it's fine (and good) if we say pieces of this question are handled upstream and out of scope here, but I want to identify what those pieces are.

> For example, if we generate a default set of configs for a model run, then update them via the CLI, do we overwrite the original configurations or create a new timestamped version?

I'm also not totally following this question. We're planning on storing the configs in the model run metadata. Does that answer the question? Or are you asking about something else?

amondal2 commented 2 days ago

Thanks @zsusswein!

Wrt GHA vs. Functions, I don't think it makes a huge difference. Either way, I don't think we'd want to share code with the CLI piece (the generator would just be a separate repo that generates configs and writes them to storage on a schedule). I lean toward Functions because we don't have to mess with doing a bunch of stuff in YAML, but if folks are all-in on GHA then that's fine with me too. We could also have the GHA invoke a Function on a schedule so that the actual config-generating script is its own thing. Happy to hear any suggestions on this.

We'll definitely have to be careful with paths and how they are accessed. I think we could do something like what @natemcintosh mentioned above and have a UUID-like value that the kickoff script reads (the latest version) and uses to schedule jobs. How is this done currently? I would definitely prefer not to re-write or overwrite anything. Maybe we could use a filepath structure like <job_id>-default/ and <job_id>-amended/ if there are any updates that come from the CLI; a rough sketch of what the write side could look like is below.
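
A minimal sketch under those assumptions, using `azure-storage-blob`; the container name, connection-string environment variable, job-ID format, and file naming are all placeholders:

```python
import json
import os

from azure.storage.blob import BlobServiceClient

# Placeholder connection string and container name.
service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
configs = service.get_container_client("configs")


def write_config(job_id: str, state: str, disease: str, config: dict,
                 amended: bool = False) -> str:
    """Write one config under <job_id>-default/ or <job_id>-amended/ and return
    the blob path, so the kickoff GH Action can be handed an exact location."""
    prefix = f"{job_id}-{'amended' if amended else 'default'}"
    blob_path = f"{prefix}/{state}_{disease}.json"
    # overwrite=False: fail loudly rather than silently clobber an existing config.
    configs.upload_blob(name=blob_path, data=json.dumps(config), overwrite=False)
    return blob_path


# Example: the CLI writes an amended config for one state after an exclusion.
# write_config("2024-10-01", "WA", "COVID-19", {"exclusions": ["2024-09-15"]}, amended=True)
```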

natemcintosh commented 2 days ago

At the moment, I believe we just name the config files by their state and disease pair, and everything gets overwritten? Or it might include the timestamp of when the file was created. I need to go check that.

I like the idea of paths like <job-id>-default/ and <job-id>-amended/. Though what do we think of scheduled instead of default?

zsusswein commented 2 days ago

I don't think it's technically unworkable (or even that hard), but I do think we need to think carefully about how to pass the config paths around. I really want to avoid the disaster with caching and blob storage that we've had before.

I think it would be good to spell out the mechanics of how we're gluing the pieces together here in more detail. I like the general approach, but I'm still not super clear on what exists by default and what's getting created each time.

amondal2 commented 1 day ago

Sounds good, thanks both for the feedback. Will set up time to discuss!

amondal2 commented 1 day ago

I think the CLI piece of this needs a bit more thought to make sure it aligns with downstream work, but per discussions with @natemcintosh & @zsusswein, the first piece will involve:

@kgostic -- if this looks okay, I can go ahead and request the repo for this initial piece of work.

kgostic commented 21 hours ago

Thanks for planning this out, guys. I think this broadly looks OK. Your final summary, @amondal2, doesn't mention the CLI interface, which will be essential if we need to edit the default config (e.g., after reviewing the data, or to kick off a test run). But as long as that's still in scope, I'm mostly happy.

I think it's a good idea to have versioned records of the configs we've used in the past, but since the modeling pipeline writes a copy of the config to the output metadata, I think this is not the most critical part of this infrastructure. We might even be able to get away with an overwrite-based approach.

Ping me if you need anything else and thank you!

amondal2 commented 20 hours ago

Thanks @kgostic! The CLI is definitely in scope; it'll just require a bit more thought. I'll request creation of both repos in any case.

amondal2 commented 16 hours ago

Repos created: https://github.com/CDCgov/cfa-config-generator and https://github.com/CDCgov/cfa-config-management-cli