CDCgov / cfa-epinow2-pipeline

https://cdcgov.github.io/cfa-epinow2-pipeline/
Apache License 2.0
10 stars 2 forks source link

Conditionally create the pool if non-existing #67

Closed gvegayon closed 1 month ago

gvegayon commented 1 month ago

This pull request includes several enhancements and modifications to the GitHub Actions workflow file .github/workflows/1_pre-Test-Model-Image-Build.yaml. The changes focus on improving the build process, adding commit message handling, and enhancing the Azure Batch pool creation logic.

Azure Batch pool creation:

[!IMPORTANT] Trying to pre-clean the job may make the workflow fail b/c the pool takes longer to be deleted than the time between submitting the deletion instruction and trying to build it. Pools are deleted only at the end step (if found), so deletion cannot happen in the first commit to the branch.

Enhancements to build process:

Commit message handling:

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Additional details and impacted files

:loudspeaker: Thoughts on this report? Let us know!

gvegayon commented 1 month ago

@zsusswein and @natemcintosh, this should fix the errors you are getting in other PRs (or reduce them!)

zsusswein commented 1 month ago

Will get to this today! Got caught up in production yesterday

jkislin commented 1 month ago

@gvegayon this is tremendous. Thanks again! I don't have time to do a proper review until next week (I'd want to test and watch Azure Batch Explorer and Github Actions a few times, which I can do), but if all checks are passing and Zach and Nate think it looks good, I'll defer to them.

natemcintosh commented 1 month ago

This is going to make future scheduled work with Azure Batch so much simpler!

Looking beyond the scope of this PR, maybe we could generalize some of these steps. It quite likely that we'll need to do very similar things in the future for other repos, which would probably require a lot of very careful copy and paste. To reduce worry about getting something wrong in that process, maybe we could make a reusable Github action for doing these steps? Something like

- name: Refresh pool
  uses: CDCGov/refresh-pool@v1
  with: |
    arg1: ...
    arg2: ... 

That said, I've never created an action before, and don't know what the process might look like.

zsusswein commented 1 month ago

Looking beyond the scope of this PR, maybe we could generalize some of these steps.

I like this idea, but think it should be beyond the scope of this PR. Nate, would you mind making this an issue instead?

EDIT: #72

zsusswein commented 1 month ago

I also want to flag for @natemcintosh that we should think about how to handle the case of the pool running a stale version of the tagged image. As in:

  1. We push branch feat-xyz
  2. Actions creates image cfa-epinow2-pipeline:feat-xyz
  3. Actions creates pool cfa-epinow2-pipeline:feat-xyz (which in the future will be linked to the image from (2) but is not yet pending #59)
  4. We make some commits and push
  5. Actions updates image build and pushes new layers, pointing image with tag cfa-epinow2-pipeline:feat-xyz to the new layers
  6. We don't tear down the pool. On running a job, does the node in the pool from (3) spin up container layers from (2) or from (5)? If we want (5) do we need to tear down and rebuild?

I think this is unanswerable in our current setup (pending #59). So I'm mainly flagging this is a weak spot for us to keep an eye on.

natemcintosh commented 1 month ago

Good question. Once things get merged, I'd say we should try it out by making some change that we can easily check the output of.

gvegayon commented 1 month ago

@zsusswein, here are some answers to your questions:

  • How would this play with Clean up on PR close #62? I still want to leave that functionality for a separate PR but would it be as simple as adding an additional or clause to the pool delete condition? If it's harder, are there any changes that can be made here to make implementing it simpler?

I would add that as a separate workflow. I can deal with #62 after this is merged.

  • How easy do you think it would be to add polling for pool deletion completion via az batch pool show? I think we'd want the runner to hang until deletion is successful so that the runner doesn't report success unless the pool is actually deleted. Feel free to shunt this to a new issue.

I think this could be done, but it sounds like a separate action for CDCgov/cfa-actions! Will add the issue there. Mostly because it will involve writing a program that loops checking whether the pool deletion was complete, and that could be something useful for others.

  • Would it be possible to have a bot leave a comment on the PR with the current state of the linked pool?

Not with the current, but with the last run. It could be something similar to what's going on in https://github.com/CDCgov/cfa-actions/pull/1. I suggest creating a separate issue for this.

  • Can you add a description of your new functionality and how to trigger pool deletion/recreation to the readme?

Sure. I'll add that to the readme (or a readme).

gvegayon commented 1 month ago

I was just adding a mermaid diagram, @zsusswein. Here it is:

flowchart LR

  START((Start))---DEPS_CACHED

  DEPS_CACHED{Deps<br>cached?}---|No|DEPS
  DEPS_CACHED---|Yes|IMG

  subgraph DEPS[Job01-build_image_dependencies]
    direction TB
    Dockerfile-dependencies---|Generates|DEPS_IMAGE[Dependencies<br>Image]
  end

  DEPS---IMG

  subgraph IMG[_01_build-model-image]
    direction TB
    Dockerfile---|Generates|PKG_IMG[Package<br>Image]
  end

  IMG---POOL

  subgraph POOL[_02_create-batch-pool-and-submit-jobs]
    direction TB

    POOL_EXISTS{Is the pool<br>up?}
    POOL_EXISTS---|No|CREATE_POOL[Create the pool]
    POOL_EXISTS---|Yes|DELETE_POOL{Commit includes<br>'delete pool'}
    DELETE_POOL---END_POOL((End))
    CREATE_POOL---END_POOL

  end

If you like it, I can PR it

zsusswein commented 1 month ago

Love it! Please go for the PR.