Ensuring that the reviewed ChangeSet is what's deployed

pbrisbin commented 1 year ago

Currently, we use Stackctl in our gitops repository such that the single ci.yml workflow runs for both PR and push (to main) events. On PR events, it runs stackctl changes and comments the changeset details on the PR for review. When that PR merges, the push event runs stackctl deploy, which creates and deploys a new changeset.

With this flow, it's possible for there to be drift introduced between review/approve and merge/deploy that results in changes deploying without review. This gap can be mitigated or exacerbated by various behaviors, and we do our best, but it always conceptually exists.

Beginnings of a fix

I've got some local work which makes it possible (and quite ergonomic) to address this at the CLI level:

# On PR event you create/review and record the id somehow
changeset=$(stackctl changes --format=id)
stackctl changes --format=md --change-set-id "$changeset"

# On push you can load and deploy that exact changeset
stackctl deploy --change-set-id "$changeset" --no-confirm

But I'm now wondering if a) we can make this work on GHA and b) if we even want it.

These options and behaviors could be useful in general, but I'm not sure if it's worth the complexity to merge it depending on those questions, so I'm opening this Issue to lay it out for discussion and the historical record.

GHA Blockers

I can't exactly determine how to store the $changeset from the PR workflow, such that we can deploy it from the (eventual) push event. The latter has no knowledge of the former. We can't, in the push event, say "what PR merged to cause this push?" If we could, we could locate that PR's workflow, and its artifacts, and load the changeset that was reviewed for deploy. Precarious, but feasible.

The other option is to use the PR merged event for the deploy workflow. I don't know in what context that workflow runs (the PR or the push), but we could explicitly check out main and then Do The Thing. Even more precarious, but feasible?

Maybe it's fine as-is?

However, the delay between review-time and deploy-time will still be an issue when we explicitly deploy reviewed changesets. What if it's "stale"? What if another PR has created a newer changeset and then merged and deployed? What kind of failure modes do we introduce when we ask CloudFormation to deploy changesets out of order? Will the diffs commute? Will it error in a confusing way? Will it through our infra into a bad state?

The default behavior is to clean up any existing changesets when deploying one, so that will happen and we may get changeset-not-found errors if another deploy cleaned up ours, but only if the two are operating on the same stack. Is that a Good Thing or more confusing due to intermittence? I suppose we rely on the repository not being too busy now, so is this version of it needing to not be too busy an improvement?

Perhaps, producing fresh changesets and deploying them on merge, even if it risks un-reviewed changes deploying, is the least of all evils. I wonder how terraform plan (at PR time) vs apply (at push time) works. Is it effectively the same in that it makes a plan at PR time then replans and applies it at push time? If so, I'd feel alright with what we're doing as it's equivalent. If not, perhaps the terraform GitHub actions can show us how to store a plan at PR time to execute later and/or if there's any "staleness" checking.

pbrisbin commented 1 year ago

OK, so doing this with terraform works similarly...

You can see in the PR workflow that it uses create-run with plan_only to get a plan and present it for review. Then in the push workflow it uses create-run again, which creates a fresh plan, and applies it.

So there's precedent for our current behavior.

pbrisbin commented 3 months ago

I'm satisfied that current behavior is as good as it needs to be. If terraform makes this concession, we might as well.

freckle / stackctl