michaelkirk closed this 11 months ago
Hey! I think it worked... maybe? I'm gonna try to push up a "fix" commit and see if it will merge.
IT WORKED WOW.
So here's what I think is going on:
We have a lot of different job configurations. To be usefully distinguishable, the name of each job includes info about its configuration: its rust version, the proj version, and potentially multiple features, e.g. `proj ubuntu (georust/proj-ci:proj-9.2.1-rust-1.63, --features "network bundled_proj")`.
The bors.toml needed to list each one of these names. Similarly, GH status checks need the name of every check. But there are a lot of names, and they can change (proj version, rust version), so recording them all and keeping them up to date wouldn't be tenable. To make this schema tenable, we added an aggregate job, `ci-result`, which can depend on the matrix jobs by their job *key* rather than their *name*, which meant we could avoid the combinatorial name explosion of the matrix parameters.
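For illustration, the combinatorial naming comes from matrix jobs along these lines (a hypothetical sketch; the actual matrix values and step contents in the workflow differ):

```yaml
proj-ubuntu:
  # Hypothetical sketch: each matrix combination produces a distinctly named
  # check, e.g. `proj ubuntu (georust/proj-ci:proj-9.2.1-rust-1.63, --features "network bundled_proj")`.
  name: proj ubuntu (${{ matrix.container }}, ${{ matrix.features }})
  runs-on: ubuntu-latest
  container: ${{ matrix.container }}
  strategy:
    matrix:
      container:
        - "georust/proj-ci:proj-9.2.1-rust-1.63"
      features:
        - --features "network bundled_proj"
  steps:
    - run: cargo test ${{ matrix.features }}
```

Every new rust or proj version multiplies the set of check names, which is exactly why listing them all in bors.toml or in required status checks doesn't scale.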
```yaml
ci-result:
  name: ci result
  runs-on: ubuntu-latest
  needs:
    - proj-ubuntu     # NOTE: Only one entry for all of proj-ubuntu.
    - proj-sys-ubuntu # We don't need an entry per feature-set or rust version.
    - proj-macos
    - proj-sys-macos
  steps:
    - name: Mark the job as a success
      if: success()
      run: exit 0
    - name: Mark the job as a failure
      if: "!success()"
      run: exit 1
```
It's the `needs` key which makes the aggregate `ci-result` job depend on the other jobs. HOWEVER, I never really paid much attention to the fact that the "Mark the job as a failure" step never ran. The entire job was being skipped if one of its `needs` failed, and "skipped" is a different status than "failed".
Since bors (apparently) considers a skip a failure, this worked as expected. HOWEVER, GH merge queues apparently do not consider "skipped" a failure.
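The mechanic, as I understand it: a job with no job-level `if:` carries an implicit `if: success()`, and when a job is skipped none of its steps run, so step-level `if:` conditions never get a chance to fire. A minimal sketch of the old setup's trap:

```yaml
ci-result:
  runs-on: ubuntu-latest
  needs: [proj-ubuntu]
  # Implicit when no job-level `if:` is given:
  #   if: ${{ success() }}
  # So if proj-ubuntu fails, this whole job is *skipped*, and the step
  # below (including its `if: "!success()"` check) is never evaluated.
  steps:
    - name: Mark the job as a failure
      if: "!success()"
      run: exit 1
```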
So in the previous experience, where I was seeing failed builds merge, what was happening was:

1. A sub-job fails.
2. The `ci-result` aggregate job, which depends on that sub-job, is skipped.
3. The merge queue doesn't consider the skipped `ci-result` job as failed, so it goes ahead and merges.

I'm not even 100% sure I'm right about this, but it's my best guess at this point. And I guess in hindsight, I'm not that surprised. The aggregate job thing is kind of uncommon, so it makes sense that we might be hitting some edge case that other people haven't hit often enough to complain about.
Edit to clarify the fix:
There are now two aggregate jobs with the same *name*: one which runs only on failure and one which runs only on success. Whichever one doesn't apply will be skipped.
```yaml
ci-success:
  name: ci result
  runs-on: ubuntu-latest
  if: ${{ success() }}
  needs:
    - proj-ubuntu
    - proj-sys-ubuntu
    - proj-macos
    - proj-sys-macos
  steps:
    - name: Mark the job as a success
      run: exit 0
ci-failure:
  name: ci result
  runs-on: ubuntu-latest
  if: ${{ failure() }}
  needs:
    - proj-ubuntu
    - proj-sys-ubuntu
    - proj-macos
    - proj-sys-macos
  steps:
    - name: Mark the job as a failure
      run: exit 1
```
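As an aside, another pattern I've seen for this (a sketch only, not what this PR does) is a single aggregate job with `if: always()` that inspects the `needs` context itself, so it always runs and reports an explicit pass/fail:

```yaml
ci-result:
  name: ci result
  runs-on: ubuntu-latest
  if: ${{ always() }}  # run even when a needed job failed or was cancelled
  needs:
    - proj-ubuntu
    - proj-sys-ubuntu
    - proj-macos
    - proj-sys-macos
  steps:
    - name: Fail if any needed job did not succeed
      if: ${{ contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') }}
      run: exit 1
```

Either way, the point is the same: the aggregate check must actually *run* and report failure, rather than being skipped, for the merge queue to block the merge.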
Amazing work figuring this out. I thought we were going to have to burn down our whole CI and ~~steal~~ adapt someone else's when we found a vaguely comparable setup.
Hopefully CI won't merge this. If it does, I'll revert it right away.
I'm starting to suspect that the issue is with our aggregate "ci status" job. Previously the ci status job was skipped if one of its dependencies failed. It seems like maybe "skipped" doesn't count as "failed"?