Add a failing test to check that merge queues work

michaelkirk commented 11 months ago

Hopefully CI won't merge this. If it does, I'll revert it right away.

I'm starting to suspect that the issue is with our aggregate "ci status" job. Previously the ci status job was skipped if one of it's dependents failed. It seems like maybe "skipped" doesn't count as "failed"?

michaelkirk commented 11 months ago

Hey! I think it worked... maybe? I'm gonna try to push up a "fix" commit and see if it will merge.

michaelkirk commented 11 months ago

IT WORKED WOW.

So here's what I think is going on:

We have a lot of different job configurations. To be usefully distinguishable, the name for each job includes info about its configuration, including its rust version, the proj version, and potentially multiple features, like proj ubuntu (georust/proj-ci:proj-9.2.1-rust-1.63, --features "network bundled_proj).

The bors.toml needed to have each one of these names. Similarly, GH status checks needs to have the name of every check. But there are a lot of names! And they can change (proj version, rust version). Recording them all and keeping them all up to date wouldn't be tenable. To make this schema tenable, we added an aggregate job ci_status which can depend on matrix jobs by their job key, not their name which meant we could avoid the combinatorial name explosion of the matrix parameters.

  ci-result:    
    name: ci result    
    runs-on: ubuntu-latest    
    needs:    
      - proj-ubuntu     // NOTE: Only one entry for all of proj-ubuntu. 
      - proj-sys-ubuntu // We don't need an entry per feature-set or rust version.
      - proj-macos    
      - proj-sys-macos    
    steps:    
      - name: Mark the job as a success    
        if: success()    
        run: exit 0    
      - name: Mark the job as a failure    
        if: "!success()"    
        run: exit 1

It's the needs key which makes the aggergate ci-status job depend on the other jobs. HOWEVER, I never really paid much attention to the fact the the Mark the job as a failure step never ran. The entire job was being skipped if one of its needs failed. "skipped" is a different status than "failed".

Since bors (apparently) considers a skip a failure, this worked as expected. HOWEVER, apparently GH merge queues don't consider skipped a failure.

So the previous experience, where I was seeing failed builds merge, what was happening was:

some specific sub-job fails
The ci-status aggregate job, which depends on that sub-job is skipped
GH merge queue doesn't see the ci-status as failed, so goes ahead and merges.

I'm not even 100% sure I'm right about this, but it's my best guess at this point. And I guess in hindsight, I'm not that surprised. The aggregate job thing is kind of uncommon, so it makes sense that we might be hitting some edge case thing that other people haven't hit enough to complain commonly about.

Edit to clarify the fix:

There are now two aggregate jobs with the same name, one which always runs on failure and one which always runs on success - its opposite will be skipped.

  ci-success:
     name: ci result
     runs-on: ubuntu-latest
     if: ${{ success() }}
     needs:
       - proj-ubuntu
       - proj-sys-ubuntu
       - proj-macos
       - proj-sys-macos
     steps:
       - name: Mark the job as a success
         run: exit 0

   ci-failure:
     name: ci result
     runs-on: ubuntu-latest
     if: ${{ failure() }}
     needs:
       - proj-ubuntu
       - proj-sys-ubuntu
       - proj-macos
       - proj-sys-macos
     steps:
       - name: Mark the job as a failure
         run: exit 1

urschrei commented 11 months ago

Amazing work figuring this out. I thought we were going to have to burn down our whole CI and ~~steal~~ adapt someone else's when we found a vaguely comparable setup.

georust / proj

Add a failing test to check that merge queues work #179