WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org

Ping Slack when CI/CD workflow fails on `main` #1032

Closed · sarayourfriend closed this 1 year ago

sarayourfriend commented 1 year ago

Problem

It's possible, due to semantic merge conflicts, for CI/CD to pass on a PR but then fail once the PR is merged to `main`. This is specifically caused by non-text-based merge conflicts ("semantic" or "conceptual" conflicts). For example, take the following possibility:

PR A adds a new usage of a function `fly_rocket` that originally accepted a string ID for the rocket. PR B updates `fly_rocket` to accept the entire `Rocket` object rather than the ID. PR B updates all existing usages.

PR B gets merged first. PR A does not need to rebase because there are no conflicts: the usage in PR A is brand new. When PR A gets merged, its commits will be rebased onto `main`, which now includes the updated `fly_rocket` from PR B. This causes a "semantic" or "conceptual" conflict that did not produce any git conflicts (which would have prevented the merge).

Description

In these cases, it can be easy to miss that the build on `main` is failing. We can add a Slack alert when things like lint (which checks types) and the image builds fail on `main`. For those jobs, something like the following step could work:

```yaml
- name: Alert failure
  if: failure()
  run: send_slack_alert "Beep boop, `main` is borked!"
```
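For a more concrete sketch, the step could post to a Slack incoming webhook via `curl`. The `SLACK_WEBHOOK_URL` secret name below is an assumption for illustration, not an existing repository secret:

```yaml
- name: Alert failure
  # Runs only when a previous step in the job has failed.
  if: failure()
  # Hypothetical sketch: post the message to a Slack incoming webhook.
  # Assumes a SLACK_WEBHOOK_URL secret pointing at the target channel's webhook.
  run: |
    curl -sf -X POST -H 'Content-type: application/json' \
      --data '{"text":"Beep boop, `main` is borked!"}' \
      "$SLACK_WEBHOOK_URL"
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```

Slack's official GitHub Action could be used instead; the raw webhook call just keeps the sketch self-contained.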

Additional context

This type of issue has happened in our repositories before, so it's not a far-fetched scenario. It has happened with Django migrations and with TypeScript types that I know of.

Alternatives

Require rebasing a PR before merging. GitHub has a feature that we can enable to enforce this. However, it can be pretty annoying for contributors, especially as it is only meant to prevent this one edge case. Maybe it is worth it, though?

dhruvkb commented 1 year ago

I like this idea a lot. Such failures are indeed possible, and currently the only notification we get is if a deploy starts and fails, not if the deploy step is never reached at all.

On a lighter note, I would love it if the message text in the implementation remains "Beep boop, `main` is borked!" 😂.

zackkrida commented 1 year ago

> Require rebasing a PR before merging. GitHub has a feature that we can enable to enforce this. However, it can be pretty annoying for contributors, especially as it is only meant to prevent this one edge case. Maybe it is worth it, though?

Given our PR merge strategy of squashing everything, we could also just require that branches be up to date, right? Specifically, branches could be brought up to date with merge commits, which would be squashed away anyway and might be a bit easier for new contributors than rebasing. GitHub also shows UI for this now, which would help:

(Screenshot: GitHub's pull request UI offering to update the branch before merging.)

I would personally be comfortable with adding the requirement and setting up some alert(s), in case staging deploys fail for reasons unrelated to code changes as well.

sarayourfriend commented 1 year ago

> Given our PR merge strategy of squashing everything, we could also just require that branches be up to date, right? Specifically, branches could be brought up to date with merge commits, which would be squashed away anyway and might be a bit easier for new contributors than rebasing. GitHub also shows UI for this now, which would help:

We could. I just find it very annoying when, say, three people are trying to merge PRs at the same time: if you're the "slowest" to do the rebase and pull the trigger, you have to wait for CI/CD to run two more times. This can take a lot of attention if you're just trying to merge a PR. I don't run into this often due to my timezone, but there are days when I see several PRs get merged all around the same time.

zackkrida commented 1 year ago

That's a great point. Lots of merges happen at the same time weekly during the community meeting!

AetherUnbound commented 1 year ago

We discussed this in the public contributor meeting today. I had the same concern as @sarayourfriend, but it seems like that case would be irrelevant if we also added merge queues. The queue would automatically handle rebasing and ensuring CI/CD passes before merging the code, meaning maintainers would not need to babysit merges if having an up-to-date branch were required. I would support adding the update requirement if merge queues also accompanied it!
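For context on what adopting merge queues would mean on the CI side, here is a minimal sketch, assuming GitHub's native merge queue feature: workflows that serve as required checks would also need to run on the `merge_group` event so the queue can validate the combined result before it lands on `main` (the trigger list below is illustrative, not our actual workflow configuration):

```yaml
# Illustrative sketch only: a required-check workflow listens for the
# merge_group event so the merge queue can run it against the temporary
# branch the queue builds before merging into main.
on:
  pull_request:
  merge_group:
    types: [checks_requested]
```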

sarayourfriend commented 1 year ago

Same as @AetherUnbound here. However, I still think we should do this issue regardless: things can fail on `main` even when everything in the PR goes well, and we should still be aware of it. For example, what if we start getting rate limited by providers during tests again, or something similarly nebulous and out of our control happens? We'll notice on PRs eventually, but we should know as soon as `main` is red.

Adding merge queues and the rebase requirement should be a separate issue in the infrastructure repository (as that would be added to the repo configuration managed there), right?

AetherUnbound commented 1 year ago

Great point, that sounds appropriate!