CI is re-running successful skipped jobs

mhofman commented 2 years ago

Describe the bug

Because integration tests are label triggered, it's possible for these actions to end up running concurrently on the exact same commit. To avoid duplicated work, we use a 3rd party action to detect duplicate jobs, and wait for the results of pending jobs on identical commits. If the previous check succeeds, the actual job should be skipped and outcome carried over.

For some reason the latter part seem to no longer work correctly. While we wait until a concurrent job completes, we no longer skip if successful, and end up re-running the job, creating a worse case scenario of checking the same thing multiple times sequentially.

To Reproduce

See jobs:

Expected behavior

Correctly skip job if concurrent check is successful.

Platform Environment

Github Actions

mhofman commented 1 year ago

To work around this issue #8253 kept the concurrent cancellation config introduced during our stint with GitHub merge queues. However this is also inefficient in some cases, e.g. when a PR label is changed while the integration tests are still running.

frazarshad commented 3 months ago

@mhofman i've tried replicating this issue by removing the concurrency key we have in integration.yml This results in the use of check_and_cancel job in pre-check-integration.yml. but it seems to working as expected. Here is the first job which completes: https://github.com/Agoric/agoric-sdk/actions/runs/10303452336 And here is the job that runs after it: https://github.com/Agoric/agoric-sdk/actions/runs/10303457883 It waits for the result as expected.

On a side note: Is our goal with this job to save time or resources? if our goal is to save resources spent on github actions, wouldnt it be more efficient to have the previous job cancel? since two jobs running side by side will use up more github action minutes

mhofman commented 3 months ago

but it seems to working as expected.

Oh that's wonderful, it looks like it doesn't re-run the job! Maybe it was a fluke. We should keep an eye on this (do we have a graph showing the runtime of the integration checks we could monitor?)

On a side note: Is our goal with this job to save time or resources?

The goal was to save time at the expense of resources. Integration jobs are long, and let's say you're into 30 minutes of a 45 minute job, apply a label unrelated (e.g. a classification label), currently we would cancel the 30 min deep job and restart from scratch. The goal was to keep that going if the PR content itself didn't change. I believe this would also happen in a much more common case of updating a PR when entering the merge queue:

push a small fixup commit and immediately apply the automerge label
all regular and integration checks would start, this would run on a simulated merge of the PR into the target
once the regular required checks would pass, the PR would enter the merge queue, and mergify would rebase onto the target (rebase strategy) or merge the target branch (squash strategy)
all regular and integration checks would start on a simulated merge of this new commit into the target branch.
For integration checks, it would realize that the content is the exact same, and would simply "watch" the status of the still running check before the rebase, then carry over the result of that check, avoiding having to restart the check from the beginning.

Agoric / agoric-sdk