Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0
327 stars 208 forks source link

CI is re-running successful skipped jobs #6466

Open mhofman opened 2 years ago

mhofman commented 2 years ago

Describe the bug

Because integration tests are label triggered, it's possible for these actions to end up running concurrently on the exact same commit. To avoid duplicated work, we use a 3rd party action to detect duplicate jobs, and wait for the results of pending jobs on identical commits. If the previous check succeeds, the actual job should be skipped and outcome carried over.

For some reason the latter part seem to no longer work correctly. While we wait until a concurrent job completes, we no longer skip if successful, and end up re-running the job, creating a worse case scenario of checking the same thing multiple times sequentially.

To Reproduce

See jobs:

Expected behavior

Correctly skip job if concurrent check is successful.

Platform Environment

mhofman commented 1 year ago

To work around this issue #8253 kept the concurrent cancellation config introduced during our stint with GitHub merge queues. However this is also inefficient in some cases, e.g. when a PR label is changed while the integration tests are still running.

frazarshad commented 3 months ago

@mhofman i've tried replicating this issue by removing the concurrency key we have in integration.yml This results in the use of check_and_cancel job in pre-check-integration.yml. but it seems to working as expected. Here is the first job which completes: https://github.com/Agoric/agoric-sdk/actions/runs/10303452336 And here is the job that runs after it: https://github.com/Agoric/agoric-sdk/actions/runs/10303457883 It waits for the result as expected.

On a side note: Is our goal with this job to save time or resources? if our goal is to save resources spent on github actions, wouldnt it be more efficient to have the previous job cancel? since two jobs running side by side will use up more github action minutes

mhofman commented 3 months ago

but it seems to working as expected.

Oh that's wonderful, it looks like it doesn't re-run the job! Maybe it was a fluke. We should keep an eye on this (do we have a graph showing the runtime of the integration checks we could monitor?)

On a side note: Is our goal with this job to save time or resources?

The goal was to save time at the expense of resources. Integration jobs are long, and let's say you're into 30 minutes of a 45 minute job, apply a label unrelated (e.g. a classification label), currently we would cancel the 30 min deep job and restart from scratch. The goal was to keep that going if the PR content itself didn't change. I believe this would also happen in a much more common case of updating a PR when entering the merge queue: