The cancel signal is now part of the step's schema, so the engine knows whether the plugin supports early termination. This is a shift-left change: an unsupported cancellation is detected up front. With the prior design, the engine would send the cancel signal without knowing whether it would have any effect.
Run IDs are passed deeper into the steps, which makes logs clearer and makes future features easier to implement.
Log labeling, so you can tell which component produced each log message.
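To illustrate the idea, here is a minimal sketch of an up-front cancellation check. It is not the engine's actual code: `StepSchema`, `SignalSchema`, `CheckCancellable`, and `ErrCancelUnsupported` are hypothetical names chosen for this example, and the sketch assumes a step that does not support early termination simply declares no cancel signal.

```go
package main

import "errors"

// SignalSchema is a hypothetical description of a signal a step accepts.
type SignalSchema struct {
	ID string
}

// StepSchema is a hypothetical step schema. CancelSignal is nil when the
// plugin does not support early termination (an assumption of this sketch).
type StepSchema struct {
	ID           string
	CancelSignal *SignalSchema
}

// ErrCancelUnsupported is returned when a step declares no cancel signal.
var ErrCancelUnsupported = errors.New("step does not declare a cancel signal")

// CheckCancellable reports, before anything runs, whether sending a cancel
// signal to the step could have any effect.
func CheckCancellable(s StepSchema) error {
	if s.CancelSignal == nil {
		return ErrCancelUnsupported
	}
	return nil
}
```

With a check like this, a workflow that relies on cancelling a step can be rejected while it is being prepared, rather than silently sending a signal that the plugin ignores.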
Validate Compatibility is implemented in the engine. This is a shift-left initiative to detect problems earlier in the execution, so you don't find out only when the last step runs. It validates that the expressions point to roughly the right thing before the dependencies can be resolved to data. For example, it now detects, before running any steps, an attempt to pass a string into an int field. This may break existing workflows, so test yours. Based on my testing, when it reports a problem, that problem is more likely a bug in your workflow than a bug in Validate Compatibility.
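The string-into-int case can be sketched roughly as follows. This is an illustrative simplification, not the engine's implementation: `TypeID` and `ValidateCompatibility` are hypothetical names, and real schemas are richer than a flat map of field names to types.

```go
package main

import "fmt"

// TypeID is a hypothetical stand-in for a schema type identifier.
type TypeID string

const (
	TypeString TypeID = "string"
	TypeInt    TypeID = "int"
)

// ValidateCompatibility compares the type each input field expects with the
// type its expression would produce, using schema information only.
// No step has to run for this check to work.
func ValidateCompatibility(expected, produced map[string]TypeID) error {
	for field, want := range expected {
		got, ok := produced[field]
		if !ok {
			return fmt.Errorf("input %q has no source expression", field)
		}
		if got != want {
			return fmt.Errorf("input %q expects %s, but its expression yields %s", field, want, got)
		}
	}
	return nil
}
```

Calling this with an input `count` that expects `TypeInt` while its expression yields `TypeString` returns an error immediately, which is the kind of failure that previously surfaced only once the consuming step was reached.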
Retries are implemented for deadlock checks. This fixes a bug where the engine reported a deadlock when all of the steps were transitioning to their first or second stages simultaneously. The bug previously occurred about once every 25 runs and caused test cases to fail spuriously.
This PR has remnants of the debugging that I did, with some help from Webb, to fix the deadlock false positive. As a result, a lot of the stage changes are rewritten in a way that Webb and I thought was clearer and more intentional. It is possible that a new bug was introduced, but I have not found any in my testing.
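The retry idea can be sketched like this: report a deadlock only if several consecutive checks agree, so that a momentary state during simultaneous transitions is not mistaken for one. The function name `confirmDeadlock`, the `looksDeadlocked` probe, and the attempt and wait parameters are all assumptions for illustration; the engine's actual check may be structured differently.

```go
package main

import "time"

// confirmDeadlock re-runs a hypothetical deadlock probe up to `attempts`
// times, waiting `wait` between attempts. It reports a deadlock only if
// every attempt sees one; a single clear result ends the check early.
func confirmDeadlock(looksDeadlocked func() bool, attempts int, wait time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if !looksDeadlocked() {
			// The earlier result was transient, not a real deadlock.
			return false
		}
		if i < attempts-1 {
			time.Sleep(wait)
		}
	}
	// With zero attempts nothing was observed, so report no deadlock.
	return attempts > 0
}
```

The trade-off is a slightly slower report for a genuine deadlock in exchange for not failing a healthy workflow that is merely mid-transition.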
Notes for testing:
Run all the tests once, then comment out the E2E tests in engine_test.go. You can then run go test ./... -count 1000, which takes about 40 seconds to run the tests 1000 times. Lowering the count to 200 brings that down to about 8 seconds. The E2E tests need to be commented out because they download images from quay on every iteration.
Some of the lint failures are also present in older versions, but the linter now enforces those rules. I'm going to refactor the code to reduce redundancy and make long-term maintenance easier.
Changes introduced with this PR
This PR adds:
stop_if
By contributing to this repository, I agree to the contribution guidelines.