Investigate why TestBlueGreenPromoteFull upstream E2E test is failing when run against Argo Rollouts installed by Argo Rollouts operator

At present, we are running the upstream Argo Rollouts E2E automated tests against argo-rollouts-manager PRs. With each PR, we:

A) install and start Argo Rollouts manager
B) clone the latest version of Argo Rollouts
C) Call `make test-e2e` in the argo-rollouts repo (cloned in the previous step) to run the Argo Rollouts E2E tests.
- Argo Rollouts then runs the E2E tests via gotestsum, a utility which intelligently runs Go automated tests (and can, for example, automatically retry failed tests).
D) Scan the results and ensure they pass.

When the Argo Rollouts tests run against our operator, most tests pass! But some fail: for example, here is a list of failures from a recent E2E test run (a test appears more than once because failed tests are retried):

--- FAIL: TestAPISIXSuite/TestAPISIXCanarySetHeaderStep (0.48s)
--- FAIL: TestAPISIXSuite/TestAPISIXCanarySetHeaderStep (0.68s)
--- FAIL: TestAPISIXSuite/TestAPISIXCanarySetHeaderStep (0.69s)
--- FAIL: TestFunctionalSuite/TestBlueGreenPromoteFull (2.74s)
--- FAIL: TestFunctionalSuite/TestBlueGreenPromoteFull (2.85s)
--- FAIL: TestFunctionalSuite/TestBlueGreenPromoteFull (2.89s)
--- FAIL: TestFunctionalSuite/TestBlueGreenPromoteFull (2.90s)
--- FAIL: TestFunctionalSuite/TestBlueGreenPromoteFull (2.92s)
--- FAIL: TestFunctionalSuite/TestBlueGreenPromoteFull (3.25s)
--- FAIL: TestFunctionalSuite/TestControllerMetrics (0.13s)
--- FAIL: TestFunctionalSuite/TestControllerMetrics (0.17s)
--- FAIL: TestFunctionalSuite/TestControllerMetrics (0.18s)

(source)
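
To see at a glance which tests are failing and how many attempts each one used, the `--- FAIL:` lines can be tallied per test name. The following is only a rough sketch: `summarize_failures` is a hypothetical helper (it is not part of argo-rollouts or this repo), and it assumes the log lines have the same shape as the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of any repo: reads test output on stdin and
# prints "<count> <test name>" for every test with at least one "--- FAIL:" line.
summarize_failures() {
  awk '
    # Fields of a failure line: "---", "FAIL:", "<Suite>/<Test>", "(<time>s)"
    $1 == "---" && $2 == "FAIL:" { count[$3]++ }
    END { for (t in count) print count[t], t }
  ' | LC_ALL=C sort -k2
}

# Example (the log file name is an assumption):
#   summarize_failures < e2e-output.log
```

Run against the list above, this would report 3 failures for TestAPISIXCanarySetHeaderStep, 6 for TestBlueGreenPromoteFull, and 3 for TestControllerMetrics.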

We expect TestControllerMetrics to fail: the test assumes that the Argo Rollouts controller is running locally (via `make start-e2e`), whereas in this case it is running in a Pod on the cluster.

However, it's not clear why TestBlueGreenPromoteFull is failing: I've glanced over the test and everything it's doing seems like it should work (and often does work, on first run).

So, this issue is to investigate why it's failing. This is also a good opportunity to dig in to Rollouts code, both the controller code and the test code.

To Reproduce:

To run a single upstream Rollouts E2E test repeatedly:

A) In hack/run-upstream-argo-rollouts-e2e-tests.sh, replace `make test-e2e` with `E2E_TEST_OPTIONS="-run 'TestFunctionalSuite' -testify.m 'TestBlueGreenPromoteFull'" "until-fail.sh" make test-e2e`
- This will run the TestBlueGreenPromoteFull test over and over, until it fails.
B) Use the following until-fail.sh script: https://gist.github.com/jgwest/7048a765d398519837f990120cf3fdd0
C) Then run hack/run-upstream-argo-rollouts-e2e-tests.sh
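
For readers who cannot reach the gist, the idea behind running a command until it fails can be sketched as below. This is an assumption about the general technique only, not a copy of the linked until-fail.sh; the function name `until_fail` and the `UNTIL_FAIL_RUNS` variable are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the linked script: rerun the given command
# until it exits non-zero, counting the successful runs along the way.
until_fail() {
  local runs=0
  while "$@"; do
    runs=$((runs + 1))
    echo "run ${runs} passed" >&2
  done
  echo "command failed after ${runs} successful run(s)" >&2
  UNTIL_FAIL_RUNS=$runs   # exposed so callers can inspect the count
  return 1
}

# Example, mirroring the invocation above:
#   E2E_TEST_OPTIONS="-run 'TestFunctionalSuite' -testify.m 'TestBlueGreenPromoteFull'" until_fail make test-e2e
```

Recording the number of passing runs before the first failure may help confirm the pattern described in this issue.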

Strangely, what I have seen is that TestBlueGreenPromoteFull will initially pass a few times, but after a few runs it switches to failing 100% of the time.
