Iterative Deployments Fail when using AVI/AKO

Hello all.

we are having a strange issue happening in our lab and are looking for some help trying to pin it down.

AKO Version - 1.7.2 AVI Controller Version = 21.1.4-9210

Issue Description - when doing CI/CD deployments of our clusters using AKO/AVI, we see that only 1 out of every 10 deployments or so "succeeds" in bringing up the VIP - meaning that its reachable beyond the cluster nodes themselves. When it "works" we see that the VSVIP, Virtual Service, Pool and SE-GROUP Engine are brought up and the VIP is routables without issue.

When it doesnt work - we see that that the VSVIP is and Virtual Service are created however, we see a couple different errors in the AVI controller:

Pools dont exist
Unable/waiting for SE Creation (and then failing on the POOLs)

Conditions - between each CI/CD run, we remove all traces of the VSVIP, Service Engine (not the group, but the SE that was broughtup), and traces of the Static Routes that were used from the Worker nodes internal network to the IPs of ETH0 of those nodes (for routing in our AVI).

if we wait a "while" a day or so - we can see that eventually this "works again"
We observe that our AKO CNI is applied and the AKO-0 POD comes up and is able to reach the AVI Controller
We observe that the VSVIP and Virtual Service is created as the LB are created in line.

Errors from AKO Logs When this issue occurs, we see Errors in the AKO PODs complaining that the Service Already exists - when as outlined, all traces (that are available to us) have been removed. We have look and implemented some code check using the go AVI SDK (to list and remove VSVIP, Virtual Service, etc) - however there is no API that lets you remove the Static Routes.

We see that AKO doesnt throw errors, but AVI complains about POOLS being down and such.

Our Thoughts it would seem that AVI Controller is 'cacheing or holding" information about previous setups somewhere that we cannot clean out, this makes doing automated end to end testing very problematic, since we never "know" when the AVI /AKO combination will work or not.

Our setup/configuration is static - doesnt change every deployment and yet, as observed, 1/10 times we deploy the path "opens correctly' - other times we have to bounce AKO pods, remove and read things in AVI controller side and its really "hard" to pin down what is the cause of this.

Our need is that we can deploy our clusters (end to end including the LB setusp for our vips) are part of our e2e testing cycle, however this instability is creating some havoc as you can imagine.

Any thoughts on where this might be "hidden" so we can remove the memory of a previous deploy and our LBs come up without issue?

avinetworks / avi-helm-charts

Iterative Deployments Fail when using AVI/AKO #169