avinetworks / avi-helm-charts

Avi Networks Helm Charts

Iterative Deployments Fail when using AVI/AKO #169

Open lmcdasm opened 1 year ago

lmcdasm commented 1 year ago

Hello all.

We are having a strange issue in our lab and are looking for some help pinning it down.

AKO version: 1.7.2
AVI Controller version: 21.1.4-9210

Issue description: when doing CI/CD deployments of our clusters using AKO/AVI, only about 1 out of every 10 deployments "succeeds" in bringing up the VIP, meaning that it is reachable beyond the cluster nodes themselves. When it works, we see that the VSVIP, Virtual Service, Pool, and the Service Engine in the SE group are brought up and the VIP is routable without issue.
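For clarity on what "succeeds" means here: after each deploy we probe the VIP from a host outside the cluster. A minimal sketch of that kind of check (the address, port, and retry budget below are placeholders, not our actual values):

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// Post-deploy probe run from a host outside the cluster: the deployment only
// counts as "working" if the VIP accepts a TCP connection from here.
// VIP_ADDR (e.g. "10.0.0.50:443") and the retry budget are placeholders.
func main() {
	vip := os.Getenv("VIP_ADDR")
	for attempt := 0; attempt < 30; attempt++ {
		conn, err := net.DialTimeout("tcp", vip, 3*time.Second)
		if err == nil {
			conn.Close()
			fmt.Println("VIP reachable")
			return
		}
		fmt.Printf("attempt %d: %v\n", attempt+1, err)
		time.Sleep(10 * time.Second)
	}
	fmt.Println("VIP never became reachable")
	os.Exit(1)
}
```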

When it doesn't work, we see that the VSVIP and Virtual Service are created; however, we also see a couple of different errors on the AVI controller:

Conditions: between each CI/CD run, we remove all traces of the VSVIP, the Service Engine (not the group, but the SE that was brought up), and the Static Routes that were created from the worker nodes' internal network to the IPs on ETH0 of those nodes (for routing in our AVI setup).

Errors from AKO logs: when this issue occurs, we see errors in the AKO pods complaining that the service already exists, even though, as outlined above, all traces (that are visible to us) have been removed. We have also implemented some cleanup code using the Go AVI SDK (to list and remove the VSVIP, Virtual Service, etc.; see the sketch below); however, there is no API that lets you remove the Static Routes.
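For reference, a trimmed-down sketch of the kind of cleanup we script with the Go AVI SDK (github.com/avinetworks/sdk); this is not our exact code, and the controller address, credentials, tenant, and API version are placeholders pulled from the CI/CD environment:

```go
package main

import (
	"log"
	"os"

	"github.com/avinetworks/sdk/go/clients"
	"github.com/avinetworks/sdk/go/session"
)

func main() {
	// Controller address, credentials, tenant and API version are placeholders;
	// in our pipeline they come from CI/CD secrets.
	client, err := clients.NewAviClient(
		os.Getenv("AVI_CONTROLLER"),
		os.Getenv("AVI_USERNAME"),
		session.SetPassword(os.Getenv("AVI_PASSWORD")),
		session.SetTenant("admin"),
		session.SetVersion("21.1.4"),
		session.SetInsecure)
	if err != nil {
		log.Fatalf("connecting to AVI controller: %v", err)
	}

	// Lab-only cleanup: delete every Virtual Service first (they reference the
	// VSVIPs), then the leftover VSVIPs from the previous run.
	vsList, err := client.VirtualService.GetAll()
	if err != nil {
		log.Fatalf("listing virtual services: %v", err)
	}
	for _, vs := range vsList {
		if err := client.VirtualService.Delete(*vs.UUID); err != nil {
			log.Printf("deleting virtual service %s: %v", *vs.Name, err)
		}
	}

	vipList, err := client.VsVip.GetAll()
	if err != nil {
		log.Fatalf("listing vsvips: %v", err)
	}
	for _, vip := range vipList {
		if err := client.VsVip.Delete(*vip.UUID); err != nil {
			log.Printf("deleting vsvip %s: %v", *vip.Name, err)
		}
	}
}
```

Even with this cleanup in place (everything we can reach through the API is gone between runs), the "already exists" errors still show up on the next deploy.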

In this state AKO itself doesn't throw errors, but AVI complains about pools being down and the like.

Our thoughts: it would seem that the AVI Controller is caching or holding information about previous setups somewhere that we cannot clean out. This makes automated end-to-end testing very problematic, since we never know whether the AVI/AKO combination will work or not.

Our setup/configuration is static; it does not change between deployments. And yet, as observed, on roughly 1 in 10 deployments the path opens correctly; the rest of the time we have to bounce the AKO pods and remove and re-add things on the AVI controller side, and it is really hard to pin down the cause.

We need to be able to deploy our clusters (end to end, including the LB setup for our VIPs) as part of our e2e testing cycle; however, this instability is creating havoc, as you can imagine.

Any thoughts on where this state might be hidden, so we can remove the memory of a previous deploy and have our LBs come up without issue?

lmcdasm commented 1 year ago

Hello.

One "workaround" that seems to "help" is the following:

This, of course, cannot be "the way" this CNI works, but at the least it shows that there are some other bits being left behind on the AVI side that we do not see.