Closed mikejoh closed 1 year ago
Using the code in this PR, https://github.com/F5Networks/k8s-bigip-ctlr/pull/2757, I took a look at the http.Client communication with the F5. Hopefully this sheds some light on what's going on in the k8s-bigip-ctlr.
The dark red/purple line that is increasing shows the 422 responses. These started when I reconfigured the TLSProfile and continued until I fixed it again. Then the blue-ish line increases by one; that is a POST that got a 200 status back. While the 422 count was increasing, no other POSTs appear to have been made, which could partly explain why we don't see any updates to pools in the F5.
Created [CONTCNTR-3796] for internal tracking.
@trinaths :wave: Any updates on this one? It would be super interesting to see if you've found something in regards to this problem. Let me know if you need any more info! Thanks!
@trinaths Has anyone had time to look into this? Thanks!
CIS performs schema validation and some basic checks; however, invalid configs, such as a VS referencing incorrect paths to BigIP resources, can only be detected after the declaration has been posted. In that case a 422 status code is returned with an error message saying that some path references are incorrect. Please ensure that the declarations you apply are valid and are applied successfully. If there is an invalid VS, either correct it or disable it by setting f5cr: "false" so that it does not block the posting of other VS configs.
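To illustrate the behaviour described above, here is a toy model, assuming that all enabled VS configs are sent together in one declaration and that the API accepts or rejects that declaration as a whole. The `virtualServer` type, the `postDeclaration` function, and the profile path `/Common/clientssl` are hypothetical and used only for illustration; this is not how CIS or AS3 are implemented internally.

```go
package main

import "fmt"

// virtualServer is a hypothetical, simplified stand-in for a VS resource.
type virtualServer struct {
	Name      string
	F5CR      string // models the f5cr label: "true" or "false"
	ClientTLS string // path of the TLS profile referenced on the BIG-IP
}

// postDeclaration models an all-or-nothing API: if any enabled VS
// references a profile that does not exist, the whole declaration is
// rejected with 422 and nothing is applied.
func postDeclaration(servers []virtualServer, profiles map[string]bool) (int, []string) {
	var applied []string
	for _, vs := range servers {
		if vs.F5CR != "true" {
			continue // disabled resources are left out of the declaration
		}
		if !profiles[vs.ClientTLS] {
			return 422, nil
		}
		applied = append(applied, vs.Name)
	}
	return 200, applied
}

func main() {
	profiles := map[string]bool{"/Common/clientssl": true}
	servers := []virtualServer{
		{Name: "healthy-vs", F5CR: "true", ClientTLS: "/Common/clientssl"},
		{Name: "broken-vs", F5CR: "true", ClientTLS: "/Common/does-not-exist"},
	}

	// One broken reference rejects the declaration, including healthy-vs.
	status, applied := postDeclaration(servers, profiles)
	fmt.Println(status, applied)

	// Disabling the broken VS lets the remaining config through.
	servers[1].F5CR = "false"
	status, applied = postDeclaration(servers, profiles)
	fmt.Println(status, applied)
}
```

Under this assumption, a single invalid reference is enough to hold back updates for every other VS in the same declaration until it is corrected or disabled.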
@trinaths This still feels like a valid issue. I understand that CIS does some basic validation and that the BIGIP does the rest when we POST to the AS3 API. What I don't understand is why, during this time, as seen in the graph I posted above, no other POSTs (or updates) get through. Only after we've fixed the 422 issue in the problematic VirtualServer (and/or related manifests) do updates to e.g. existing pool members get through.
Since https://github.com/F5Networks/k8s-bigip-ctlr/pull/2757 recently landed, we've now configured alerts through Alertmanager to give us a heads up when we get constant 422s back from the BIGIP.
Setup Details
CIS Version: v2.11.1
Build: f5networks/k8s-bigip-ctlr:v2.11.1
BIGIP Version: Big IP v15.1.7
AS3 Version: v3.38.0
Agent Mode: AS3
Orchestration: k8s
Orchestration Version: v1.24.8
Pool Mode: Cluster
Additional Setup details: Cilium as CNI.
Description
For a while (since at least v2.10.1) we've observed an issue where a configuration error in a VirtualServer, or in an adjacent TLSProfile referenced by a VirtualServer, results in 422 errors from the AS3 REST API. As long as this configuration error exists, other changes, in our case updates to pools, don't work as expected. If we redeploy a service, replacing all Pods with new Pods that have new IPs, the pools in the F5 don't get this information, which leaves the pool marked as down. This only recovers once we fix the problematic VirtualServer or, e.g., TLSProfile.
This issue is similar to #2391.
Steps To Reproduce
I'm updating an existing TLSProfile referenced in an existing VirtualServer with the following:
- reference: bigip (was secret)
- clientTLS: /Common/does-not-exist
This fails as expected, resulting in a 422 response, since the TLS profile doesn't exist in the BIGIP.
No problems so far. Next, I deleted a Pod in a Pool referenced in another VirtualServer (one that works just fine at this moment), which mimics e.g. a Deployment where all Pods are terminated and recreated. Since they'll get new Pod IPs, these need to be updated in the F5 pools. The F5 controller does pick something up at this point.
The service is now down in the F5 and the pool reports the backend as down, and I can see that it still has the old IP address the Pod had before I deleted it. If I correct the error in the TLSProfile, the F5 soon afterwards has the correct info in the Pool of that other VirtualServer.
Expected Result
If we have a configuration error in one VirtualServer or its related manifests, it shouldn't affect all the others.
Actual Result
Downtime in services deployed after the configuration error was introduced, since no updates to pools with new Pod IPs get through for some reason.
Diagnostic Information
N/A
Observations (if any)
N/A