F5Networks / k8s-bigip-ctlr

Repository for F5 Container Ingress Services for Kubernetes & OpenShift.
Apache License 2.0

Configuration error in VirtualServer or related manifests blocks pools in F5 to be updated via AS3 #2759

Closed mikejoh closed 1 year ago

mikejoh commented 1 year ago

Setup Details

CIS Version: v2.11.1
Build: f5networks/k8s-bigip-ctlr:v2.11.1
BIGIP Version: Big IP v15.1.7
AS3 Version: v3.38.0
Agent Mode: AS3
Orchestration: k8s
Orchestration Version: v1.24.8
Pool Mode: Cluster
Additional Setup details: Cilium as CNI.

Description

For a while (since at least v2.10.1) we've observed an issue where a configuration error in a VirtualServer, or in an adjacent TLSProfile referenced by a VirtualServer, results in 422 errors from the AS3 REST API. As long as this configuration error exists, other changes (in our case updates to pools) don't work as expected. If we redeploy a service, replacing all Pods with new Pods that have new IPs, the pools in the F5 don't receive this information, so the pool is marked as down. This only recovers once we fix the problematic VirtualServer or, e.g., TLSProfile.

This issue is similar to #2391.

Steps To Reproduce

  1. I'm updating an existing TLSProfile referenced in an existing VirtualServer with the following:

    • reference: bigip (was secret)
    • clientTLS: /Common/does-not-exist
      2023/02/06 20:42:52 [DEBUG] Processing Key: &{nginx TLSProfile nginx-tlsprofile 0xc0000ce1a0 Update}
      2023/02/06 20:42:52 [INFO] Enqueueing TLSProfile
  2. This fails as expected with a 422 response, since the TLS profile doesn't exist in the BIGIP:

    2023/02/06 20:42:52 [DEBUG] [AS3] PostManager Accepted the configuration
    2023/02/06 20:42:52 [DEBUG] [AS3] posting request to https://f5-lb.example.com/mgmt/shared/appsvcs/declare/cluster-partition
    2023/02/06 20:43:02 [ERROR] [AS3] Raw response from Big-IP: map[code:422 declarationFullId: message:Unable to find /Common/does-not-exist for /cluster-partition/Shared/nginx_443/serverTLS/0]
  3. No problems so far. Now I deleted a Pod in a Pool referenced by another VirtualServer (one that works just fine at this moment), which mimics e.g. a Deployment where all Pods are terminated and recreated. Since they'll get new Pod IPs, these need to be updated in the F5 pools. The F5 controller picks something up here:

    2023/02/06 20:45:07 [DEBUG] Enqueueing Endpoints: &Endpoints{ObjectMeta:{working-nginx  kube-system  65f78d08-ebb5-4440-a3e2-b70235f469a2 34329803 0 2022-12-13 08:42:47 +0000 UTC <nil> <nil> map[app.kubernetes.io/managed-by:Helm k8s-app:hubble-ui] map[] [] []  [{kube-controller-manager Update v1 2023-01-16 15:11:37 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{},"f:k8s-app":{}}}}}]},Subsets:[]EndpointSubset{},}
    2023/02/06 20:45:07 [DEBUG] Processing Key: &{kube-system Endpoints working-nginx 0xc00054a3c0 Update}
    2023/02/06 20:45:09 [DEBUG] Enqueueing Endpoints
  4. The service is now down in the F5 and the pool reports the backend as down, and I can see that it's the old IP address the Pod had before I deleted it. If I correct the error in the TLSProfile, the F5 soon afterwards has the correct info in the Pool of that other VirtualServer.
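
The 422 message in step 2 names both the missing BIG-IP object and the declared object that references it. As a minimal sketch of how that could be pulled out for triage: the `as3RefError` type, the `parseAS3RefError` function and the regular expression below are hypothetical, derived only from the single log line above, and are not part of CIS or AS3.

```go
package main

import (
	"fmt"
	"regexp"
)

// as3RefError holds the pieces of an "Unable to find ... for ..." message.
// Hypothetical type for illustration only.
type as3RefError struct {
	Missing  string // BIG-IP object that could not be resolved
	Tenant   string // first path segment (the partition in the log above)
	App      string // second path segment ("Shared" in the log above)
	Resource string // declared object that holds the bad reference
}

// The pattern is based on one observed message; differently worded
// AS3 errors will simply not match.
var unableToFind = regexp.MustCompile(`^Unable to find (\S+) for /([^/]+)/([^/]+)/([^/]+)`)

// parseAS3RefError reports ok=false when the message has another shape.
func parseAS3RefError(msg string) (as3RefError, bool) {
	m := unableToFind.FindStringSubmatch(msg)
	if m == nil {
		return as3RefError{}, false
	}
	return as3RefError{Missing: m[1], Tenant: m[2], App: m[3], Resource: m[4]}, true
}

func main() {
	msg := "Unable to find /Common/does-not-exist for /cluster-partition/Shared/nginx_443/serverTLS/0"
	if e, ok := parseAS3RefError(msg); ok {
		fmt.Printf("%s in %s references missing %s\n", e.Resource, e.Tenant, e.Missing)
	}
}
```

With the message from step 2 this identifies `nginx_443` as the object to fix, which is the VirtualServer-derived resource rather than the unrelated pool from step 3.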

Expected Result

If we have a configuration error in one of the VirtualServers or related manifests, it shouldn't affect all the others.
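
To make the expectation concrete, here is a rough sketch of one way such isolation could behave: when a post is rejected because of one resource, drop that resource and retry with the rest. This is not how CIS is implemented; `rejectError`, `postFunc` and `postIsolating` are hypothetical names, and the declaration is reduced to a plain map for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// rejectError is a hypothetical error naming the resource that was rejected.
type rejectError struct{ Resource string }

func (e *rejectError) Error() string { return "422: invalid reference in " + e.Resource }

// postFunc stands in for the AS3 POST; decl maps resource name to its config.
type postFunc func(decl map[string]string) error

// postIsolating posts decl and, whenever a single resource is rejected,
// removes that resource and retries. It returns the skipped resources.
func postIsolating(post postFunc, decl map[string]string) ([]string, error) {
	remaining := make(map[string]string, len(decl))
	for k, v := range decl {
		remaining[k] = v
	}
	var skipped []string
	for len(remaining) > 0 {
		err := post(remaining)
		if err == nil {
			return skipped, nil
		}
		var rej *rejectError
		if !errors.As(err, &rej) {
			return skipped, err // not a per-resource rejection, give up
		}
		if _, ok := remaining[rej.Resource]; !ok {
			return skipped, err // rejected resource unknown, avoid looping
		}
		delete(remaining, rej.Resource)
		skipped = append(skipped, rej.Resource)
	}
	return skipped, nil
}

func main() {
	applied := map[string]string{}
	// Simulated BIG-IP: rejects any declaration with the bad TLS reference.
	fakePost := func(decl map[string]string) error {
		if decl["nginx_443"] == "/Common/does-not-exist" {
			return &rejectError{Resource: "nginx_443"}
		}
		for k, v := range decl {
			applied[k] = v
		}
		return nil
	}
	decl := map[string]string{
		"nginx_443":     "/Common/does-not-exist",
		"working_nginx": "pool member with new Pod IP",
	}
	skipped, err := postIsolating(fakePost, decl)
	fmt.Println("skipped:", skipped, "err:", err, "applied:", applied)
}
```

In this sketch the broken resource is reported as skipped while the pool update for the healthy one still goes through, which is the behaviour the issue asks for.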

Actual Result

Downtime in services deployed after the configuration error was introduced, since no updates to pools with new Pod IPs get through for some reason.

Diagnostic Information

N/A

Observations (if any)

N/A

mikejoh commented 1 year ago

By using the code in this PR https://github.com/F5Networks/k8s-bigip-ctlr/pull/2757 I took a look at the http.Client communication with the F5. Hopefully this sheds some light on what's going on in the k8s-bigip-ctlr.

The dark red/purple line that is increasing shows the 422 responses. It started rising when I reconfigured the TLSProfile and kept rising until I fixed it again. Then the blueish line increases by one; that is a POST that got a 200 status back. While the 422 count was increasing, it seems no other POSTs were made, which could partly explain why we don't see any updates to pools in the F5.

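[image: graph of responses from the F5, with one line for 422 responses and one for POSTs that returned 200]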

trinaths commented 1 year ago

Created [CONTCNTR-3796] for internal tracking.

mikejoh commented 1 year ago

@trinaths :wave: Any updates on this one? It would be super interesting to see if you've found something in regards to this problem. Let me know if you need any more info! Thanks!

mikejoh commented 1 year ago

@trinaths Has anyone had time to look into this? Thanks!

trinaths commented 1 year ago

CIS does schema validation and some basic validations. However, invalid configs, such as incorrect paths to BigIP resources referenced in a VS, can only be found after posting the declaration. In this case a 422 status code is returned with an error message saying that some path references are incorrect. Ensure that the declarations being applied are valid and successfully applied. If there is any invalid VS, either correct it or disable it by setting f5cr: "false" so that it doesn't block posting of other VS configs.
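
The f5cr workaround can be pictured as a label filter applied before the declaration is built. The sketch below assumes that only resources labelled f5cr: "true" are picked up; the `resource` type and `filterSelected` function are hypothetical and not taken from the CIS source.

```go
package main

import "fmt"

// resource is a hypothetical stand-in for a VirtualServer or TLSProfile.
type resource struct {
	Name   string
	Labels map[string]string
}

// selected reports whether the resource carries the label f5cr: "true".
// Assumption: any other value, or a missing label, excludes the resource.
func selected(r resource) bool {
	return r.Labels["f5cr"] == "true"
}

// filterSelected keeps only the resources that would be processed.
func filterSelected(all []resource) []resource {
	var out []resource
	for _, r := range all {
		if selected(r) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	all := []resource{
		{Name: "nginx", Labels: map[string]string{"f5cr": "false"}}, // disabled while broken
		{Name: "working-nginx", Labels: map[string]string{"f5cr": "true"}},
	}
	for _, r := range filterSelected(all) {
		fmt.Println("would be posted:", r.Name)
	}
}
```

Under that assumption, relabelling the broken VirtualServer removes it from the declaration, so the remaining configs can be posted again.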

mikejoh commented 1 year ago

> CIS does the schema and some basic validations, however invalid configs like incorrect paths of some resources of BigIP being referenced in VS can only be found only after posting the declaration. In this case a 422 status code is returned with error message of some path references are incorrect. Ensure that the declaration that they are applying are valid and successfully applied, if there is any invalid VS then either correct it or disable it by setting f5cr: "false" so that it's not blocking posting of other VS configs.

@trinaths This still feels like a valid issue. I understand that CIS does some basic validation and that the BIGIP does the rest when we POST to the AS3 API. What I don't understand is why, during this time, as seen in the graph I posted above, no other POSTs (or updates) get through. Only after we've fixed the 422 issue in the problematic VirtualServer (and/or related manifests) do updates to e.g. existing pool members get through.

Since https://github.com/F5Networks/k8s-bigip-ctlr/pull/2757 recently landed, we've now configured alerts through Alertmanager to give us a heads-up when we have constant 422s returning from the BIGIP.