F5Networks / k8s-bigip-ctlr

Repository for F5 Container Ingress Services for Kubernetes & OpenShift.
Apache License 2.0

LB Services stay stuck in Pending state #3443

Open visokoo opened 1 month ago

visokoo commented 1 month ago

Setup Details

CIS Version: 2.16.1
Build: f5networks/k8s-bigip-ctlr:latest
BIGIP Version: BIG-IP 15.1.10.2 Build 0.44.2 Engineering Hotfix
AS3 Version: 3.44.0
Agent Mode: AS3
Orchestration: K8S
Orchestration Version: 1.27.12+rke2r1
Pool Mode: Nodeport
Additional Setup details:

f5-bigip-ctlr

spec:
  volumes:
    - name: bigip-creds
      secret:
        secretName: f5-bigip-ctlr-login
        defaultMode: 420
    - name: kube-api-access-v7f48
      projected:
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              name: kube-root-ca.crt
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
        defaultMode: 420
  containers:
    - name: f5-bigip-ctlr
      image: f5networks/k8s-bigip-ctlr:latest
      command:
        - /app/bin/k8s-bigip-ctlr
      args:
        - '--credentials-directory'
        - /tmp/creds
        - '--bigip-partition=k8s-dc2-dev'
        - '--bigip-url=10.160.128.60'
        - '--custom-resource-mode=true'
        - '--insecure=true'
        - '--ipam=true'
        - '--log-as3-response=true'
        - '--log-level=DEBUG'
        - '--pool-member-type=nodeport'

f5-ipam-controller

spec:
  volumes:
    - name: infoblox-creds
      secret:
        secretName: infoblox-credentials
        defaultMode: 420
    - name: kube-api-access-6g96z
      projected:
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              name: kube-root-ca.crt
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
        defaultMode: 420
  containers:
    - name: f5-ipam-controller
      image: f5networks/f5-ipam-controller:0.1.10
      command:
        - /app/bin/f5-ipam-controller
      args:
        - '--orchestration=kubernetes'
        - '--ipam-provider=infoblox'
        - '--infoblox-wapi-version=2.11.3'
        - '--infoblox-labels'
        - '{"vips":{"cidr":"10.160.151.0/24"}}'
        - '--infoblox-netview=OT-US'
        - '--credentials-directory'
        - /tmp/creds
        - '--log-level=DEBUG'
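
For what it's worth, this is roughly how we sanity-check the handshake between CIS and FIC. The IPAM CustomResource group/kind (`ipams.fic.f5.com`) is what we see in our cluster with these versions, so treat it as an assumption that may differ elsewhere:

# List the IPAM custom resource that CIS creates and FIC updates
# (resource group/kind as observed with CIS 2.16.1 / FIC 0.1.10 in our cluster)
kubectl get ipams.fic.f5.com -A

# Dump it to compare the requests coming from CIS (spec) with the addresses granted by FIC (status)
kubectl get ipams.fic.f5.com -A -o yaml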

VirtualServer Spec

spec:
  host: '*.redacted'
  httpTraffic: allow
  ipamLabel: vips
  pools:
    - monitor:
        interval: 10
        send: /
        timeout: 31
        type: http
      path: /
      service: rke2-ingress-nginx-controller
      servicePort: 80
  profileWebSocket: /Common/websocket
  tlsProfileName: edge-dev
  virtualServerAddress: 10.160.151.4
  virtualServerHTTPPort: 80
  virtualServerHTTPSPort: 443

Description

We're using the f5-bigip-ctlr and f5-ipam-controller (Infoblox) Helm charts to create LoadBalancer-type K8s Services for applications. We've noticed intermittent behavior: when we redeploy a service with the same name (deleting it first), only one of the 3 LoadBalancer-type Services comes back as `Active` while the others stay stuck in `Pending`.

kubectl get services -n fs-platform
NAME                                        TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                                                                                                                                                                                  AGE
factory-software-testing9-client            ClusterIP      10.43.9.2       <none>           7000/TCP,7001/TCP,7199/TCP,10001/TCP,9180/TCP,5090/TCP,9100/TCP,9042/TCP,9142/TCP,19042/TCP,19142/TCP,9160/TCP                                                                           45h
factory-software-testing9-us-east-2-ng1-0   LoadBalancer   10.43.79.224    <pending>        7000:31676/TCP,7001:32373/TCP,7199:30680/TCP,10001:31969/TCP,9180:30469/TCP,5090:32109/TCP,9100:31469/TCP,9042:30594/TCP,9142:31835/TCP,19042:32473/TCP,19142:31958/TCP,9160:32258/TCP   45h
factory-software-testing9-us-east-2-ng1-1   LoadBalancer   10.43.235.226   <pending>        7000:31441/TCP,7001:30657/TCP,7199:31350/TCP,10001:30535/TCP,9180:31226/TCP,5090:31295/TCP,9100:32288/TCP,9042:30675/TCP,9142:30133/TCP,19042:30305/TCP,19142:32591/TCP,9160:31797/TCP   45h
factory-software-testing9-us-east-2-ng1-2   LoadBalancer   10.43.203.116   10.160.151.119   7000:30549/TCP,7001:32751/TCP,7199:31903/TCP,10001:30850/TCP,9180:32009/TCP,5090:31543/TCP,9100:30006/TCP,9042:31968/TCP,9142:30581/TCP,19042:30350/TCP,19142:32028/TCP,9160:31130/TCP   45h                            
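
For triage we compare a healthy Service with a stuck one by looking at its Events and at `status.loadBalancer`; a minimal check using the service names from the output above:

# Healthy service: EXTERNAL-IP populated and status.loadBalancer.ingress set
kubectl describe svc factory-software-testing9-us-east-2-ng1-2 -n fs-platform

# Stuck service: stays <pending>; check whether the controller ever patched the LB status
kubectl get svc factory-software-testing9-us-east-2-ng1-0 -n fs-platform \
  -o jsonpath='{.status.loadBalancer}{"\n"}'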

This is the test manifest I'm using:

apiVersion: scylla.scylladb.com/v1
kind: ScyllaCluster
metadata:
  name: factory-software-testing9
  namespace: fs-platform
spec:
  agentRepository: docker.io/scylladb/scylla-manager-agent
  agentVersion: 3.2.6
  datacenter:
    name: us-east-2
    racks:
      - agentResources:
          requests:
            cpu: 50m
            memory: 10M
        members: 3
        name: ng1
        resources:
          limits:
            cpu: 2
            memory: 8Gi
          requests:
            cpu: 2
            memory: 8Gi
        scyllaAgentConfig: scylla-agent-config
        scyllaConfig: scylla-config
        storage:
          capacity: 128Gi
          storageClassName: sc-tier1
        volumeMounts:
          - mountPath: /tmp/coredumps
            name: coredumpfs
        volumes:
          - hostPath:
              path: /tmp/coredumps
            name: coredumpfs
  exposeOptions:
    broadcastOptions:
      clients:
        type: ServiceLoadBalancerIngress
      nodes:
        type: ServiceClusterIP
    nodeService:
      annotations:
        cis.f5.com/health: '{"interval": 10, "timeout": 31}'
        cis.f5.com/ipamLabel: vips
      type: LoadBalancer
  network:
    hostNetworking: false
  repository: docker.io/scylladb/scylla
  sysctls:
    - fs.aio-max-nr=2097152
  version: 5.4.3
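
As far as we understand, the Scylla operator copies the `nodeService` annotations onto each generated member Service, which is what CIS keys off. This is how we confirm the annotation actually landed on the generated Services (note the escaped dots in the `cis.f5.com/ipamLabel` key):

# Confirm the CIS annotations were propagated to the generated LoadBalancer Services
kubectl get svc -n fs-platform \
  -o custom-columns='NAME:.metadata.name,IPAM_LABEL:.metadata.annotations.cis\.f5\.com/ipamLabel'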

When I change the name of the service, it deploys fine: everything gets the appropriate EXTERNAL-IP assigned and the Services go into Active.

Steps To Reproduce

1) Deploy f5-cis and f5-ipam-controller with the above specs.
2) Deploy the test ScyllaDB instance and see everything eventually go green because the K8s Services all come up.
3) Confirm that all F5 VS' come up with the right VIP.
4) Delete the test ScyllaDB instance and confirm that all F5 VS' are gone and the IPAM entry in the DB is gone as well.
5) Redeploy the test ScyllaDB instance and confirm that a few K8s Services stay stuck in Pending state even though the IPAM picks up the same IPs again to be divvied out (a condensed shell version of this cycle is sketched below).
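
A condensed shell version of the delete/redeploy part of that cycle (the manifest filename is just a placeholder for the ScyllaCluster spec above):

# 2) Deploy the test ScyllaCluster and wait for all LoadBalancer Services to get an EXTERNAL-IP
kubectl apply -f scylla-testing9.yaml        # placeholder filename for the manifest above
kubectl get svc -n fs-platform -w

# 4) Delete it and confirm the F5 VirtualServers and the IPAM entries are cleaned up
kubectl delete scyllacluster factory-software-testing9 -n fs-platform

# 5) Redeploy with the same name; some Services now stay <pending>
kubectl apply -f scylla-testing9.yaml
kubectl get svc -n fs-platform -w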

Expected Result

A redeployed application's K8s LoadBalancer Services should go into Active state and get an EXTERNAL-IP assigned so the pods can proceed.

Actual Result

Some of the redeployed K8s Services stay in Pending state and never become Active, so the downstream pods never reach the Ready state.

Diagnostic Information

Attached logs:

- debug_all_service_green_f5-cis_scrubbed.log
- debug_all_service_green_f5-ipam_scrubbed.log
- debug_delete_deploy_f5-cis_scrubbed.log
- debug_delete_deploy_f5-ipam_scrubbed.log
- debug_redeploy_old_service_deploy_fail_f5-cis_scrubbed.log
- debug_redeploy_old_service_deploy_fail_f5-ipam_scrubbed.log

Observations (if any)

We were originally using the ipamLabel option as well to make sure that DNS names weren't conflicting in Infoblox, but removed it for testing.

mdditt2000 commented 1 month ago

Thanks @visokoo for opening this issue. Added @trinaths. Maybe next week we should schedule some time to troubleshoot this. @visokoo please reach out to us at automation_toolchain_pm@f5.com

visokoo commented 1 month ago

Reached out to you via that alias. Would appreciate the support, thank you!

charanm08 commented 3 weeks ago

Hi @visokoo, we have started working on this issue and are able to reproduce it.

trinaths commented 2 weeks ago

Created [CONTCNTR-4751] for internal tracking.