Kong / gateway-operator

Kubernetes Operator for Kong Gateways
Apache License 2.0

Provisioning of second Gateway stuck/slow #219

Closed MatthiasWinzeler closed 2 months ago

MatthiasWinzeler commented 2 months ago

Current Behavior

We want to deploy two Gateways using the Gateway operator. While the first gateway is programmed successfully, the second gateway never comes up and seems to be stuck somewhere.

Expected Behavior

Both Gateways are programmed successfully.

Steps To Reproduce

We are following the Getting started documentation: https://docs.konghq.com/gateway-operator/latest/get-started/kic/install/

More precisely, these are the steps we run, which can be used to reproduce the problem:

helm repo add kong https://charts.konghq.com
helm repo update kong
helm upgrade --install kgo kong/gateway-operator -n kong-system --create-namespace --set image.tag=1.2
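
(Optional sanity check, not part of the original steps: before creating any Gateway resources, the operator pod should be Running in kong-system.)

kubectl get pods -n kong-system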

echo '
kind: GatewayConfiguration
apiVersion: gateway-operator.konghq.com/v1beta1
metadata:
  name: kong
  namespace: default
spec:
  dataPlaneOptions:
    deployment:
      podTemplateSpec:
        spec:
          containers:
          - name: proxy
            image: kong:3.6.1
            readinessProbe:
              initialDelaySeconds: 1
              periodSeconds: 1
  controlPlaneOptions:
    deployment:
      podTemplateSpec:
        spec:
          containers:
          - name: controller
            image: kong/kubernetes-ingress-controller:3.1.3
            env:
            - name: CONTROLLER_LOG_LEVEL
              value: debug
---
kind: GatewayClass
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: kong
spec:
  controllerName: konghq.com/gateway-operator
  parametersRef:
    group: gateway-operator.konghq.com
    kind: GatewayConfiguration
    name: kong
    namespace: default
' | kubectl apply -f -

Then, we create the first gateway:

echo '
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: gw1
  namespace: default
spec:
  gatewayClassName: kong
  listeners:
    - name: http
      protocol: HTTP
      port: 80
' | kubectl apply -f -

This gateway comes up successfully:

k get gateway -n default | grep gw1
gw1               kong     20.250.194.216   True         80s

k get all -n default | grep gw1
pod/controlplane-gw1-75vrb-cfdmz-6f4ff8fff7-t2hjf   1/1     Running   0          65s
pod/dataplane-gw1-fbh78-2nxfw-765b85d8bb-km4mv      1/1     Running   0          65s
service/controlplane-webhook-gw1-75vrb-m4nww   ClusterIP      10.11.199.79   <none>           8080/TCP                     65s
service/dataplane-admin-gw1-fbh78-zvvn9        ClusterIP      None           <none>           8444/TCP                     66s
service/dataplane-ingress-gw1-fbh78-2j498      LoadBalancer   10.11.49.248   20.250.194.216   80:32347/TCP                 66s
deployment.apps/controlplane-gw1-75vrb-cfdmz   1/1     1            1           65s
deployment.apps/dataplane-gw1-fbh78-2nxfw      1/1     1            1           65s
replicaset.apps/controlplane-gw1-75vrb-cfdmz-6f4ff8fff7   1         1         1       65s
replicaset.apps/dataplane-gw1-fbh78-2nxfw-765b85d8bb      1         1         1       65s
controlplane.gateway-operator.konghq.com/gw1-75vrb   True    True
dataplane.gateway-operator.konghq.com/gw1-fbh78   True

logs:
k logs kgo-gateway-operator-controller-manager-667f9444ff-k99c7 -n kong-system
...
{"level":"info","ts":"2024-04-24T08:35:49Z","logger":"controlplane","msg":"gateway accepted","controller":"gateway","controllerGroup":"gateway.networking.k8s.io","controllerKind":"Gateway","Gateway":{"name":"gw1","namespace":"default"},"namespace":"default","name":"gw1","reconcileID":"4b18c160-3257-4605-99ff-1b4a32d30ec4","namespace":"default","name":"gw1"}
{"level":"info","ts":"2024-04-24T08:35:49Z","logger":"controlplane","msg":"no ingress services found for dataplane","controller":"gateway","controllerGroup":"gateway.networking.k8s.io","controllerKind":"Gateway","Gateway":{"name":"gw1","namespace":"default"},"namespace":"default","name":"gw1","reconcileID":"e792c4ad-4a4f-48e9-8f79-9370ef758f68","namespace":"default","name":"gw1","dataplane":{"name":"gw1-fbh78","namespace":"default"}}
{"level":"info","ts":"2024-04-24T08:35:50Z","logger":"controlplane","msg":"no ingress services found for dataplane","controller":"gateway","controllerGroup":"gateway.networking.k8s.io","controllerKind":"Gateway","Gateway":{"name":"gw1","namespace":"default"},"namespace":"default","name":"gw1","reconcileID":"55962592-cc0d-46f4-8464-4f297773a68c","namespace":"default","name":"gw1","dataplane":{"name":"gw1-fbh78","namespace":"default"}}

Then, we create the second gateway:

echo '
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: gw2
  namespace: default
spec:
  gatewayClassName: kong
  listeners:
    - name: http
      protocol: HTTP
      port: 80
' | kubectl apply -f -

However, this gateway never comes up:

k get gateway -n default | grep gw2
gw2               kong                      False        45s

The data plane and control plane objects do not seem to be provisioned properly:

k get all -n default | grep gw2
pod/dataplane-gw2-n5rw5-l6fn2-664fb588d5-bk2kt      0/1     Running   0          90s
service/controlplane-webhook-gw2-l4fh4-bx7f7   ClusterIP      10.11.174.178   <none>           8080/TCP                     89s
service/dataplane-admin-gw2-n5rw5-fjwzn        ClusterIP      None            <none>           8444/TCP                     91s
service/dataplane-ingress-gw2-n5rw5-j7bdx      LoadBalancer   10.11.72.62     20.250.194.154   80:30843/TCP                 90s
deployment.apps/dataplane-gw2-n5rw5-l6fn2      0/1     1            0           90s
replicaset.apps/dataplane-gw2-n5rw5-l6fn2-664fb588d5      1         1         0       90s
controlplane.gateway-operator.konghq.com/gw2-l4fh4   False   False
dataplane.gateway-operator.konghq.com/gw2-n5rw5   False

k get dataplane gw2-n5rw5 -o yaml -n default
apiVersion: gateway-operator.konghq.com/v1beta1
kind: DataPlane
metadata:
  creationTimestamp: "2024-04-24T08:38:29Z"
  generateName: gw2-
  generation: 1
  labels:
    gateway-operator.konghq.com/managed-by: gateway
  name: gw2-n5rw5
  namespace: default
  ownerReferences:
  - apiVersion: gateway.networking.k8s.io/v1
    controller: true
    kind: Gateway
    name: gw2
    uid: 82c7fc8d-83a1-46ee-9539-0d56c639db5c
  resourceVersion: "13624119"
  uid: 7f7f4e07-0512-49b1-bb5b-673339e2a531
spec:
  deployment:
    podTemplateSpec:
      metadata: {}
      spec:
        containers:
        - image: kong:3.6.1
          name: proxy
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /status/ready
              port: 8100
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources: {}
    replicas: 1
  network:
    services:
      ingress:
        ports:
        - name: http
          port: 80
          targetPort: 8000
        type: LoadBalancer
status:
  addresses:
  - sourceType: PublicLoadBalancer
    type: IPAddress
    value: 20.250.194.154
  - sourceType: PrivateIP
    type: IPAddress
    value: 10.11.72.62
  conditions:
  - lastTransitionTime: "2024-04-24T08:38:30Z"
    message: 'Waiting for the resource to become ready: Deployment dataplane-gw2-n5rw5-l6fn2
      is not ready yet'
    observedGeneration: 1
    reason: WaitingToBecomeReady
    status: "False"
    type: Ready
  readyReplicas: 0
  replicas: 1
  selector: 49dbbe2d-a081-4fcb-86f6-3ba3dc967c13
  service: dataplane-ingress-gw2-n5rw5-j7bdx

k get controlplane gw2-l4fh4 -n default -o yaml
apiVersion: gateway-operator.konghq.com/v1beta1
kind: ControlPlane
metadata:
  creationTimestamp: "2024-04-24T08:38:30Z"
  finalizers:
  - gateway-operator.konghq.com/cleanup-clusterrole
  - gateway-operator.konghq.com/cleanup-clusterrolebinding
  - gateway-operator.konghq.com/cleanup-validatingwebhookconfiguration
  generateName: gw2-
  generation: 2
  labels:
    gateway-operator.konghq.com/managed-by: gateway
  name: gw2-l4fh4
  namespace: default
  ownerReferences:
  - apiVersion: gateway.networking.k8s.io/v1
    controller: true
    kind: Gateway
    name: gw2
    uid: 82c7fc8d-83a1-46ee-9539-0d56c639db5c
  resourceVersion: "13623955"
  uid: fa12ae25-67de-453e-8b34-f9bf1bd7a5c6
spec:
  dataplane: gw2-n5rw5
  deployment:
    podTemplateSpec:
      metadata: {}
      spec:
        containers:
        - env:
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
          - name: CONTROLLER_ANONYMOUS_REPORTS
            value: "true"
          - name: POD_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.name
          - name: CONTROLLER_GATEWAY_API_CONTROLLER_NAME
            value: konghq.com/gateway-operator
          - name: CONTROLLER_PUBLISH_SERVICE
            value: default/dataplane-ingress-gw2-n5rw5-j7bdx
          - name: CONTROLLER_KONG_ADMIN_SVC
            value: default/dataplane-admin-gw2-n5rw5-fjwzn
          - name: CONTROLLER_KONG_ADMIN_SVC_PORT_NAMES
            value: admin
          - name: CONTROLLER_GATEWAY_DISCOVERY_DNS_STRATEGY
            value: service
          - name: CONTROLLER_KONG_ADMIN_INIT_RETRY_DELAY
            value: 5s
          - name: CONTROLLER_GATEWAY_TO_RECONCILE
            value: default/gw2
          - name: CONTROLLER_KONG_ADMIN_TLS_CLIENT_CERT_FILE
            value: /var/cluster-certificate/tls.crt
          - name: CONTROLLER_KONG_ADMIN_TLS_CLIENT_KEY_FILE
            value: /var/cluster-certificate/tls.key
          - name: CONTROLLER_KONG_ADMIN_CA_CERT_FILE
            value: /var/cluster-certificate/ca.crt
          - name: CONTROLLER_ELECTION_ID
            value: gw2-l4fh4.konghq.com
          - name: CONTROLLER_ADMISSION_WEBHOOK_LISTEN
            value: 0.0.0.0:8080
          image: kong/kubernetes-ingress-controller:3.1.2
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /healthz
              port: 10254
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: controller
          ports:
          - containerPort: 10254
            name: health
            protocol: TCP
          - containerPort: 8080
            name: webhook
            protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /readyz
              port: 10254
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 200m
              memory: 100Mi
            requests:
              cpu: 100m
              memory: 20Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /var/cluster-certificate
            name: cluster-certificate
            readOnly: true
          - mountPath: /admission-webhook
            name: admission-webhook-certificate
            readOnly: true
    replicas: 1
  gatewayClass: kong
status:
  conditions:
  - lastTransitionTime: "2024-04-24T08:38:30Z"
    message: There are other conditions that are not yet ready
    observedGeneration: 2
    reason: DependenciesNotReady
    status: "False"
    type: Ready
  - lastTransitionTime: "2024-04-24T08:38:30Z"
    message: ControlPlane resource is scheduled for provisioning
    reason: PodsNotReady
    status: "False"
    type: Provisioned

The logs show the following:

k logs kgo-gateway-operator-controller-manager-667f9444ff-k99c7 -n kong-system | grep gw2
{"level":"info","ts":"2024-04-24T08:38:29Z","logger":"controlplane","msg":"gateway accepted","controller":"gateway","controllerGroup":"gateway.networking.k8s.io","controllerKind":"Gateway","Gateway":{"name":"gw2","namespace":"default"},"namespace":"default","name":"gw2","reconcileID":"28ef9bd4-bddf-414a-afa8-de4157339c66","namespace":"default","name":"gw2"}
{"level":"info","ts":"2024-04-24T08:38:30Z","logger":"controlplane","msg":"no ingress services found for dataplane","controller":"gateway","controllerGroup":"gateway.networking.k8s.io","controllerKind":"Gateway","Gateway":{"name":"gw2","namespace":"default"},"namespace":"default","name":"gw2","reconcileID":"f37d692d-e50b-4551-9a6b-a7d31d413244","namespace":"default","name":"gw2","dataplane":{"name":"gw2-n5rw5","namespace":"default"}}
{"level":"info","ts":"2024-04-24T08:38:30Z","logger":"controlplane","msg":"no ingress services found for dataplane","controller":"gateway","controllerGroup":"gateway.networking.k8s.io","controllerKind":"Gateway","Gateway":{"name":"gw2","namespace":"default"},"namespace":"default","name":"gw2","reconcileID":"1dc9e337-2460-4e52-b85a-75b79416f3a6","namespace":"default","name":"gw2","dataplane":{"name":"gw2-n5rw5","namespace":"default"}}

It complains about not finding any ingress services, but the service is clearly there - maybe a bug in the operator?

k get svc -n default | grep gw2 | grep ingress
dataplane-ingress-gw2-n5rw5-j7bdx      LoadBalancer   10.11.72.62     20.250.194.154   80:30843/TCP                 5m34s

Operator Version

Tried with image.tag=1.2 (as described in the getting started guide) and also with the latest release, --set image.tag=1.2.3 --set image.repository=docker.io/kong/gateway-operator-oss, but the issue remains the same.

kubectl version

Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.3

We are running on AKS 1.28 with Cilium (which also provides a GW, maybe that interferes).

pmalek commented 2 months ago

Hi @MatthiasWinzeler,

I've just tried to reproduce it and I couldn't with 1.2.3 using your GatewayClass and GatewayConfiguration with 2 Gateways.

echo '
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: gw1
  namespace: default
spec:
  gatewayClassName: kong
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: gw2
  namespace: default
spec:
  gatewayClassName: kong
  listeners:
    - name: http
      protocol: HTTP
      port: 80
' | kubectl apply -f -

One thing that stood out to me is that you define kong/kubernetes-ingress-controller:3.1.3 in the GatewayConfiguration but the Gateway is running kong/kubernetes-ingress-controller:3.1.2, which didn't happen for me (I got the expected image).

Can you look into the DataPlane, ControlPlane and Gateway status fields and see if there's anything there that could suggest a culprit? You can use this guide: https://docs.konghq.com/gateway-operator/latest/production/monitoring/status/gateway/.
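
For example, something along these lines (using the resource names from your earlier output) would dump the relevant status conditions:

kubectl get gateway gw2 -n default -o jsonpath='{.status.conditions}'
kubectl get controlplane gw2-l4fh4 -n default -o jsonpath='{.status.conditions}'
kubectl get dataplane gw2-n5rw5 -n default -o jsonpath='{.status.conditions}'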

Please also include KGO debug logs (you can enable those by setting --set env.zap_log_level=2 or --set env.zap_log_level=debug; the latter is less verbose).
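
For example, the full command would look roughly like this (same release name and namespace as in your reproduction steps):

helm upgrade --install kgo kong/gateway-operator -n kong-system --set image.tag=1.2 --set env.zap_log_level=debug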


We are running on AKS 1.28 with Cilium (which also provides a GW, maybe that interferes).

This shouldn't be relevant, assuming that Cilium's controllers respect the GatewayClass's controllerName field.
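
One way to double-check which controller claims which class (plain kubectl, nothing operator-specific):

kubectl get gatewayclass
kubectl get gatewayclass kong -o jsonpath='{.spec.controllerName}'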

MatthiasWinzeler commented 2 months ago

@pmalek Thanks for getting back to me!

One thing that stood out to me is that you define kong/kubernetes-ingress-controller:3.1.3 in the GatewayConfiguration but the Gateway is running kong/kubernetes-ingress-controller:3.1.2, which didn't happen for me (I got the expected image).

You're right - in my debugging steps afterwards, I actually removed the whole controlPlaneOptions part, which causes it to fall back to 3.1.2, but the issue remains the same. I think that should not matter, right?

I turned on debug logs and realized that the control plane deployment is actually created, just much later. Sometimes it takes around 20 minutes, sometimes a bit less. I captured the logs of an attempt that took around 13 minutes - the whole log is here: https://gist.github.com/MatthiasWinzeler/69469600275264989ddc7f3db4e10b5a

What's interesting is one place where the operator seems to be stuck:

kgo-gateway-operator-controller-manager-89d754fc5-w6sdm manager {"level":"debug","ts":"2024-04-24T15:22:12Z","logger":"controlplane.dataplaneProvisioning","msg":"dataplane config updated","controller":"gateway","controllerGroup":"gateway.networking.k8s.io","controllerKind":"Gateway","Gateway":{"name":"gw2","namespace":"default"},"namespace":"default","name":"gw2","reconcileID":"fbc7449a-b391-4039-b1c1-f816752b2e72","namespace":"default","name":"gw2"}
...
kgo-gateway-operator-controller-manager-89d754fc5-w6sdm manager {"level":"debug","ts":"2024-04-24T15:30:54Z","logger":"controlplane","msg":"deployment for ControlPlane created","controller":"controlplane","controllerGroup":"gateway-operator.konghq.com","controllerKind":"ControlPlane","ControlPlane":{"name":"gw2-b8z9d","namespace":"default"},"namespace":"default","name":"gw2-b8z9d","reconcileID":"52e8b757-a4e2-4369-89fe-b80b1008ad62","namespace":"default","name":"gw2-b8z9d","deployment":"controlplane-gw2-b8z9d-t22kl"}

It's waiting around 8 minutes before creating the deployment of the control plane. Any idea what could happen in this timeframe?

I can see lots of "patching existing ValidatingWebhookConfiguration" messages in the meantime.

There's another wait further down:

kgo-gateway-operator-controller-manager-89d754fc5-w6sdm manager {"level":"debug","ts":"2024-04-24T15:31:05Z","logger":"controlplane.dataplaneProvisioning","msg":"dataplane config updated","controller":"gateway","controllerGroup":"gateway.networking.k8s.io","controllerKind":"Gateway","Gateway":{"name":"gw2","namespace":"default"},"namespace":"default","name":"gw2","reconcileID":"e384ced3-1772-427e-9f5b-19eab59b9ea5","namespace":"default","name":"gw2"}
kgo-gateway-operator-controller-manager-89d754fc5-w6sdm manager {"level":"debug","ts":"2024-04-24T15:35:54Z","logger":"controlplane","msg":"patching ControlPlane status","controller":"controlplane","controllerGroup":"gateway-operator.konghq.com","controllerKind":"ControlPlane","ControlPlane":{"name":"gw2-b8z9d","namespace":"default"},"namespace":"default","name":"gw2-b8z9d","reconcileID":"3b8ed356-8f94-4696-b3d2-8130a048eb1f","namespace":"default","name":"gw2-b8z9d","status":{"conditions":[{"type":"Ready","status":"True","observedGeneration":2,"lastTransitionTime":"2024-04-24T15:35:54Z","reason":"Ready","message":""},{"type":"Provisioned","status":"True","observedGeneration":2,"lastTransitionTime":"2024-04-24T15:35:54Z","reason":"PodsReady","message":"pods for all Deployments are ready"}]}}

It's waiting almost 5 minutes here. I thought maybe the cluster is overloaded, but the nodes are pretty idle.

I've also added the output/status of the kube resources to the gist. Many thanks already for your help!

MatthiasWinzeler commented 2 months ago

I wonder if the "patching existing ValidatingWebhookConfiguration" messages could be due to some leftovers of the Kong Ingress Controller that was running in this cluster before. Maybe some webhook configuration that wasn't properly cleaned up?
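
If it's worth checking, something like this should list any leftover Kong webhook configurations (just an idea, I haven't verified it's the right place to look):

kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep -i kong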

pmalek commented 2 months ago

I wonder if the "patching existing ValidatingWebhookConfiguration" messages could be due to some leftovers of the Kong Ingress Controller that was running in this cluster before. Maybe some webhook configuration that wasn't properly cleaned up?

Potentially, but what's most likely happening is that something outside of KGO is updating the ControlPlane's ValidatingWebhookConfiguration ObjectMeta, which is not properly enforced.

#225 should fix that.
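
If you want to observe that churn on your side, watching the webhook configuration's resourceVersion should show it being bumped over and over (a rough check with plain kubectl):

kubectl get validatingwebhookconfigurations -w -o custom-columns=NAME:.metadata.name,RESOURCEVERSION:.metadata.resourceVersion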

pmalek commented 2 months ago

@MatthiasWinzeler You can test the fix using a nightly or a concrete SHA-based image: https://hub.docker.com/layers/kong/gateway-operator-oss/sha-9015ff6/images/sha256-9d85254612065a1f7ba702d8d595712c729db95cd8ac8d4f6441c3686497c826?context=explore.

MatthiasWinzeler commented 2 months ago

@pmalek thanks!

I tried this image but the issue persists. For instance, just now I have a gateway that's been stuck with Programmed = false for about 35 minutes.

To rule out any conflicts with Cilium or with older KIC versions that were previously on the cluster, I deployed a vanilla, fresh AKS 1.28 cluster (without Cilium, but with the Azure CNI) and I get the same issue. Any idea how we could investigate this further?

pmalek commented 2 months ago

I see. I don't have an Azure cluster readily available for testing but I'll see what I can do.

What you can do in the meantime is check whether the latest nightly image changed what's being logged, and post anything you observe. You could also use something like https://github.com/ahmetb/kubectl-tree to print all the dependent objects of the Gateway and see where it's stuck (is it the DataPlane or the ControlPlane?).
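
For example (the plugin can typically be installed via krew; the Gateway name is the one from your report):

kubectl krew install tree
kubectl tree gateway gw2 -n default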

I'd also look at the DataPlane Service; you can find it in the status:

kg dataplane -n default kong-dgw79   -o jsonpath-as-json='{.status}'
[
    {
        "addresses": [
            {
                "sourceType": "PrivateLoadBalancer",
                "type": "IPAddress",
                "value": "172.18.128.2"
            },
            {
                "sourceType": "PrivateIP",
                "type": "IPAddress",
                "value": "10.96.199.53"
            }
        ],
        "conditions": [
            {
                "lastTransitionTime": "2024-04-29T09:12:47Z",
                "message": "",
                "observedGeneration": 1,
                "reason": "Ready",
                "status": "True",
                "type": "Ready"
            }
        ],
        "readyReplicas": 2,
        "replicas": 2,
        "selector": "afbb8e60-996e-4f35-b733-ca743323da42",
        "service": "dataplane-ingress-kong-dgw79-5cgqt"
    }
]

and see whether the Service has an LB created for it and ready. It might take a while for the cloud provider to create the LB and all the necessary resources along with it.
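
For example, with the Service name from your earlier output, something like this would show whether the cloud provider has assigned the load balancer yet:

kubectl get svc dataplane-ingress-gw2-n5rw5-j7bdx -n default -o jsonpath='{.status.loadBalancer.ingress}'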

MatthiasWinzeler commented 2 months ago

hi @pmalek

It seems that the control plane is not getting created (and since the data plane requires a control plane to come up, it is stuck too). You can find all the YAML output of the related objects in the gist above.

I just captured another try on a fresh AKS cluster with the latest nightly build (image tag sha-1b2f7ee-amd64) while it's stuck for 11 minutes and counting:

k get controlplane
NAME        READY   PROVISIONED
gw1-xtxhp   True    True
gw2-c6psk   False   False

k get deployment
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
controlplane-gw1-xtxhp-4ps99   1/1     1            1           44h
dataplane-gw1-5khtj-ksnp9      1/1     1            1           44h
dataplane-gw2-ffpn8-jn4xr      0/1     1            0           11m <-- stuck for 11m

k get controlplane -o wide
NAME        READY   PROVISIONED
gw1-xtxhp   True    True
gw2-c6psk   False   False <-- stuck

Please see the following gist for all output YAMLs and logs: https://gist.github.com/MatthiasWinzeler/80769bde7c20b31a67f0f88b1c7b9510

If it makes things easier for you, I'm also available for pairing - or I can try to give you access to our AKS cluster :)

MatthiasWinzeler commented 2 months ago

The output of kubectl tree for the stuck gateway is as follows (cool tool, by the way!):

kubectl tree gateway gw2
NAMESPACE  NAME                                                            READY  REASON                AGE
default    Gateway/gw2                                                     False  DependenciesNotReady  62m
default    ├─ControlPlane/gw2-c6psk                                        False  DependenciesNotReady  62m
default    │ ├─Secret/controlplane-gw2-c6psk-4twpt                         -                            62m
default    │ ├─Secret/controlplane-gw2-c6psk-hxgph                         -                            62m
default    │ ├─Service/controlplane-webhook-gw2-c6psk-g4nmm                -                            62m
default    │ │ └─EndpointSlice/controlplane-webhook-gw2-c6psk-g4nmm-sfrxl  -                            62m
default    │ └─ServiceAccount/controlplane-gw2-c6psk-zf9dq                 -                            62m
default    └─DataPlane/gw2-ffpn8                                           False  WaitingToBecomeReady  62m
default      ├─Deployment/dataplane-gw2-ffpn8-jn4xr                        -                            62m
default      │ └─ReplicaSet/dataplane-gw2-ffpn8-jn4xr-76cc57c7df           -                            62m
default      │   └─Pod/dataplane-gw2-ffpn8-jn4xr-76cc57c7df-kgb8n          False  ContainersNotReady    62m
default      ├─Secret/dataplane-gw2-ffpn8-nvh94                            -                            62m
default      ├─Service/dataplane-admin-gw2-ffpn8-792c7                     -                            62m
default      │ └─EndpointSlice/dataplane-admin-gw2-ffpn8-792c7-d44j2       -                            62m
default      └─Service/dataplane-ingress-gw2-ffpn8-kwjmc                   -                            62m
default        └─EndpointSlice/dataplane-ingress-gw2-ffpn8-kwjmc-79lxz     -                            62m

pmalek commented 2 months ago

It appears that we've still been pushing the images from our old repo (I've disabled that now), which is why you see:

kgo-gateway-operator-controller-manager-776ff5bb95-n5f78 manager {"level":"info","ts":"2024-04-30T08:36:42Z","logger":"setup","msg":"starting controller manager","release":"nightly-amd64","repo":"https://github.com/Kong/gateway-operator-archive.git","commit":"1b2f7ee305cdeb7e27bc66c990e8d32d36292f38"}

in the logs, and not this (example output):

{"level":"info","ts":"2024-04-30T10:00:31Z","logger":"setup","msg":"starting controller manager","release":"1.2.3-arm64","repo":"https://github.com/Kong/gateway-operator.git","commit":"ab27c2e00e238b7efd6af674f3213da17d8dedb6"} 

If you want to test nightly you can try:

https://hub.docker.com/layers/kong/gateway-operator-oss/sha-8f8c621/images/sha256-594550d922473f7c1da70c5abcf8c38f69b4581d9577e15c15f16e3bde594372?context=explore

which is the image for the latest commit in this repo: https://github.com/Kong/gateway-operator/commit/8f8c621c13db8165568df2c65f4a7ad0e11c4010

MatthiasWinzeler commented 2 months ago

@pmalek good catch - I changed it and am using the image sha-8f8c621 now:

{"level":"info","ts":"2024-04-30T10:05:38Z","logger":"setup","msg":"starting controller manager","release":"nightly-amd64","repo":"https://github.com/Kong/gateway-operator.git","commit":"8f8c621c13db8165568df2c65f4a7ad0e11c4010"}

However, after deleting and recreating the gateway, it is still stuck:

kubectl tree gateway gw2
NAMESPACE  NAME                                                            READY  REASON                AGE
default    Gateway/gw2                                                     False  DependenciesNotReady  2m50s
default    ├─ControlPlane/gw2-t4vhv                                        False  DependenciesNotReady  2m50s
default    │ ├─Secret/controlplane-gw2-t4vhv-dr7w4                         -                            2m49s
default    │ ├─Secret/controlplane-gw2-t4vhv-nd97l                         -                            2m49s
default    │ ├─Service/controlplane-webhook-gw2-t4vhv-2xxxk                -                            2m49s
default    │ │ └─EndpointSlice/controlplane-webhook-gw2-t4vhv-2xxxk-6v52r  -                            2m49s
default    │ └─ServiceAccount/controlplane-gw2-t4vhv-8m8mh                 -                            2m50s
default    └─DataPlane/gw2-n6dx5                                           False  WaitingToBecomeReady  2m50s
default      ├─Deployment/dataplane-gw2-n6dx5-2b9z6                        -                            2m50s
default      │ └─ReplicaSet/dataplane-gw2-n6dx5-2b9z6-768b66c9f5           -                            2m50s
default      │   └─Pod/dataplane-gw2-n6dx5-2b9z6-768b66c9f5-mf7jw          False  ContainersNotReady    2m50s
default      ├─Secret/dataplane-gw2-n6dx5-p88wm                            -                            2m50s
default      ├─Service/dataplane-admin-gw2-n6dx5-2c6c4                     -                            2m50s
default      │ └─EndpointSlice/dataplane-admin-gw2-n6dx5-2c6c4-fxpb6       -                            2m50s
default      └─Service/dataplane-ingress-gw2-n6dx5-qlxtq                   -                            2m50s
default        └─EndpointSlice/dataplane-ingress-gw2-n6dx5-qlxtq-jqpll     -                            2m50s

Do you need any other debug info?

pmalek commented 2 months ago

Not sure ATM.

What seems weird is that the Deployment for the ControlPlane doesn't get created at all.

The only way I was able to reproduce that was by setting a resource request that's too high (higher than the default limit), but even then this got logged:

2024-04-30T14:40:45.464+0200 - ERROR - Reconciler error - {"controller": "controlplane", "controllerGroup": "gateway-operator.konghq.com", "controllerKind": "ControlPlane", "ControlPlane": {"name":"kong-psc94","namespace":"default"}, "namespace": "default", "name": "kong-psc94", "reconcileID": "f30d24e7-f068-4876-a11d-3076b0a79696", "error": "failed creating ControlPlane Deployment : Deployment.apps \"controlplane-kong-psc94-c92xc\" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: \"16000Mi\": must be less than or equal to memory limit of 100Mi"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /Users/patryk.malek@konghq.com/.gvm/pkgsets/go1.22.2/global/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /Users/patryk.malek@konghq.com/.gvm/pkgsets/go1.22.2/global/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /Users/patryk.malek@konghq.com/.gvm/pkgsets/go1.22.2/global/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227

Setting both limits and requests too high would still create the Deployment, and its Pods would then get stuck in a Pending state.
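
For reference, this is roughly how the controller resources could be overridden via the GatewayConfiguration to rule that out (a sketch only; it mirrors the podTemplateSpec structure from your reproduction steps and the default values shown in your ControlPlane dump):

kind: GatewayConfiguration
apiVersion: gateway-operator.konghq.com/v1beta1
metadata:
  name: kong
  namespace: default
spec:
  controlPlaneOptions:
    deployment:
      podTemplateSpec:
        spec:
          containers:
          - name: controller
            resources:
              requests:
                cpu: 100m
                memory: 20Mi
              limits:
                cpu: 200m
                memory: 100Mi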

Can you look into ?


For reference, here is a working object hierarchy under a Gateway:

NAMESPACE  NAME                                                             READY  REASON  AGE
default    Gateway/kong                                                     True   Ready   132m
default    ├─ControlPlane/kong-vs42s                                        True   Ready   130m
default    │ ├─Deployment/controlplane-kong-vs42s-9pj95                     -              130m
default    │ │ ├─ReplicaSet/controlplane-kong-vs42s-9pj95-64957d6f86        -              130m
default    │ │ └─ReplicaSet/controlplane-kong-vs42s-9pj95-7ccbd87d9         -              30s
default    │ │   └─Pod/controlplane-kong-vs42s-9pj95-7ccbd87d9-zltmk        True           30s
default    │ ├─Secret/controlplane-kong-vs42s-485pr                         -              130m
default    │ ├─Secret/controlplane-kong-vs42s-w2grp                         -              130m
default    │ ├─Service/controlplane-webhook-kong-vs42s-vzqbc                -              130m
default    │ │ └─EndpointSlice/controlplane-webhook-kong-vs42s-vzqbc-sbptb  -              130m
default    │ └─ServiceAccount/controlplane-kong-vs42s-929sj                 -              130m
default    ├─DataPlane/kong-58h4l                                           True   Ready   132m
default    │ ├─Deployment/dataplane-kong-58h4l-c4lpq                        -              132m
default    │ │ └─ReplicaSet/dataplane-kong-58h4l-c4lpq-55ddd75fc7           -              132m
default    │ │   ├─Pod/dataplane-kong-58h4l-c4lpq-55ddd75fc7-7tnmb          True           132m
default    │ │   └─Pod/dataplane-kong-58h4l-c4lpq-55ddd75fc7-swqk5          True           132m
default    │ ├─Secret/dataplane-kong-58h4l-rk5xh                            -              132m
default    │ ├─Service/dataplane-admin-kong-58h4l-ftrx5                     -              132m
default    │ │ └─EndpointSlice/dataplane-admin-kong-58h4l-ftrx5-g94wt       -              132m
default    │ └─Service/dataplane-ingress-kong-58h4l-wt2zk                   -              132m
default    │   └─EndpointSlice/dataplane-ingress-kong-58h4l-wt2zk-7jqb8     -              132m
default    └─NetworkPolicy/kong-58h4l-limit-admin-api-dlsll                 -              132m

MatthiasWinzeler commented 2 months ago

@pmalek

There are some interesting warnings in the events:

61s         Normal    KongConfigurationSucceeded   pod/controlplane-gw1-xtxhp-4ps99-65864757d8-cj9z2   successfully applied Kong configuration to https://10-0-4-37.dataplane-admin-gw1-5khtj-n4cv4.default.svc:8444
55s         Normal    EnsuringLoadBalancer         service/dataplane-ingress-gw2-22pwd-7bj6m           Ensuring load balancer
55s         Normal    Scheduled                    pod/dataplane-gw2-22pwd-dvczf-f5948c849-t86ws       Successfully assigned default/dataplane-gw2-22pwd-dvczf-f5948c849-t86ws to aks-default-42903236-vmss000002
55s         Warning   FailedToCreateEndpoint       endpoints/dataplane-ingress-gw2-22pwd-7bj6m         Failed to create endpoint for service default/dataplane-ingress-gw2-22pwd-7bj6m: endpoints "dataplane-ingress-gw2-22pwd-7bj6m" already exists
55s         Normal    SuccessfulCreate             replicaset/dataplane-gw2-22pwd-dvczf-f5948c849      Created pod: dataplane-gw2-22pwd-dvczf-f5948c849-t86ws
55s         Normal    ScalingReplicaSet            deployment/dataplane-gw2-22pwd-dvczf                Scaled up replica set dataplane-gw2-22pwd-dvczf-f5948c849 to 1
54s         Normal    Created                      pod/dataplane-gw2-22pwd-dvczf-f5948c849-t86ws       Created container proxy
54s         Normal    Started                      pod/dataplane-gw2-22pwd-dvczf-f5948c849-t86ws       Started container proxy
54s         Normal    Pulled                       pod/dataplane-gw2-22pwd-dvczf-f5948c849-t86ws       Container image "kong:3.6.1" already present on machine
54s         Warning   OwnerRefInvalidNamespace     clusterrolebinding/controlplane-gw2-zwk5n-sktws     ownerRef [gateway-operator.konghq.com/v1beta1/ControlPlane, namespace: , name: gw2-zwk5n, uid: 3ec3d543-e2b4-46fa-a91b-1806b3473153] does not exist in namespace ""
54s         Warning   OwnerRefInvalidNamespace     clusterrole/gw2-zwk5n-5h7tn                         ownerRef [gateway-operator.konghq.com/v1beta1/ControlPlane, namespace: , name: gw2-zwk5n, uid: 3ec3d543-e2b4-46fa-a91b-1806b3473153] does not exist in namespace ""
53s         Warning   OwnerRefInvalidNamespace     validatingwebhookconfiguration/gw2-zwk5n            ownerRef [gateway-operator.konghq.com/v1beta1/ControlPlane, namespace: , name: gw2-zwk5n, uid: 3ec3d543-e2b4-46fa-a91b-1806b3473153] does not exist in namespace ""
45s         Normal    EnsuredLoadBalancer          service/dataplane-ingress-gw2-22pwd-7bj6m           Ensured load balancer
4s          Warning   Unhealthy                    pod/dataplane-gw2-22pwd-dvczf-f5948c849-t86ws       Readiness probe failed: HTTP probe failed with statuscode: 503
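
In case the FailedToCreateEndpoint warning is relevant, this is how the existing Endpoints/EndpointSlices for that Service could be inspected (just an idea, I haven't dug into it):

kubectl get endpoints,endpointslices -n default | grep gw2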

The nodes look like they have plenty of headroom:

k top nodes
NAME                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-default-42903236-vmss000002   620m         32%    1534Mi          28%
aks-default-42903236-vmss000003   90m          4%     1839Mi          34%

Attached you'll find the trace logs of a stuck gateway creation: kgo.txt

pmalek commented 2 months ago

The OwnerRefInvalidNamespace issue is tracked in #72. FailedToCreateEndpoint is interesting 🤔

In any case, the logs indicate a similar problem as before: a perpetual patch to the ValidatingWebhookConfiguration. We've had issues in the past where the operator code would not fill in the defaults, which would cause this cycle (comparing the generated resource with the already existing one would always yield a non-empty diff), but that is now covered in https://github.com/Kong/gateway-operator/blob/a81e1218cc5007a1d0bd2ac69244141200e51cee/pkg/utils/kubernetes/resources/zz_generated_kic_validatingwebhookconfig.go#L63.

When I find more time I can try spinning my own Azure cluster for testing.

pmalek commented 2 months ago

@MatthiasWinzeler #239 is the issue that you've hit. Let's move the discussion there.

MatthiasWinzeler commented 2 months ago

@pmalek I am very glad to hear you found the issue. Let me know if I can test something!