Closed: MatthiasWinzeler closed this issue 2 months ago.
Hi @MatthiasWinzeler,
I've just tried to reproduce it and I couldn't with 1.2.3, using your GatewayClass and GatewayConfiguration with 2 Gateways.
echo '
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: gw1
  namespace: default
spec:
  gatewayClassName: kong
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: gw2
  namespace: default
spec:
  gatewayClassName: kong
  listeners:
  - name: http
    protocol: HTTP
    port: 80
' | kubectl apply -f -
One thing that stood out to me is that you define kong/kubernetes-ingress-controller:3.1.3 in GatewayConfiguration but the Gateway is running kong/kubernetes-ingress-controller:3.1.2, which for me didn't occur (I did get the expected image).
Can you try looking into the DataPlane, ControlPlane and Gateway status fields and see if there's anything there that could suggest a culprit? You can use this guide: https://docs.konghq.com/gateway-operator/latest/production/monitoring/status/gateway/.
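Something along these lines should dump the relevant conditions (object and namespace names here are just examples):
kubectl get gateway -n default gw2 -o jsonpath-as-json='{.status.conditions}'
kubectl get controlplane -n default -o jsonpath-as-json='{.items[*].status.conditions}'
kubectl get dataplane -n default -o jsonpath-as-json='{.items[*].status.conditions}'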
Please also include KGO debug logs (you can enable those by setting --set env.zap_log_level=2 or --set env.zap_log_level=debug, the latter being less verbose).
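If you installed via the Helm chart, that would roughly be (release, chart and namespace names are assumed here, following the getting started guide):
helm upgrade --install kgo kong/gateway-operator -n kong-system --set env.zap_log_level=2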
We are running on AKS 1.28 with Cilium (which also provides a GW, maybe that interferes).
This shouldn't be relevant assuming that Cilium's controllers respect the GatewayClass's controllerName field.
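For reference, the GatewayClass that KGO reconciles should look roughly like this (controllerName value taken from the KGO getting started docs; Cilium's own classes use a different controllerName, so the two shouldn't clash):
kind: GatewayClass
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: kong
spec:
  controllerName: konghq.com/gateway-operator
  parametersRef:
    group: gateway-operator.konghq.com
    kind: GatewayConfiguration
    name: kong
    namespace: default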
@pmalek Thanks for getting back to me!
One thing that stood out to me is that you define kong/kubernetes-ingress-controller:3.1.3 in GatewayConfiguration but the Gateway is running kong/kubernetes-ingress-controller:3.1.2 which for me didn't occur ( I did get the expected image ).
You're right - in my debugging steps afterwards, I actually removed the whole controlPlaneOptions part, which causes it to fall back to 3.1.2, but the issue remains the same. I think that should not matter, right?
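For completeness, the part I removed looked roughly like this (a sketch based on the getting started example for GatewayConfiguration):
kind: GatewayConfiguration
apiVersion: gateway-operator.konghq.com/v1beta1
metadata:
  name: kong
  namespace: default
spec:
  controlPlaneOptions:
    deployment:
      podTemplateSpec:
        spec:
          containers:
          - name: controller
            image: kong/kubernetes-ingress-controller:3.1.3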
I turned on debug logs and I realized that the control plane deployment is actually created, just way later. Sometimes it takes around 20 minutes, sometimes a little bit less. I captured the logs of a try that takes around 13 minutes - I put the whole log here: https://gist.github.com/MatthiasWinzeler/69469600275264989ddc7f3db4e10b5a
What's interesting is one place where the operator seems to be stuck:
kgo-gateway-operator-controller-manager-89d754fc5-w6sdm manager {"level":"debug","ts":"2024-04-24T15:22:12Z","logger":"controlplane.dataplaneProvisioning","msg":"dataplane config updated","controller":"gateway","controllerGroup":"gateway.networking.k8s.io","controllerKind":"Gateway","Gateway":{"name":"gw2","namespace":"default"},"namespace":"default","name":"gw2","reconcileID":"fbc7449a-b391-4039-b1c1-f816752b2e72","namespace":"default","name":"gw2"}
...
kgo-gateway-operator-controller-manager-89d754fc5-w6sdm manager {"level":"debug","ts":"2024-04-24T15:30:54Z","logger":"controlplane","msg":"deployment for ControlPlane created","controller":"controlplane","controllerGroup":"gateway-operator.konghq.com","controllerKind":"ControlPlane","ControlPlane":{"name":"gw2-b8z9d","namespace":"default"},"namespace":"default","name":"gw2-b8z9d","reconcileID":"52e8b757-a4e2-4369-89fe-b80b1008ad62","namespace":"default","name":"gw2-b8z9d","deployment":"controlplane-gw2-b8z9d-t22kl"}
It's waiting around 8 minutes before creating the deployment of the control plane. Any idea what could happen in this timeframe?
I can see lots of "patching existing ValidatingWebhookConfiguration" in the meantime.
There's another wait of around 4 minutes further down:
kgo-gateway-operator-controller-manager-89d754fc5-w6sdm manager {"level":"debug","ts":"2024-04-24T15:31:05Z","logger":"controlplane.dataplaneProvisioning","msg":"dataplane config updated","controller":"gateway","controllerGroup":"gateway.networking.k8s.io","controllerKind":"Gateway","Gateway":{"name":"gw2","namespace":"default"},"namespace":"default","name":"gw2","reconcileID":"e384ced3-1772-427e-9f5b-19eab59b9ea5","namespace":"default","name":"gw2"}
kgo-gateway-operator-controller-manager-89d754fc5-w6sdm manager {"level":"debug","ts":"2024-04-24T15:35:54Z","logger":"controlplane","msg":"patching ControlPlane status","controller":"controlplane","controllerGroup":"gateway-operator.konghq.com","controllerKind":"ControlPlane","ControlPlane":{"name":"gw2-b8z9d","namespace":"default"},"namespace":"default","name":"gw2-b8z9d","reconcileID":"3b8ed356-8f94-4696-b3d2-8130a048eb1f","namespace":"default","name":"gw2-b8z9d","status":{"conditions":[{"type":"Ready","status":"True","observedGeneration":2,"lastTransitionTime":"2024-04-24T15:35:54Z","reason":"Ready","message":""},{"type":"Provisioned","status":"True","observedGeneration":2,"lastTransitionTime":"2024-04-24T15:35:54Z","reason":"PodsReady","message":"pods for all Deployments are ready"}]}}
It's waiting almost 5 minutes here. I thought maybe the cluster is overloaded, but the nodes are pretty idle.
I also added the output/status of the kube resources to the gist. Many thanks already for your help!
I wonder if the "patching existing ValidatingWebhookConfiguration" could be due to some leftovers of the Kong Ingress Controller that was running in this cluster before? Maybe some webhook config that's not properly cleaned up?
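One way to check for leftovers would be something like:
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep -i kong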
I wonder if the "patching existing ValidatingWebhookConfiguration" could be due to some leftovers of the Kong Ingress Controller that was running in this cluster before? Maybe some webhook config that's not properly cleaned up?
Potentially, but what's most likely happening is that something outside of KGO is updating the ControlPlane's ValidatingWebhookConfiguration ObjectMeta, which is not properly enforced.
@MatthiasWinzeler You can test the fix using a nightly or a concrete sha-based image: https://hub.docker.com/layers/kong/gateway-operator-oss/sha-9015ff6/images/sha256-9d85254612065a1f7ba702d8d595712c729db95cd8ac8d4f6441c3686497c826?context=explore.
@pmalek thanks!
I tried this image but the issue persists. For instance just now, I have a gateway that's stuck with Programmed = false for about 35 minutes.
To rule out any conflicts with older KIC versions that were on the cluster previously and Cilium, I deployed a vanilla, fresh AKS 1.28 cluster (without Cilium, but with the Azure CNI) and I get the same issue. Any idea how we could further investigate this?
I see. I don't have an Azure cluster readily available for testing but I'll see what I can do.
What you can do in the meantime is check whether the latest nightly image changed what's being logged and post anything you've observed. You could also use something like https://github.com/ahmetb/kubectl-tree to print all the dependent objects of the Gateway to see where we're stuck (is it the DataPlane or ControlPlane?).
I'd also look at the DataPlane Service; you can get it from the status:
kg dataplane -n default kong-dgw79 -o jsonpath-as-json='{.status}'
[
    {
        "addresses": [
            {
                "sourceType": "PrivateLoadBalancer",
                "type": "IPAddress",
                "value": "172.18.128.2"
            },
            {
                "sourceType": "PrivateIP",
                "type": "IPAddress",
                "value": "10.96.199.53"
            }
        ],
        "conditions": [
            {
                "lastTransitionTime": "2024-04-29T09:12:47Z",
                "message": "",
                "observedGeneration": 1,
                "reason": "Ready",
                "status": "True",
                "type": "Ready"
            }
        ],
        "readyReplicas": 2,
        "replicas": 2,
        "selector": "afbb8e60-996e-4f35-b733-ca743323da42",
        "service": "dataplane-ingress-kong-dgw79-5cgqt"
    }
]
and see if the Service has an LB created for it and ready. It might be that it takes a while for the cloud provider to create an LB and all the necessary resources along with it.
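To check that, something like this should show whether an external IP has been assigned yet (service name taken from the status above):
kubectl get service -n default dataplane-ingress-kong-dgw79-5cgqt -o jsonpath-as-json='{.status.loadBalancer}'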
Hi @pmalek, it seems that the control plane is not getting created (and since the data plane requires a control plane to come up, it is stuck too). You can find all the YAML output of the related objects in the gist above.
I just captured another try on a fresh AKS cluster with the latest nightly build (image tag sha-1b2f7ee-amd64) while it's stuck for 11 minutes and counting:
k get controlplane
NAME READY PROVISIONED
gw1-xtxhp True True
gw2-c6psk False False
k get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
controlplane-gw1-xtxhp-4ps99 1/1 1 1 44h
dataplane-gw1-5khtj-ksnp9 1/1 1 1 44h
dataplane-gw2-ffpn8-jn4xr 0/1 1 0 11m <-- stuck for 11m
k get controlplane -o wide
NAME READY PROVISIONED
gw1-xtxhp True True
gw2-c6psk False False <-- stuck
Please see the following gist for all output YAMLs and logs: https://gist.github.com/MatthiasWinzeler/80769bde7c20b31a67f0f88b1c7b9510
I'm also available for pairing if that's easier for you - or I can try to give you access to our AKS cluster :)
The output of kubectl tree for the stuck gateway is as follows (cool tool by the way!):
kubectl tree gateway gw2
NAMESPACE NAME READY REASON AGE
default Gateway/gw2 False DependenciesNotReady 62m
default ├─ControlPlane/gw2-c6psk False DependenciesNotReady 62m
default │ ├─Secret/controlplane-gw2-c6psk-4twpt - 62m
default │ ├─Secret/controlplane-gw2-c6psk-hxgph - 62m
default │ ├─Service/controlplane-webhook-gw2-c6psk-g4nmm - 62m
default │ │ └─EndpointSlice/controlplane-webhook-gw2-c6psk-g4nmm-sfrxl - 62m
default │ └─ServiceAccount/controlplane-gw2-c6psk-zf9dq - 62m
default └─DataPlane/gw2-ffpn8 False WaitingToBecomeReady 62m
default ├─Deployment/dataplane-gw2-ffpn8-jn4xr - 62m
default │ └─ReplicaSet/dataplane-gw2-ffpn8-jn4xr-76cc57c7df - 62m
default │ └─Pod/dataplane-gw2-ffpn8-jn4xr-76cc57c7df-kgb8n False ContainersNotReady 62m
default ├─Secret/dataplane-gw2-ffpn8-nvh94 - 62m
default ├─Service/dataplane-admin-gw2-ffpn8-792c7 - 62m
default │ └─EndpointSlice/dataplane-admin-gw2-ffpn8-792c7-d44j2 - 62m
default └─Service/dataplane-ingress-gw2-ffpn8-kwjmc - 62m
default └─EndpointSlice/dataplane-ingress-gw2-ffpn8-kwjmc-79lxz - 62m
It appears that we've still been pushing the images from our old repo (I've disabled that now), which is why you see:
kgo-gateway-operator-controller-manager-776ff5bb95-n5f78 manager {"level":"info","ts":"2024-04-30T08:36:42Z","logger":"setup","msg":"starting controller manager","release":"nightly-amd64","repo":"https://github.com/Kong/gateway-operator-archive.git","commit":"1b2f7ee305cdeb7e27bc66c990e8d32d36292f38"}
in the logs and not (example output):
{"level":"info","ts":"2024-04-30T10:00:31Z","logger":"setup","msg":"starting controller manager","release":"1.2.3-arm64","repo":"https://github.com/Kong/gateway-operator.git","commit":"ab27c2e00e238b7efd6af674f3213da17d8dedb6"}
If you want to test nightly you can try the image for the latest commit in this repo: https://github.com/Kong/gateway-operator/commit/8f8c621c13db8165568df2c65f4a7ad0e11c4010
@pmalek good catch - I changed it and am using the image sha-8f8c621 now:
{"level":"info","ts":"2024-04-30T10:05:38Z","logger":"setup","msg":"starting controller manager","release":"nightly-amd64","repo":"https://github.com/Kong/gateway-operator.git","commit":"8f8c621c13db8165568df2c65f4a7ad0e11c4010"}
However, after deleting and creating the gateway again, it still is stuck:
kubectl tree gateway gw2
NAMESPACE NAME READY REASON AGE
default Gateway/gw2 False DependenciesNotReady 2m50s
default ├─ControlPlane/gw2-t4vhv False DependenciesNotReady 2m50s
default │ ├─Secret/controlplane-gw2-t4vhv-dr7w4 - 2m49s
default │ ├─Secret/controlplane-gw2-t4vhv-nd97l - 2m49s
default │ ├─Service/controlplane-webhook-gw2-t4vhv-2xxxk - 2m49s
default │ │ └─EndpointSlice/controlplane-webhook-gw2-t4vhv-2xxxk-6v52r - 2m49s
default │ └─ServiceAccount/controlplane-gw2-t4vhv-8m8mh - 2m50s
default └─DataPlane/gw2-n6dx5 False WaitingToBecomeReady 2m50s
default ├─Deployment/dataplane-gw2-n6dx5-2b9z6 - 2m50s
default │ └─ReplicaSet/dataplane-gw2-n6dx5-2b9z6-768b66c9f5 - 2m50s
default │ └─Pod/dataplane-gw2-n6dx5-2b9z6-768b66c9f5-mf7jw False ContainersNotReady 2m50s
default ├─Secret/dataplane-gw2-n6dx5-p88wm - 2m50s
default ├─Service/dataplane-admin-gw2-n6dx5-2c6c4 - 2m50s
default │ └─EndpointSlice/dataplane-admin-gw2-n6dx5-2c6c4-fxpb6 - 2m50s
default └─Service/dataplane-ingress-gw2-n6dx5-qlxtq - 2m50s
default └─EndpointSlice/dataplane-ingress-gw2-n6dx5-qlxtq-jqpll - 2m50s
Do you need any other debug info?
Not sure ATM.
What seems weird is that the Deployment for the ControlPlane doesn't seem to get created.
The only way I was able to reproduce that was by setting a resource request that was too high (higher than the default limit), but even then this got logged:
2024-04-30T14:40:45.464+0200 - ERROR - Reconciler error - {"controller": "controlplane", "controllerGroup": "gateway-operator.konghq.com", "controllerKind": "ControlPlane", "ControlPlane": {"name":"kong-psc94","namespace":"default"}, "namespace": "default", "name": "kong-psc94", "reconcileID": "f30d24e7-f068-4876-a11d-3076b0a79696", "error": "failed creating ControlPlane Deployment : Deployment.apps \"controlplane-kong-psc94-c92xc\" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: \"16000Mi\": must be less than or equal to memory limit of 100Mi"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/Users/patryk.malek@konghq.com/.gvm/pkgsets/go1.22.2/global/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/Users/patryk.malek@konghq.com/.gvm/pkgsets/go1.22.2/global/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/Users/patryk.malek@konghq.com/.gvm/pkgsets/go1.22.2/global/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227
Setting both limits and requests too high would still create the Deployment and then its Pods would get into a Pending state.
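Something like this in the GatewayConfiguration would trigger the error above (a sketch only; field paths per the v1beta1 API, values are just examples):
# fragment of a GatewayConfiguration spec
spec:
  controlPlaneOptions:
    deployment:
      podTemplateSpec:
        spec:
          containers:
          - name: controller
            resources:
              requests:
                memory: "16000Mi"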
Can you look into the status fields again (Gateway, ControlPlane or DataPlane), enable the debug logs (--set env.zap_log_level=2) and resend those for the ControlPlane whose Deployment fails to get created?
For reference, a working object hierarchy under a Gateway:
NAMESPACE NAME READY REASON AGE
default Gateway/kong True Ready 132m
default ├─ControlPlane/kong-vs42s True Ready 130m
default │ ├─Deployment/controlplane-kong-vs42s-9pj95 - 130m
default │ │ ├─ReplicaSet/controlplane-kong-vs42s-9pj95-64957d6f86 - 130m
default │ │ └─ReplicaSet/controlplane-kong-vs42s-9pj95-7ccbd87d9 - 30s
default │ │ └─Pod/controlplane-kong-vs42s-9pj95-7ccbd87d9-zltmk True 30s
default │ ├─Secret/controlplane-kong-vs42s-485pr - 130m
default │ ├─Secret/controlplane-kong-vs42s-w2grp - 130m
default │ ├─Service/controlplane-webhook-kong-vs42s-vzqbc - 130m
default │ │ └─EndpointSlice/controlplane-webhook-kong-vs42s-vzqbc-sbptb - 130m
default │ └─ServiceAccount/controlplane-kong-vs42s-929sj - 130m
default ├─DataPlane/kong-58h4l True Ready 132m
default │ ├─Deployment/dataplane-kong-58h4l-c4lpq - 132m
default │ │ └─ReplicaSet/dataplane-kong-58h4l-c4lpq-55ddd75fc7 - 132m
default │ │ ├─Pod/dataplane-kong-58h4l-c4lpq-55ddd75fc7-7tnmb True 132m
default │ │ └─Pod/dataplane-kong-58h4l-c4lpq-55ddd75fc7-swqk5 True 132m
default │ ├─Secret/dataplane-kong-58h4l-rk5xh - 132m
default │ ├─Service/dataplane-admin-kong-58h4l-ftrx5 - 132m
default │ │ └─EndpointSlice/dataplane-admin-kong-58h4l-ftrx5-g94wt - 132m
default │ └─Service/dataplane-ingress-kong-58h4l-wt2zk - 132m
default │ └─EndpointSlice/dataplane-ingress-kong-58h4l-wt2zk-7jqb8 - 132m
default └─NetworkPolicy/kong-58h4l-limit-admin-api-dlsll - 132m
@pmalek There are some interesting warnings in the events:
61s Normal KongConfigurationSucceeded pod/controlplane-gw1-xtxhp-4ps99-65864757d8-cj9z2 successfully applied Kong configuration to https://10-0-4-37.dataplane-admin-gw1-5khtj-n4cv4.default.svc:8444
55s Normal EnsuringLoadBalancer service/dataplane-ingress-gw2-22pwd-7bj6m Ensuring load balancer
55s Normal Scheduled pod/dataplane-gw2-22pwd-dvczf-f5948c849-t86ws Successfully assigned default/dataplane-gw2-22pwd-dvczf-f5948c849-t86ws to aks-default-42903236-vmss000002
55s Warning FailedToCreateEndpoint endpoints/dataplane-ingress-gw2-22pwd-7bj6m Failed to create endpoint for service default/dataplane-ingress-gw2-22pwd-7bj6m: endpoints "dataplane-ingress-gw2-22pwd-7bj6m" already exists
55s Normal SuccessfulCreate replicaset/dataplane-gw2-22pwd-dvczf-f5948c849 Created pod: dataplane-gw2-22pwd-dvczf-f5948c849-t86ws
55s Normal ScalingReplicaSet deployment/dataplane-gw2-22pwd-dvczf Scaled up replica set dataplane-gw2-22pwd-dvczf-f5948c849 to 1
54s Normal Created pod/dataplane-gw2-22pwd-dvczf-f5948c849-t86ws Created container proxy
54s Normal Started pod/dataplane-gw2-22pwd-dvczf-f5948c849-t86ws Started container proxy
54s Normal Pulled pod/dataplane-gw2-22pwd-dvczf-f5948c849-t86ws Container image "kong:3.6.1" already present on machine
54s Warning OwnerRefInvalidNamespace clusterrolebinding/controlplane-gw2-zwk5n-sktws ownerRef [gateway-operator.konghq.com/v1beta1/ControlPlane, namespace: , name: gw2-zwk5n, uid: 3ec3d543-e2b4-46fa-a91b-1806b3473153] does not exist in namespace ""
54s Warning OwnerRefInvalidNamespace clusterrole/gw2-zwk5n-5h7tn ownerRef [gateway-operator.konghq.com/v1beta1/ControlPlane, namespace: , name: gw2-zwk5n, uid: 3ec3d543-e2b4-46fa-a91b-1806b3473153] does not exist in namespace ""
53s Warning OwnerRefInvalidNamespace validatingwebhookconfiguration/gw2-zwk5n ownerRef [gateway-operator.konghq.com/v1beta1/ControlPlane, namespace: , name: gw2-zwk5n, uid: 3ec3d543-e2b4-46fa-a91b-1806b3473153] does not exist in namespace ""
45s Normal EnsuredLoadBalancer service/dataplane-ingress-gw2-22pwd-7bj6m Ensured load balancer
4s Warning Unhealthy pod/dataplane-gw2-22pwd-dvczf-f5948c849-t86ws Readiness probe failed: HTTP probe failed with statuscode: 503
The nodes look like they have plenty of headroom:
k top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
aks-default-42903236-vmss000002 620m 32% 1534Mi 28%
aks-default-42903236-vmss000003 90m 4% 1839Mi 34%
Attached you'll find the trace logs of a gateway creation which is stuck: kgo.txt
The OwnerRefInvalidNamespace issue is tracked in #72. FailedToCreateEndpoint is interesting 🤔
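If the cluster is still around, it might be worth checking whether there's a pre-existing Endpoints object that nothing owns, e.g. (name taken from your events output):
kubectl get endpoints -n default dataplane-ingress-gw2-22pwd-7bj6m -o yaml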
In any case the logs indicate a similar problem as before: a perpetual patch to the ValidatingWebhookConfiguration. We've had issues in the past where the operator code would not fill in the defaults, which would cause this cycle (comparing the generated resource with the already existing one would always yield a non-empty diff), but that is covered now in https://github.com/Kong/gateway-operator/blob/a81e1218cc5007a1d0bd2ac69244141200e51cee/pkg/utils/kubernetes/resources/zz_generated_kic_validatingwebhookconfig.go#L63.
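One way to see who keeps touching the webhook config is to look at its managed fields, e.g. (object name taken from your earlier events output):
kubectl get validatingwebhookconfiguration gw2-zwk5n -o yaml --show-managed-fields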
When I find more time I can try spinning my own Azure cluster for testing.
@MatthiasWinzeler #239 is the issue that you've hit. Let's move the discussion there.
@pmalek I am very glad to hear you found the issue. Let me know if I can test something!
Current Behavior
We want to deploy two Gateways using the Gateway operator. While the first gateway is programmed successfully, the second gateway never comes up and seems to be stuck somewhere.
Expected Behavior
Both Gateways are programmed successfully.
Steps To Reproduce
We are following the Getting started documentation: https://docs.konghq.com/gateway-operator/latest/get-started/kic/install/
More precisely, these are the steps we run, which can be used to reproduce the problem:
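The operator install follows the linked guide, roughly along these lines (chart, release and namespace names as in the guide; exact values may differ):
helm repo add kong https://charts.konghq.com
helm repo update
helm install kgo kong/gateway-operator -n kong-system --create-namespace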
Then, we create the first gateway:
This gateway comes up successfully:
Then, we create the second gateway:
However, this gateway never comes up:
The data plane and control plane objects seem to not be provisioned properly:
The logs show the following:
It complains about not finding any ingress services, but the service is clearly there - maybe some bug in the operator?
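To double check, the ingress Service for the stuck DataPlane can be listed with something like:
kubectl get services -n default | grep dataplane-ingress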
Operator Version
Tried with image.tag=1.2 (as described in the getting started) and also tried the latest:
--set image.tag=1.2.3 --set image.repository=docker.io/kong/gateway-operator-oss
but the issue remains the same.
kubectl version
We are running on AKS 1.28 with Cilium (which also provides a GW, maybe that interferes).