banzaicloud / koperator

Oh no! Yet another Apache Kafka operator for Kubernetes

Validating webhook times out #1054

Open twuyts opened 1 year ago

twuyts commented 1 year ago

Description

After upgrading from koperator v0.23.1 to v0.25.1, the validating webhook for kafkaclusters fails:

{"level":"info","ts":"2023-08-29T14:33:08.513Z","msg":"Internal error occurred: failed calling webhook \"kafkaclusters.kafka.banzaicloud.io\": failed to call webhook: Post \"https://kafka-operator-operator.kafka.svc:443/validate-kafka-banzaicloud-io-v1beta1-kafkacluster?timeout=10s\": context deadline exceeded","controller":"KafkaCluster","controllerGroup":"kafka.banzaicloud.io","controllerKind":"KafkaCluster","KafkaCluster":{"name":"tt","namespace":"kafka"},"namespace":"kafka","name":"tt","reconcileID":"e000ec40-3da0-4f98-b202-d62126d22a10"}

The same error is thrown when manually updating a kafkacluster resource.

Expected Behavior

The error should not occur.

Actual Behavior

A timeout error is thrown.

Affected Version

v0.25.1

Steps to Reproduce

  1. kubectl -n kafka apply -f config/samples/simplekafkacluster.yaml

Error from server (InternalError): error when creating "tmp/cluster.yaml": Internal error occurred: failed calling webhook "kafkaclusters.kafka.banzaicloud.io": failed to call webhook: Post "https://kafka-operator-operator.kafka.svc:443/validate-kafka-banzaicloud-io-v1beta1-kafkacluster?timeout=10s": context deadline exceeded

panyuenlau commented 1 year ago

@twuyts - The default webhook server port was changed from 443 to 9443 in the Koperator implementation by https://github.com/banzaicloud/koperator/pull/912. Therefore, to successfully upgrade to the latest version, the webhook port in validatingwebhookconfigurations/kafka-operator-validating-webhook should be updated accordingly.
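
For example, a minimal sketch of that update, assuming the webhook entry sits at index 0 of the configuration (verify the current config first):

# Inspect the current webhook configuration
kubectl get validatingwebhookconfiguration kafka-operator-validating-webhook -o yaml

# Point the webhook's clientConfig at the new server port
kubectl patch validatingwebhookconfiguration kafka-operator-validating-webhook \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/clientConfig/service/port", "value": 9443}]'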

twuyts commented 1 year ago

Issue in the helm chart, then?

https://github.com/banzaicloud/koperator/blob/7b60ac029bb777eb5a78e3a3420dc6342aa298c5/charts/kafka-operator/templates/operator-service.yaml#L30

No, scratch that. I spoke too soon.

panyuenlau commented 1 year ago

> The default webhook server port was changed from 443 to 9443 in the Koperator implementation by #912. Therefore, to successfully upgrade to the latest version, the webhook port in validatingwebhookconfigurations/kafka-operator-validating-webhook should be updated accordingly.

My bad - I just took a closer look at the deployment manifests, and it looks like we only changed the target port of the webhook server; the validatingwebhookconfigurations still uses the service port (443) that you linked to send requests to the webhook server. So the change in #912 shouldn't cause the issue.
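
One way to double-check that wiring (a sketch using the resource names from this thread):

# Where does the webhook configuration send its requests?
kubectl get validatingwebhookconfiguration kafka-operator-validating-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.service}'

# How does the Service map its port to the pod?
kubectl -n kafka get service kafka-operator-operator -o jsonpath='{.spec.ports}'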

panyuenlau commented 1 year ago

@twuyts Can you provide the steps that you took to perform the upgrade? I can try and see if I can reproduce the issue.

Edit: I suspected you might need to manually update the Service so the requests can reach the corresponding named targetPort, webhook-server.
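
Roughly what I'd expect the Service to look like after that update (a sketch based on the chart template linked above; the named targetPort must match the manager container's webhook-server port, 9443):

apiVersion: v1
kind: Service
metadata:
  name: kafka-operator-operator
  namespace: kafka
spec:
  ports:
    - name: https
      port: 443                  # port called by the ValidatingWebhookConfiguration
      protocol: TCP
      targetPort: webhook-server # named container port on the manager, resolves to 9443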

twuyts commented 1 year ago

The upgrade is managed through the helm-controller, part of Flux, the solution we use for continuous delivery. Basically, what we did was update the Kubernetes manifest for the koperator CRDs and bump the version of the helm chart from v0.24.1 to v0.25.1 in our git repository. This is picked up automatically by the Flux helm-controller running on the k8s cluster, which then performs the upgrade. Unfortunately, I have no idea exactly how the helm-controller does that.
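
For reference, the bump looked roughly like this in the HelmRelease (a sketch with illustrative names, not our exact manifest):

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kafka-operator
  namespace: kafka
spec:
  interval: 10m
  chart:
    spec:
      chart: kafka-operator
      version: "0.25.1"  # bumped from 0.24.1
      sourceRef:
        kind: HelmRepository
        name: banzaicloud-stable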

vitalii-buchyn-exa commented 1 year ago

Seeing similar errors after upgrading to v0.25.1.

We use helm chart:

NAME                                                CHART VERSION   APP VERSION
banzaicloud-stable/kafka-operator                   0.25.1          v0.25.1

We have an istio-proxy sidecar in the operator pod. istio-proxy version: banzaicloud istio-proxyv2:1.15.0

Operator logs have entries like:

{"level":"error","ts":"2023-10-04T10:06:44.405Z","msg":"Reconciler error","controller":"KafkaCluster","controllerGroup":"kafka.banzaicloud.io","controllerKind":"KafkaCluster","KafkaCluster":{"name":"sample-svc-kafka","namespace":"cloud"},"namespace":"cloud","name":"sample-svc-kafka","reconcileID":"a31ff20c-91bf-4e85-a71a-85a0d1d57917","error":"Internal error occurred: failed calling webhook \"kafkaclusters.kafka.banzaicloud.io\": failed to call webhook: Post \"https://kafka-operator-operator.kafka.svc:443/validate-kafka-banzaicloud-io-v1beta1-kafkacluster?timeout=10s\": context deadline exceeded","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235"}

For us, using PERMISSIVE mode is not acceptable, only STRICT, but that doesn't seem to be the issue, because a connection to the webhook from any other pod (with Istio) is successful:

~ $ curl -k -XPOST -vvv https://kafka-operator-operator.kafka.svc:443/validate-kafka-banzaicloud-io-v1beta1-kafkacluster?timeout=10s
* processing: https://kafka-operator-operator.kafka.svc:443/validate-kafka-banzaicloud-io-v1beta1-kafkacluster?timeout=10s
*   Trying 10.132.41.140:443...
* Connected to kafka-operator-operator.kafka.svc (10.132.41.140) port 443
* ALPN: offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=kafka-operator-operator.kafka.svc
*  start date: Aug 28 10:40:28 2023 GMT
*  expire date: Aug 27 10:40:28 2024 GMT
*  issuer: CN=kafka-operator-ca
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* using HTTP/2
* h2 [:method: POST]
* h2 [:scheme: https]
* h2 [:authority: kafka-operator-operator.kafka.svc]
* h2 [:path: /validate-kafka-banzaicloud-io-v1beta1-kafkacluster?timeout=10s]
* h2 [user-agent: curl/8.2.1]
* h2 [accept: */*]
* Using Stream ID: 1
> POST /validate-kafka-banzaicloud-io-v1beta1-kafkacluster?timeout=10s HTTP/2
> Host: kafka-operator-operator.kafka.svc
> User-Agent: curl/8.2.1
> Accept: */*
>
< HTTP/2 200
< content-type: text/plain; charset=utf-8
< content-length: 128
< date: Wed, 04 Oct 2023 11:58:49 GMT
<
{"response":{"uid":"","allowed":false,"status":{"metadata":{},"message":"contentType=, expected application/json","code":400}}}
* Connection #0 to host kafka-operator-operator.kafka.svc left intact

So it doesn't seem to be an issue like https://github.com/istio/istio/issues/39290

vitalii-buchyn-exa commented 11 months ago

The same reconcile error occurs with PERMISSIVE mode:

Pod: kafka-operator-548fbb9fd4-vgdbt
   Pod Revision: asm-managed
   Pod Ports: 15090 (istio-proxy), 8443 (kube-rbac-proxy), 9443 (manager), 8080 (manager), 9001 (manager)
--------------------
Service: kafka-operator-alertmanager
   Port: http-alerts 9001/HTTP targets pod port 9001
--------------------
Service: kafka-operator-authproxy
   Port: https 8443/HTTPS targets pod port 8443
--------------------
Service: kafka-operator-operator
   Port: https 443/HTTPS targets pod port 9443
--------------------
Effective PeerAuthentication:
   Workload mTLS mode: PERMISSIVE

Also tried excluding port 9443 with these sidecar annotations:

traffic.sidecar.istio.io/excludeOutboundPorts: "9443"
traffic.sidecar.istio.io/excludeInboundPorts: "9443"

No luck.