twuyts opened this issue 1 year ago
@twuyts - The default webhook server port has been changed from 443 to 9443 in the Koperator implementation by https://github.com/banzaicloud/koperator/pull/912. Therefore, to successfully upgrade to the latest version, the webhook port in validatingwebhookconfigurations/kafka-operator-validating-webhook should be updated accordingly.
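A minimal sketch of what that update could look like, assuming the default resource name from the chart and that the relevant webhook sits at index 0 (check the first command's output and adjust before patching):
kubectl get validatingwebhookconfiguration kafka-operator-validating-webhook \
  -o jsonpath='{.webhooks[*].clientConfig.service.port}'
kubectl patch validatingwebhookconfiguration kafka-operator-validating-webhook \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/clientConfig/service/port", "value": 9443}]'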
So this is an issue in the helm chart then?
No, scratch that. I spoke too soon.
My bad - I just took a closer look into the deployment manifests, and it looks like we only changed the target port of the webhook server; the validatingwebhookconfigurations resource uses the service port (443) that you linked to send the requests to the webhook server. So the change in #912 shouldn't cause the issue.
@twuyts Can you provide the steps that you took to perform the upgrade? I can try and see if I can reproduce the issue.
edit: I was suspecting you might need to manually update the Service so the request can go to the corresponding named targetPort webhook-server.
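A quick way to check that mapping (a sketch only; service name and namespace are assumed from the chart defaults):
kubectl -n kafka get svc kafka-operator-operator \
  -o jsonpath='{.spec.ports[?(@.port==443)].targetPort}'
# should print webhook-server (or 9443), i.e. service port 443 forwards to the new webhook port
kubectl -n kafka get endpoints kafka-operator-operator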
The upgrade is managed through the helm-controller, part of Flux, the solution we use for continuous delivery. Basically, what we did was update the Kubernetes manifest for the koperator CRDs and bump the version of the helm chart from v0.24.1 to v0.25.1 in our git repository. This is picked up automatically by the Flux helm-controller running on the k8s cluster, which then does the upgrade. Unfortunately, I have no idea exactly how the helm-controller does that.
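For reference, the change amounts to roughly this kind of version bump in the HelmRelease that the helm-controller reconciles. Names, namespace, and API version here are illustrative rather than our exact manifests, and in practice the file is committed to git and applied by Flux; the kubectl apply is only to keep the sketch self-contained:
cat <<'EOF' | kubectl apply -f -
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kafka-operator
  namespace: kafka
spec:
  interval: 5m
  chart:
    spec:
      chart: kafka-operator
      version: 0.25.1        # bumped from 0.24.1
      sourceRef:
        kind: HelmRepository
        name: banzaicloud-stable
EOF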
Seeing similar errors after upgrading to v0.25.1.
We use the helm chart:
NAME CHART VERSION APP VERSION
banzaicloud-stable/kafka-operator 0.25.1 v0.25.1
We have an istio-proxy sidecar in the operator pod. istio-proxy version: banzaicloud istio-proxyv2:1.15.0.
Operator logs have entries like:
{"level":"error","ts":"2023-10-04T10:06:44.405Z","msg":"Reconciler error","controller":"KafkaCluster","controllerGroup":"kafka.banzaicloud.io","controllerKind":"KafkaCluster","KafkaCluster":{"name":"sample-svc-kafka","namespace":"cloud"},"namespace":"cloud","name":"sample-svc-kafka","reconcileID":"a31ff20c-91bf-4e85-a71a-85a0d1d57917","error":"Internal error occurred: failed calling webhook \"kafkaclusters.kafka.banzaicloud.io\": failed to call webhook: Post \"https://kafka-operator-operator.kafka.svc:443/validate-kafka-banzaicloud-io-v1beta1-kafkacluster?timeout=10s\": context deadline exceeded","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235"}
For us, using PERMISSIVE mode is not acceptable, only STRICT, but that doesn't seem to be the issue, because a connection to the webhook from any other pod (with istio) is successful:
~ $ curl -k -XPOST -vvv https://kafka-operator-operator.kafka.svc:443/validate-kafka-banzaicloud-io-v1beta1-kafkacluster?timeout=10s
* processing: https://kafka-operator-operator.kafka.svc:443/validate-kafka-banzaicloud-io-v1beta1-kafkacluster?timeout=10s
* Trying 10.132.41.140:443...
* Connected to kafka-operator-operator.kafka.svc (10.132.41.140) port 443
* ALPN: offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN: server accepted h2
* Server certificate:
* subject: CN=kafka-operator-operator.kafka.svc
* start date: Aug 28 10:40:28 2023 GMT
* expire date: Aug 27 10:40:28 2024 GMT
* issuer: CN=kafka-operator-ca
* SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* using HTTP/2
* h2 [:method: POST]
* h2 [:scheme: https]
* h2 [:authority: kafka-operator-operator.kafka.svc]
* h2 [:path: /validate-kafka-banzaicloud-io-v1beta1-kafkacluster?timeout=10s]
* h2 [user-agent: curl/8.2.1]
* h2 [accept: */*]
* Using Stream ID: 1
> POST /validate-kafka-banzaicloud-io-v1beta1-kafkacluster?timeout=10s HTTP/2
> Host: kafka-operator-operator.kafka.svc
> User-Agent: curl/8.2.1
> Accept: */*
>
< HTTP/2 200
< content-type: text/plain; charset=utf-8
< content-length: 128
< date: Wed, 04 Oct 2023 11:58:49 GMT
<
{"response":{"uid":"","allowed":false,"status":{"metadata":{},"message":"contentType=, expected application/json","code":400}}}
* Connection #0 to host kafka-operator-operator.kafka.svc left intact
So it doesn't seem to be an issue like https://github.com/istio/istio/issues/39290.
The same reconcile error occurs with PERMISSIVE mode.
Pod: kafka-operator-548fbb9fd4-vgdbt
Pod Revision: asm-managed
Pod Ports: 15090 (istio-proxy), 8443 (kube-rbac-proxy), 9443 (manager), 8080 (manager), 9001 (manager)
--------------------
Service: kafka-operator-alertmanager
Port: http-alerts 9001/HTTP targets pod port 9001
--------------------
Service: kafka-operator-authproxy
Port: https 8443/HTTPS targets pod port 8443
--------------------
Service: kafka-operator-operator
Port: https 443/HTTPS targets pod port 9443
--------------------
Effective PeerAuthentication:
Workload mTLS mode: PERMISSIVE
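(The summary above resembles the output of istioctl's pod describe; the invocation would be roughly the following, with the namespace assumed:)
istioctl -n kafka experimental describe pod kafka-operator-548fbb9fd4-vgdbt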
I also tried to exclude port 9443 with these annotations:
traffic.sidecar.istio.io/excludeOutboundPorts: "9443"
traffic.sidecar.istio.io/excludeInboundPorts: "9443"
No luck.
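For completeness, the annotations go on the operator pod template, roughly like this (deployment name is assumed from the pod name above, and the exact helm values path may differ; a sketch, not the exact change we made):
kubectl -n kafka patch deployment kafka-operator --type=merge -p '
spec:
  template:
    metadata:
      annotations:
        traffic.sidecar.istio.io/excludeInboundPorts: "9443"
        traffic.sidecar.istio.io/excludeOutboundPorts: "9443"
'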
Description
After upgrading from koperator v0.23.1 to v0.25.1, the validating webhook for kafkaclusters fails:
The same error is thrown when manually updating a kafkacluster resource.
Expected Behavior
The error should not occur.
Actual Behavior
A timeout error is thrown.
Affected Version
v0.25.1
Steps to Reproduce
kubectl -n kafka apply -f config/samples/simplekafkacluster.yaml
Error from server (InternalError): error when creating "tmp/cluster.yaml": Internal error occurred: failed calling webhook "kafkaclusters.kafka.banzaicloud.io": failed to call webhook: Post "https://kafka-operator-operator.kafka.svc:443/validate-kafka-banzaicloud-io-v1beta1-kafkacluster?timeout=10s": context deadline exceeded
I've checked the webhook config, and that looks fine:
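(The check is essentially dumping the configuration along these lines; the actual output is omitted here, and the resource name is assumed from the chart default:)
kubectl get validatingwebhookconfiguration kafka-operator-validating-webhook -o yaml
# relevant fields: webhooks[].clientConfig.service (name, namespace, port) and caBundle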
I tried to call the webhook from within a debug container attached to the koperator pod, using the CA shown above:
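(A rough sketch of that kind of call; the debug image, target container, CA file path, and pod selection below are illustrative, and the actual output is omitted here:)
POD=$(kubectl -n kafka get pods -o name | grep kafka-operator | head -n1)
kubectl -n kafka debug -it "$POD" --image=curlimages/curl --target=manager -- \
  curl -v --cacert /path/to/ca.crt -XPOST \
  "https://kafka-operator-operator.kafka.svc:443/validate-kafka-banzaicloud-io-v1beta1-kafkacluster?timeout=10s"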
For the time being, I disabled the webhook in the helm chart, so I am not blocked.
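(Disabling it boils down to a values override roughly like the one below; the release name and the exact values key are assumptions, so check the chart's values.yaml for your version:)
helm upgrade kafka-operator banzaicloud-stable/kafka-operator \
  --namespace kafka --version 0.25.1 \
  --set webhook.enabled=false    # assumed values key, verify against the chart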
Checklist
[X] I have read the contributing guidelines
[X] I have verified this does not duplicate an existing issue