banzaicloud / koperator

Oh no! Yet another Apache Kafka operator for Kubernetes
Apache License 2.0

http2 webhooks cause big issues in Kubernetes #359

Closed AceHack closed 2 years ago

AceHack commented 4 years ago

Describe the bug: There is a known bug in Kubernetes (well, in Go, actually) that causes a lot of instability with HTTP/2 webhooks. See the related issue for more details: https://github.com/kubernetes/kubernetes/issues/80313

Steps to reproduce the issue: Install the Kafka operator. Reboot the node the Kafka webhook is running on. EKS then takes about 15-20 minutes to recover before the Kafka webhook can be used again.

Expected behavior: Kubernetes and the webhook recover in a few seconds, not several minutes.

Additional context: An easy fix would be to run the following command.

kubectl set env -n kafka deployment/kafka-knative-operator-kafka-operator-operator GODEBUG=http2server=0

This command stops Go code from using HTTP/2 for its server. It works and fixes many other webhooks, such as those of Istio and Knative, but when I run it on this operator I start getting errors on every webhook invocation.

The error is Unexpected EOF

Please update the code to allow disabling HTTP/2 on webhooks.

baluchicken commented 4 years ago

Hi, I tried to reproduce your error, and I am not sure if I succeeded.


I used your command to set the environment variable.

I killed the node where the operator was running, so it got rescheduled to a different one. I checked the logs and I do see the error Unexpected EOF, but only once, and it's coming from leaderelection.go

leaderelection.go:331] error retrieving resource lock kafka/controller-leader-election-helper: Get https://10.10.0.1:443/api/v1/namespaces/kafka/configmaps/controller-leader-election-helper: unexpected EOF

As far as I can tell this error does not affect the operator.
It will try to acquire this lease once again, and for me it succeeded.

successfully acquired lease kafka/controller-leader-election-helper

After the restart I also applied multiple KafkaTopic CRs, which eventually caused webhook invocations, but everything succeeded for me.

Can you please share the whole log from the operator, so we can help you with the investigation?

AceHack commented 4 years ago

I'll try and reproduce but it happens for me continuously whenever I try to create or delete topics. They always fail.

AceHack commented 4 years ago

I keep getting this error over and over in a loop when trying to create a topic

 manager 2020-04-28T01:19:42.718Z    DEBUG    controllers.KafkaCluster    Reconciling    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knative-clust
 manager 2020-04-28T01:19:42.718Z    DEBUG    controllers.KafkaCluster    Reconciled    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knative-cluste
 manager 2020-04-28T01:19:42.718Z    DEBUG    controllers.KafkaCluster    Reconciling    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knative-clust
 manager 2020-04-28T01:19:42.718Z    DEBUG    controllers.KafkaCluster    Reconciled    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knative-cluste
 manager 2020-04-28T01:19:42.718Z    DEBUG    controllers.KafkaCluster    Reconciling    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knative-clust
 manager 2020-04-28T01:19:42.718Z    DEBUG    controllers.KafkaCluster    resource is in sync    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knati
 manager 2020-04-28T01:19:42.718Z    DEBUG    controllers.KafkaCluster    Reconciled    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knative-cluste
 manager 2020-04-28T01:19:42.718Z    DEBUG    controllers.KafkaCluster    Reconciling    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knative-clust
 manager 2020-04-28T01:19:42.719Z    DEBUG    controllers.KafkaCluster    resource is in sync    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knati
 manager 2020-04-28T01:19:42.719Z    DEBUG    controllers.KafkaCluster    Reconciled    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knative-cluste
 manager 2020-04-28T01:19:42.719Z    DEBUG    controllers.KafkaCluster    Reconciling    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knative-clust
 manager 2020-04-28T01:19:42.728Z    INFO    controllers.KafkaCluster    Kafka cluster state updated    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafk
 manager 2020-04-28T01:19:42.728Z    DEBUG    controllers.KafkaCluster    resource is in sync    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knati
 manager 2020-04-28T01:19:42.728Z    DEBUG    controllers.KafkaCluster    resource is in sync    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knati
 manager 2020-04-28T01:19:42.729Z    DEBUG    controllers.KafkaCluster    searching with label because name is empty    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Requ
 manager 2020-04-28T01:19:42.729Z    DEBUG    controllers.KafkaCluster    resource is in sync    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knati
 manager 2020-04-28T01:19:42.729Z    DEBUG    controllers.KafkaCluster    searching with label because name is empty    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Requ
 manager 2020-04-28T01:19:42.729Z    DEBUG    controllers.KafkaCluster    resource is in sync    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knati
 manager 2020-04-28T01:19:42.729Z    DEBUG    controllers.KafkaCluster    searching with label because name is empty    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Requ
 manager 2020-04-28T01:19:42.729Z    DEBUG    controllers.KafkaCluster    resource is in sync    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knati
 manager 2020-04-28T01:19:42.730Z    DEBUG    controllers.KafkaCluster    resource is in sync    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knati
 manager 2020-04-28T01:19:42.730Z    DEBUG    controllers.KafkaCluster    searching with label because name is empty    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Requ
 manager 2020-04-28T01:19:42.755Z    INFO    controllers.KafkaCluster    Kafka cluster state updated    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafk
 manager 2020-04-28T01:19:42.759Z    DEBUG    controllers.KafkaCluster    resource is in sync    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knati
 manager 2020-04-28T01:19:42.813Z    DEBUG    controllers.KafkaCluster    resource is in sync    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knati
 manager 2020-04-28T01:19:42.813Z    DEBUG    controllers.KafkaCluster    searching with label because name is empty    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Requ
 manager 2020-04-28T01:19:42.845Z    INFO    controllers.KafkaCluster    Kafka cluster state updated    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafk
 manager 2020-04-28T01:19:42.848Z    DEBUG    controllers.KafkaCluster    resource is in sync    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knati
 manager 2020-04-28T01:19:42.905Z    DEBUG    controllers.KafkaCluster    resource is in sync    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knati
 manager 2020-04-28T01:19:42.905Z    DEBUG    controllers.KafkaCluster    searching with label because name is empty    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Requ
 manager 2020-04-28T01:19:42.934Z    INFO    controllers.KafkaCluster    Kafka cluster state updated    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafk
 manager 2020-04-28T01:19:42.938Z    DEBUG    controllers.KafkaCluster    resource is in sync    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knati
 manager 2020-04-28T01:19:43.031Z    DEBUG    controllers.KafkaCluster    Reconciled    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knative-cluste
 manager 2020-04-28T01:19:43.032Z    DEBUG    controllers.KafkaCluster    Reconciling    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knative-clust
 manager 2020-04-28T01:19:43.051Z    INFO    controllers.KafkaCluster    CR status updated    {"Request.Namespace": "2269-kafka-knative/kafka-knative-cluster", "Request.Name": "kafka-knative-
 manager 2020-04-28T01:19:43.051Z    INFO    controllers.KafkaCluster    could not create cruise control topic: Internal error occurred: failed calling webhook "kafkatopics.kafka.banzaicloud.
 manager 2020-04-28T01:19:43.051Z    ERROR    controller-runtime.controller    Reconciler error    {"controller": "KafkaCluster", "request": "2269-kafka-knative/kafka-knative-cluster", "error
 manager github.com/go-logr/zapr.(*zapLogger).Error
 manager     /go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128
 manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
 manager     /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:258
 manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
 manager     /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:232
 manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
 manager     /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:211
 manager k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
 manager     /go/pkg/mod/k8s.io/apimachinery@v0.17.3/pkg/util/wait/wait.go:152
 manager k8s.io/apimachinery/pkg/util/wait.JitterUntil
 manager     /go/pkg/mod/k8s.io/apimachinery@v0.17.3/pkg/util/wait/wait.go:153
 manager k8s.io/apimachinery/pkg/util/wait.Until
 manager     /go/pkg/mod/k8s.io/apimachinery@v0.17.3/pkg/util/wait/wait.go:88
AceHack commented 4 years ago

logs.txt

leader-us commented 3 years ago

{"level":"error","ts":"2020-11-16T03:29:39.941Z","logger":"controller","msg":"Reconciler error","reconcilerGroup":"kafka.banzaicloud.io","reconcilerKind":"KafkaCluster","controller":"KafkaCluster","name":"kafka","namespace":"default","error":"could not create cruise control topic: Internal error occurred: failed calling webhook \"kafkatopics.kafka.banzaicloud.io\": Post https://webhook-service.system.svc:443/validate?timeout=30s: service \"webhook-service\" not found","stacktrace":"github.com/go-logr/zapr.(zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:246\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller)

stoader commented 3 years ago

@leader-us can you provide the output of the following commands:

kubectl get svc -n kafka
kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io -lapp.kubernetes.io/instance=kafka-operator -o yaml
kubectl get pod -n kafka
leader-us commented 3 years ago

[root@localhost ~]# kubectl get svc -n kafka
NAME                             TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)    AGE
kafka-operator-alertmanager      ClusterIP   169.169.247.64    <none>        9001/TCP   4d20h
kafka-operator-webhook-service   ClusterIP   169.169.108.220   <none>        443/TCP    4d20h

kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io -lapp.kubernetes.io/instance=kafka-operator -o yaml
apiVersion: v1
items: []
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

kubectl get pod -n kafka
NAME                                                 READY   STATUS    RESTARTS   AGE
kafka-operator-controller-manager-7b89fc746f-87n4v   1/1     Running   0          4d20h

I found a webhook service in the namespace system, but no related pods in that namespace.

[root@localhost ~]# kubectl -n system get svc -o yaml
apiVersion: v1
items:

This service is defined in config/manifests.yaml and service.yaml.


apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  creationTimestamp: null
  name: validating-webhook-configuration
webhooks:

leader-us commented 3 years ago

{"level":"error","ts":"2020-11-16T09:02:14.701Z","logger":"setup","msg":"problem running manager","error":"open /etc/webhook/certs/tls.crt: no such file or directory","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\nmain.main\n\t/workspace/main.go:178\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}

Latest error: it can't find the cert.

[root@localhost ~]# kubectl get all -n system
NAME                                     READY   STATUS             RESTARTS   AGE
pod/controller-manager-b977f57d5-mzwwq   0/1     CrashLoopBackOff   4          3m57s

NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/webhook-service   ClusterIP   169.169.74.84   <none>        443/TCP   64m

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/controller-manager   0/1     1            0           36m

NAME                                           DESIRED   CURRENT   READY   AGE
replicaset.apps/controller-manager-b977f57d5   1         1         0       36m

stoader commented 3 years ago

@leader-us your kafka-operator deployment seems to have an incorrect config. How did you deploy kafka-operator?

leader-us commented 3 years ago

In the directory config/overlays/certmanager-enabled I ran the following command to install the operator:

kubectl apply -f -k .

[root@localhost certmanager-enabled]# kubectl get pods --all-namespaces
NAMESPACE      NAME                                          READY   STATUS    RESTARTS   AGE
cert-manager   cert-manager-9b8969d86-wvhvg                  1/1     Running   0          5d1h
cert-manager   cert-manager-cainjector-8545fdf87c-r8xtm      1/1     Running   0          5d1h
cert-manager   cert-manager-webhook-8c5db9fb6-8jskt          1/1     Running   0          5d1h
default        prometheus-operator-86b9f8646b-l2wwh          1/1     Running   0          5d
default        zk-with-istio-0                               1/1     Running   1          5d7h
default        zk-with-istio-1                               0/1     Running   2          3m50s
default        zk-with-istio-2                               1/1     Running   2          5d2h
default        zookeeper-operator-8fd88c877-rrzhk            1/1     Running   1          5d7h
kafka          certman-controller-manager-6957f6c9ff-6sh9j   1/1     Running   0          108s
kube-system    calico-kube-controllers-5487f898d7-vt4mx      1/1     Running   8          221d
kube-system    calico-node-9pdpj                             1/1     Running   7          221d
kube-system    coredns-68c75b6549-bnmbm                      1/1     Running   7          460d

stoader commented 3 years ago

@leader-us I'd suggest starting with a new K8s cluster and using Helm to deploy the Kafka operator (https://banzaicloud.com/docs/supertubes/kafka-operator/install-kafka-operator/#kafka-operator-helm), as there might be an issue with the kustomize files.

leader-us commented 3 years ago

[root@localhost samples]# kubectl apply -f example-topic.yaml
Error from server (InternalError): error when creating "example-topic.yaml": Internal error occurred: failed calling webhook "kafkatopics.kafka.banzaicloud.io": Post https://kafka-operator-webhook-service.kafka.svc:443/validate?timeout=30s: x509: certificate is valid for .kafka-headless.kafka.svc.cluster.local, kafka-headless, .kafka-headless, kafka-headless.kafka, not kafka-operator-webhook-service.kafka.svc

stoader commented 3 years ago

@leader-us can you describe the exact steps you followed to deploy kafka-operator using Helm?

leader-us commented 3 years ago

I followed your Helm install steps, but hit an error again:

{"level":"info","ts":"2020-11-17T02:18:32.635Z","logger":"controllers.KafkaCluster","msg":"could not create cruise control topic: Internal error occurred: failed calling webhook \"kafkatopics.kafka.banzaicloud.io\": Post https://kafka-operator-webhook-service.kafka.svc:443/validate?timeout=30s: service \"kafka-operator-webhook-service\" not found","Request.Namespace":"default/kafka","Request.Name":"kafka"}

Install steps:

helm repo add banzaicloud-stable https://kubernetes-charts.banzaicloud.com/

Using helm3:

helm install kafka-operator --namespace=kafka banzaicloud-stable/kafka-operator

kubectl create -n kafka -f config/samples/simplekafkacluster.yaml

If the Prometheus operator is installed, create the ServiceMonitors:

kubectl create -n kafka -f config/samples/kafkacluster-prometheus.yaml

[root@localhost ~]# kubectl get svc -n kafka
NAME                          TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)    AGE
alertmanager                  ClusterIP   169.169.61.209    <none>        9001/TCP   17h
certman-alertmanager          ClusterIP   169.169.95.249    <none>        9001/TCP   16h
certman-webhook-service       ClusterIP   169.169.194.136   <none>        443/TCP    16h
kafka-operator-alertmanager   ClusterIP   169.169.6.194     <none>        9001/TCP   38m
kafka-operator-authproxy      ClusterIP   169.169.215.71    <none>        8443/TCP   38m
kafka-operator-operator       ClusterIP   169.169.59.232    <none>        443/TCP    38m

[root@localhost ~]# kubectl get pods -n kafka
NAME                                       READY   STATUS             RESTARTS   AGE
kafka-operator-operator-84c748cb5c-pk8mp   2/2     Running            0          26m
prometheus-kafka-prometheus-0              1/2     ImagePullBackOff   0          15m

[root@localhost ~]# kubectl get pod
NAME                                   READY   STATUS    RESTARTS   AGE
kafka-0-2z2tm                          1/1     Running   0          21m
kafka-1-8v6f2                          1/1     Running   0          21m
kafka-2-frfbg                          1/1     Running   0          21m
prometheus-operator-86b9f8646b-l2wwh   1/1     Running   0          5d17h
zk-with-istio-0                        1/1     Running   1          6d
zk-with-istio-1                        1/1     Running   48         15h
zk-with-istio-2                        1/1     Running   2          5d19h
zookeeper-operator-8fd88c877-rrzhk     1/1     Running   1          6d1h

kubectl logs kafka-0-2z2tm

[2020-11-17 02:37:19,716] WARN [Producer clientId=CruiseControlMetricsReporter] Error while fetching metadata with correlation id 12399 : {__CruiseControlMetrics=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient)
[2020-11-17 02:37:19,817] WARN [Producer clientId=CruiseControlMetricsReporter] Error while fetching metadata with correlation id 12400 : {__CruiseControlMetrics=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient)
[root@localhost ~]# ^C

leader-us commented 3 years ago

I created the missing webhook service:

apiVersion: v1
kind: Service
metadata:
  name: kafka-operator-webhook-service
  namespace: kafka
spec:
  ports:

But now I get a cert error!

{"level":"info","ts":"2020-11-17T02:48:22.129Z","logger":"controllers.KafkaCluster","msg":"could not create cruise control topic: Internal error occurred: failed calling webhook \"kafkatopics.kafka.banzaicloud.io\": Post https://kafka-operator-webhook-service.kafka.svc:443/validate?timeout=30s: x509: certificate is valid for kafka-operator-operator.kafka.svc.cluster.local, kafka-operator-operator.kafka.svc, not kafka-operator-webhook-service.kafka.svc","Request.Namespace":"default/kafka","Request.Name":"kafka"}
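The x509 error above means the service name the API server dials is not among the certificate's subject alternative names (SANs). A minimal Go sketch of that check follows, using a throwaway self-signed certificate built in memory with the two SANs quoted in the log. The helpers `selfSignedCert` and `coversHost` are hypothetical names for illustration; the operator's real certificate is issued elsewhere and is not reproduced here.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSignedCert (hypothetical helper) builds a throwaway certificate
// that carries the given DNS names as SANs.
func selfSignedCert(dnsNames []string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "example"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     dnsNames,
	}
	// Self-signed: the template acts as its own parent.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// coversHost reports whether the certificate's SANs match host.
func coversHost(cert *x509.Certificate, host string) bool {
	return cert.VerifyHostname(host) == nil
}

func main() {
	// SANs copied from the x509 error in the log above.
	cert, err := selfSignedCert([]string{
		"kafka-operator-operator.kafka.svc.cluster.local",
		"kafka-operator-operator.kafka.svc",
	})
	if err != nil {
		panic(err)
	}
	for _, host := range []string{
		"kafka-operator-operator.kafka.svc",
		"kafka-operator-webhook-service.kafka.svc",
	} {
		fmt.Printf("%s covered: %v\n", host, coversHost(cert, host))
	}
}
```

Under these assumptions, a manually created Service with a different name cannot pass TLS verification unless the certificate is reissued with that name, or the webhook configuration points at a name the certificate already covers.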

leader-us commented 3 years ago

configurationState: ConfigInSync
gracefulActionState:
  cruiseControlState: GracefulUpscaleSucceeded
  errorMessage: CruiseControl not yet ready
rackAwarenessState: ""
cruiseControlTopicStatus: CruiseControlTopicNotReady

baluchicken commented 2 years ago

Closing this since it has been stale for a while; please reopen if it recurs.