GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.

Getting "context deadline exceeded" error on EKS cluster #399

Open vinaykw opened 3 years ago

vinaykw commented 3 years ago

I am trying to deploy Flink on an AWS EKS cluster. The cluster does not have a dedicated master node, as it is an Amazon-managed cluster. I have successfully deployed the flink-operator chart using Helm. Next I tried deploying the flink-session-cluster chart, and I am getting the error below:

```
helm install flink-session flink/flink-session-cluster/ -f version.yaml --debug
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /home/centos/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /home/centos/.kube/config
install.go:172: [debug] Original chart version: ""
install.go:189: [debug] CHART PATH: /opt/eva/helm/charts/flink/flink-session-cluster

client.go:122: [debug] creating 5 resource(s)
Error: Internal error occurred: failed calling webhook "mflinkcluster.flinkoperator.k8s.io": Post https://flink-operator-webhook-service.default.svc:443/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster?timeout=30s: context deadline exceeded
helm.go:81: [debug] Internal error occurred: failed calling webhook "mflinkcluster.flinkoperator.k8s.io": Post https://flink-operator-webhook-service.default.svc:443/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster?timeout=30s: context deadline exceeded
```
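
A sensible first check when a webhook call times out is whether the webhook pod is running at all and whether the operator logged any startup or TLS errors. A minimal sketch; the deployment name below is a guess and may differ in your install, and the namespace is taken from the error message:

```sh
# Is the operator pod up and ready?
kubectl get pods -A | grep flink-operator

# Check the operator logs for startup or certificate errors
# (hypothetical deployment name; substitute the one listed above).
kubectl logs deploy/flink-operator-controller-manager -n default
```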

One observation I had is that there is no communication (ping does not work) between the node (the AWS launch-pad machine) I am using to deploy the session-cluster chart and the webhook service flink-operator-webhook-service.default.svc.
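
Note that this ping test is expected to fail even on a healthy cluster: the Service name flink-operator-webhook-service.default.svc only resolves inside the cluster's DNS, and the webhook request that times out originates from the EKS control plane, not from your workstation. An in-cluster probe is a more meaningful test; a minimal sketch, with the URL taken from the error message above:

```sh
# Probe the webhook from inside the cluster. Any HTTP response (even an
# error) proves the network path works; a timeout reproduces the problem.
kubectl run webhook-probe --rm -it --restart=Never --image=curlimages/curl -- \
  curl -kv --max-time 5 \
  https://flink-operator-webhook-service.default.svc:443/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster
```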

I have successfully deployed the Flink operator and a Flink cluster on a non-AWS k8s cluster, and there I can see that this communication works.

Can someone help me figure out what the issue is?

lliknart commented 1 year ago

Hello @vinaykw, did you resolve this issue? (more than one year later ^^)

emmanuelCarre commented 1 year ago

Hello,

I ran into a similar issue with another operator... Maybe this could help:

Take a look at the security groups (with terraform-aws-eks, they will be named xxxx-cluster and xxxx-node). By default, ports 443 and 10250 are allowed; if your pod doesn't listen on one of the allowed ports, its traffic will be blocked (even if the k8s API server calls the Service URL on port 443, since the Service forwards to the pod's own port). To check which port is in use, run kubectl get endpoints -n <operator namespace>. You can read more about Endpoints in the Kubernetes documentation.
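
A concrete sketch of that check plus the corresponding fix, using the namespace from the error message above; the security-group IDs are placeholders, and 9443 is only the common kubebuilder webhook port, so substitute whatever port kubectl actually reports:

```sh
# 1. Find the pod port that backs the webhook Service.
kubectl get endpoints flink-operator-webhook-service -n default

# 2. If that port is not one of the allowed ones (443/10250), open it from
#    the cluster security group to the node security group. The SG IDs are
#    placeholders; replace 9443 with the port reported above.
aws ec2 authorize-security-group-ingress \
  --group-id <node-security-group-id> \
  --protocol tcp \
  --port 9443 \
  --source-group <cluster-security-group-id>
```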

Some operators, like Prometheus or cert-manager, don't have this issue because their validating webhooks listen on port 10250.