GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
Apache License 2.0

flink-operator-controller-manager pods in CrashLoopBackOff status #375

Open yanghui16355 opened 3 years ago

yanghui16355 commented 3 years ago

I found that the flink operator manager pods occasionally go into CrashLoopBackOff status in a couple of situations:

  1. When I update a running flink pipeline, I get a timeout error from the operator and the operator pods show CrashLoopBackOff status. I have to manually delete the operator pods, after which a new one is created in Running status.
  2. The operator pods crash and get stuck every 6-12 hours, and pipelines deployed by the operator are also impacted. Sometimes they recover automatically, but sometimes I need to manually delete the operator pods to force recreation.
  3. I found that flink pipelines deployed by the operator redeploy every few hours, and sometimes the redeploy fails because of the operator crash.

The status of the operator is as follows:

    ITUS000040-MAC:kubectl huiyang$ kubectl get pods,svc -n flink-operator-system
    NAME                                                     READY   STATUS             RESTARTS   AGE
    pod/flink-operator-controller-manager-6886b99d68-2ktzd   1/2     CrashLoopBackOff   112        43h

As you can see, the pod has restarted many times since it was deployed, and one of its two containers is crashing. I checked the operator pod logs but did not find any specific error. A couple of times I saw it in OOM status before the crash, so maybe it has a memory leak?

I enabled auto savepointing; I'm not sure whether that has an impact.

Can you suggest how to debug this issue?

Thanks,

Hui
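
One way to check the OOM suspicion described above is to ask Kubernetes why each container in the pod last terminated, and to read the logs of the crashed instance rather than the freshly restarted one. The following is a minimal sketch, not an official procedure: the function names are hypothetical, and the default namespace `flink-operator-system` is taken from the `kubectl get pods` command in the report.

```shell
# Hypothetical helper: print each container's name and the reason its
# previous instance terminated (for example OOMKilled or Error).
last_termination_reasons() {
  pod="$1"
  ns="${2:-flink-operator-system}"   # namespace from the report; override if yours differs
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Hypothetical helper: fetch logs from the previous (crashed) instance of
# every container in the pod instead of the currently running one.
previous_logs() {
  pod="$1"
  ns="${2:-flink-operator-system}"
  kubectl logs "$pod" -n "$ns" --previous --all-containers
}
```

For example, `last_termination_reasons flink-operator-controller-manager-6886b99d68-2ktzd` would list one line per container. A reason of `OOMKilled` points at the memory limit rather than at an application error, which would also be consistent with the logs showing nothing specific. Note that `--previous` may report an error for a container that has never restarted.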

yanghui16355 commented 3 years ago

@functicons

yanghui16355 commented 3 years ago

I fixed it by increasing the resources for the operator manager pod. The initially allocated memory is only 20Mi, which looks too low for it to be stable. FYI @functicons

ckdarby commented 3 years ago

@yanghui16355 How did you increase it? I don't see the operator manager pod itself exposing resources in the CRD.

I only see that the job and task specs allow resources to be specified.

yanghui16355 commented 3 years ago

@ckdarby I changed the operator source code and built the image myself.

pashtet04 commented 2 years ago

Fixed it by increasing the memory requests and limits to 128Mi and 256Mi respectively: https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/0310df76d6e2128cd5d2bc51fae4e842d370c463/helm-chart/flink-operator/templates/flink-operator.yaml#L345-L351
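
For an operator that is already running, the same memory values can be tried on the live Deployment without editing the chart or rebuilding the image. This is a sketch under assumptions: the Deployment name `flink-operator-controller-manager` is inferred from the pod name in the original report, and the function name is hypothetical.

```shell
# Hypothetical helper: set memory requests/limits on the operator Deployment.
# Defaults are the 128Mi / 256Mi values reported to work in this thread.
set_operator_memory() {
  request="${1:-128Mi}"
  limit="${2:-256Mi}"
  ns="${3:-flink-operator-system}"
  # Without -c/--containers, kubectl applies the values to every container
  # in the pod template; pass -c <name> to restrict it to one container.
  kubectl set resources deployment flink-operator-controller-manager \
    -n "$ns" \
    --requests="memory=$request" \
    --limits="memory=$limit"
}
```

Calling `set_operator_memory` with no arguments applies the defaults and triggers a rollout of the Deployment. The change lives only on the cluster object, so re-applying the original manifests or running a later `helm upgrade` is likely to reset it.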

ckdarby commented 2 years ago

@pashtet04 I believe those values aren't exposed as part of the chart, but I'm using helm's --post-renderer to pipe the rendered manifests into kustomize, where we're able to change the manifest templates.
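
The post-renderer approach mentioned above can be sketched roughly as follows. Helm pipes the rendered manifests to the post-renderer's stdin and reads the final manifests from its stdout, so a small wrapper can save the input to a file and let kustomize patch it. Everything specific here is an assumption for illustration: the function names, the Deployment name (inferred from the pod name in the report), and especially the container index `0`, which should be checked against the rendered manifest. The patch also assumes the container already declares `resources.requests` and `resources.limits`.

```shell
# Hypothetical helper: write a kustomization that raises the operator's memory.
write_kustomization() {
  dir="$1"
  cat > "$dir/kustomization.yaml" <<'EOF'
resources:
  - all.yaml
patches:
  - target:
      kind: Deployment
      name: flink-operator-controller-manager   # assumed Deployment name
    patch: |-
      # containers/0 is an assumption; verify which index is the operator container
      - op: add
        path: /spec/template/spec/containers/0/resources/requests/memory
        value: 128Mi
      - op: add
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 256Mi
EOF
}

# Hypothetical post-renderer body: stdin -> all.yaml -> kustomize build -> stdout.
post_render() {
  dir="$(mktemp -d)"
  cat > "$dir/all.yaml"          # rendered manifests from Helm
  write_kustomization "$dir"
  kustomize build "$dir"         # patched manifests go back to Helm
  status=$?
  rm -rf "$dir"
  return $status
}

# When saving this as the executable post-renderer script, end it with:
# post_render
```

With the script saved as an executable file, it would be passed to Helm through the `--post-renderer` flag mentioned above. Patching by container index keeps the example short, but it silently targets the wrong container if the order in the template differs, so inspecting the output of `helm template` first is worthwhile.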