Open · yanghui16355 opened this issue 3 years ago
I fixed it after increasing the resources for the operator manager pod; the initially allocated memory is only 20Mi, which looks too low for it to run stably. FYI @functicons
@yanghui16355 How did you increase them? I don't see the operator manager pod itself expose resources in the CRD; only the job and task managers allow resources to be specified.
@ckdarby I changed the operator's source code and built the image myself.
Fixed it by increasing the memory requests and limits to 128Mi and 256Mi respectively: https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/0310df76d6e2128cd5d2bc51fae4e842d370c463/helm-chart/flink-operator/templates/flink-operator.yaml#L345-L351
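For anyone who can't follow the link, the patched stanza on the manager container would look roughly like this (the memory values come from the comment above; the CPU values are placeholders I'm assuming):

```yaml
# Sketch of the manager container's resources after the change.
# Memory values are from the comment above; CPU values are assumed placeholders.
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 100m
    memory: 256Mi
```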
@pashtet04 I believe those values aren't exposed as part of the chart, but I'm using helm's --post-renderer to pipe the rendered manifests into kustomize, where we're able to patch the manifest templates.
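A minimal sketch of that setup, assuming the Deployment and container are named flink-operator-controller-manager and manager (names I'm inferring from the pod name below, not verified against the chart): helm is invoked with `--post-renderer ./kustomize-wrapper.sh`, where the wrapper is a tiny executable that saves stdin to all.yaml and runs `kustomize build .` against a kustomization.yaml along these lines:

```yaml
# kustomization.yaml consumed by the post-renderer wrapper.
# Assumes the wrapper wrote helm's rendered manifests to all.yaml;
# the Deployment/container names below are unverified assumptions.
resources:
  - all.yaml
patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: flink-operator-controller-manager
        namespace: flink-operator-system
      spec:
        template:
          spec:
            containers:
              - name: manager
                resources:
                  requests:
                    memory: 128Mi
                  limits:
                    memory: 256Mi
```

The advantage of this approach is that the resource overrides survive chart upgrades without forking the chart or rebuilding the operator image.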
I found that the Flink operator manager pods occasionally change to CrashLoopBackOff status in a couple of situations.
The status of the operator looks like the following:

```
ITUS000040-MAC:kubectl huiyang$ kubectl get pods,svc -n flink-operator-system
NAME                                                     READY   STATUS             RESTARTS   AGE
pod/flink-operator-controller-manager-6886b99d68-2ktzd   1/2     CrashLoopBackOff   112        43h
```
As you can see, it has restarted many times since it was deployed, and one of its containers keeps crashing. I checked the operator pod's logs but did not find any specific error. I saw a few times that it was in OOM status before crashing, so maybe it has a memory leak?
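For reference, these are the checks I can think of so far (the `-c manager` container name is a guess; the pod name is from the output above):

```sh
# Inspect the last terminated state; "Reason: OOMKilled" would confirm a memory issue
kubectl describe pod -n flink-operator-system \
    flink-operator-controller-manager-6886b99d68-2ktzd

# Read the logs of the previous, crashed container instance
kubectl logs -n flink-operator-system \
    flink-operator-controller-manager-6886b99d68-2ktzd -c manager --previous

# List recent events (OOM kills, restarts, probe failures) in the namespace
kubectl get events -n flink-operator-system --sort-by=.lastTimestamp
```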
I enabled auto savepointing; I'm not sure whether that has an impact.
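For context, this is roughly how auto savepointing is enabled in the FlinkCluster spec (a trimmed sketch; the image, interval, and savepoint directory are illustrative placeholders, not my exact values):

```yaml
# Trimmed FlinkCluster sketch showing the auto-savepointing knobs;
# all concrete values below are illustrative placeholders.
apiVersion: flinkoperator.k8s.io/v1beta1
kind: FlinkCluster
metadata:
  name: my-flink-job
spec:
  image:
    name: flink:1.11.2
  job:
    jarFile: ./examples/streaming/WordCount.jar
    autoSavepointSeconds: 300            # take a savepoint every 5 minutes
    savepointsDir: gs://my-bucket/savepoints
```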
Can you provide suggestions on how to debug this issue?
Thanks,
Hui