GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
Apache License 2.0
658 stars 266 forks source link

Job is lost after JobManager restart #398

Closed morelina closed 3 years ago

morelina commented 3 years ago

Hi,

I am running the latest version of flink-operator. I have a JobCluster, but now, if the JobManager is restarted for any reason, the FlinkCluster cannot recover. Operator shows this in the logs:

INFO controllers.FlinkCluster Desired state {"cluster": "namespace-a/job-a", "Job": "nil"}

Why is the Desired State of Job null? Also, the FlinkCluster object shows Job Status "Lost".

In previous versions (when the submitter job never completed) the operator had no problem and would create a new Job if something went wrong.

Thanks!

elanv commented 3 years ago

The operator recovers the job under the following conditions. Could you check if it meets the conditions below?

The behavior of job submitter has been changed due to the issues described in the PR below. https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/pull/379

morelina commented 3 years ago

savepointsDir and restartPolicy are properly set. I thought the problem then was that the Cluster was newly deployed, and there was still no successful savepoint when the JobManager restarted. So, what I have done:

Now I have another issue. It seems that after taking the savepoint, the JobManager indeed recovers the Job. However, the operator does not recognize the JobID and cancels the job, and subsequently, deletes the cluster.

INFO controllers.FlinkCluster Cancelling unexpected running job(s) {"cluster": "namespace-a/flinkcluster-a"} INFO controllers.FlinkCluster Cancel running job {"cluster": "namespace-a/flinkcluster-a", "jobID": "230523442346adce8e31243e2ad8534"}

elanv commented 3 years ago

Thank you for checking. I figured out the problem and created a PR. I have built operator image metatronapp/flink-operator:v1beta1-20220202 from the PR branch and tested with it.

morelina commented 3 years ago

I just checked, and now it properly resubmits the job:

Normal ControlRequested 2m44s (x2 over 2m55s) FlinkOperator Requested new user control savepoint Normal SavepointCreated 2m44s FlinkOperator Successfully savepoint created Normal ControlSucceeded 2m44s FlinkOperator Succesfully completed user control savepoint Normal StatusUpdate 77s (x3 over 5m35s) FlinkOperator JobManager StatefulSet status changed: NotReady -> Ready Normal StatusUpdate 62s (x4 over 119s) FlinkOperator (combined from similar events): Job status changed: Running -> Lost Normal StatusUpdate 8s (x2 over 4m29s) FlinkOperator Job status changed: Pending -> Running

Thank you!