Closed: morelina closed this issue 3 years ago

Hi,

I am running the latest version of flink-operator. I have a JobCluster, but now, if the JobManager is restarted for any reason, the FlinkCluster cannot recover. The operator shows this in the logs:

INFO controllers.FlinkCluster Desired state {"cluster": "namespace-a/job-a", "Job": "nil"}

Why is the desired state of the Job nil? Also, the FlinkCluster object shows the Job status as "Lost".

In previous versions (when the submitter job never completed), the operator had no problem and would create a new Job if something went wrong.

Thanks!
The operator recovers the job under the following conditions. Could you check whether your setup meets them?

- spec.job.savepointsDir is set.
- spec.job.restartPolicy is set to FromSavepointOnFailure. The default is Never, in which case the job is not recovered and remains in the Lost state in the event of a JobManager failure.
- status.job.savepointLocation is recorded (it is required for job recovery).

Also note that the behavior of the job submitter has changed due to the issues described in the PR below:
https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/pull/379
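For reference, here is a minimal FlinkCluster sketch with the recovery-related fields set. The name, Flink image, jar path, and savepoint location are placeholders, not values from this issue:

```yaml
# Minimal sketch, not a full manifest; names, image, and paths are placeholders.
apiVersion: flinkoperator.k8s.io/v1beta1
kind: FlinkCluster
metadata:
  name: flinkcluster-a
spec:
  image:
    name: flink:1.14.2                        # placeholder Flink image
  job:
    jarFile: /opt/flink/job.jar               # placeholder job jar
    savepointsDir: gs://my-bucket/savepoints  # condition 1: must be set
    restartPolicy: FromSavepointOnFailure     # condition 2: the default is Never
  jobManager:
    replicas: 1
  taskManager:
    replicas: 2
```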
savepointsDir and restartPolicy are properly set. I think the problem was that the cluster was newly deployed, so there was no successful savepoint yet when the JobManager restarted. So what I did was take a savepoint first.
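(For context, a sketch of how a manual savepoint can be requested with this operator, assuming the user-control annotation it documents; verify the annotation key against your operator version:)

```yaml
# Sketch: request a manual savepoint by annotating the FlinkCluster.
# Imperative equivalent (same assumption about the annotation key):
#   kubectl annotate flinkcluster flinkcluster-a \
#     flinkclusters.flinkoperator.k8s.io/user-control=savepoint
metadata:
  annotations:
    flinkclusters.flinkoperator.k8s.io/user-control: savepoint
```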
Now I have another issue. It seems that after taking the savepoint, the JobManager indeed recovers the job. However, the operator does not recognize the JobID: it cancels the job and subsequently deletes the cluster.
INFO controllers.FlinkCluster Cancelling unexpected running job(s) {"cluster": "namespace-a/flinkcluster-a"}
INFO controllers.FlinkCluster Cancel running job {"cluster": "namespace-a/flinkcluster-a", "jobID": "230523442346adce8e31243e2ad8534"}
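What seems to matter here is the job ID the operator has recorded in the FlinkCluster status: a running job whose ID does not match it is treated as unexpected and cancelled, as in the log above. A sketch of the relevant status stanza, as I understand the v1beta1 CRD, with hypothetical values:

```yaml
# Hypothetical illustration; field layout per the v1beta1 CRD as I
# understand it, values are placeholders.
status:
  components:
    job:
      id: <recorded-job-id>   # a running job with a different ID is cancelled
      state: Running
      savepointLocation: gs://my-bucket/savepoints/savepoint-xxxx
```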
Thank you for checking. I figured out the problem and created a PR. I built the operator image metatronapp/flink-operator:v1beta1-20220202 from the PR branch and tested with it.
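(If anyone else wants to test with that image, one way is to patch the operator Deployment. The container name and Deployment layout below are assumptions about a default install; adjust them to match yours:)

```yaml
# Hypothetical strategic-merge patch to run the custom-built image;
# the container name depends on how the operator was installed.
spec:
  template:
    spec:
      containers:
        - name: flink-operator
          image: metatronapp/flink-operator:v1beta1-20220202
```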
I just checked, and now it properly resubmits the job:
Normal  ControlRequested  2m44s (x2 over 2m55s)  FlinkOperator  Requested new user control savepoint
Normal  SavepointCreated  2m44s                  FlinkOperator  Successfully savepoint created
Normal  ControlSucceeded  2m44s                  FlinkOperator  Succesfully completed user control savepoint
Normal  StatusUpdate      77s (x3 over 5m35s)    FlinkOperator  JobManager StatefulSet status changed: NotReady -> Ready
Normal  StatusUpdate      62s (x4 over 119s)     FlinkOperator  (combined from similar events): Job status changed: Running -> Lost
Normal  StatusUpdate      8s (x2 over 4m29s)     FlinkOperator  Job status changed: Pending -> Running
Thank you!