Failure recovery using checkpoints

GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.

Apache License 2.0


Failure recovery using checkpoints #297

Closed surajpuvvada closed 4 years ago

surajpuvvada commented 4 years ago

Hello

I have been reading that Flink supports recovering from task and job failures using checkpoints. Why doesn't that option exist in the operator? Currently it cancels the job, recreates the cluster, and resumes from a savepoint. Savepoints are said to be more expensive than checkpoints, so couldn't resuming the entire job from a savepoint be expensive?

Thanks

functicons commented 4 years ago

Recovery from checkpoints happens at a lower level, inside Flink, and is transparent to the operator. The savepoint and recovery mechanism provided by the operator works at a higher level: it primarily deals with failures that checkpoints cannot handle and that cause the whole job to fail, e.g., transient errors caused by external dependencies.
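
To make the two levels concrete, here is a rough toy model in Python. It is only a sketch: `ToyJob`, `max_task_restarts`, and the log strings are hypothetical names invented for illustration, and none of this is Flink's or the operator's actual API. It assumes a failure is first retried from the latest checkpoint, and only once that retry budget is used up does the job fail as a whole and get resubmitted from the last savepoint.

```python
from dataclasses import dataclass, field


@dataclass
class ToyJob:
    """Toy model of a streaming job (hypothetical, not a real Flink API)."""
    max_task_restarts: int = 2   # assumed retry budget before the whole job fails
    position: int = 0            # how far the job has processed its input
    last_checkpoint: int = 0     # position covered by the latest checkpoint
    last_savepoint: int = 0      # position covered by the latest savepoint
    task_restarts: int = 0
    log: list = field(default_factory=list)

    def process(self, n: int) -> None:
        self.position += n

    def checkpoint(self) -> None:
        self.last_checkpoint = self.position

    def savepoint(self) -> None:
        self.last_savepoint = self.position

    def on_failure(self) -> str:
        if self.task_restarts < self.max_task_restarts:
            # Lower level: restart from the latest checkpoint,
            # without the operator being involved.
            self.task_restarts += 1
            self.position = self.last_checkpoint
            self.log.append("restart-from-checkpoint")
        else:
            # Higher level: the whole job has failed, so it is
            # resubmitted from the last savepoint.
            self.task_restarts = 0
            self.position = self.last_savepoint
            self.last_checkpoint = self.last_savepoint
            self.log.append("resubmit-from-savepoint")
        return self.log[-1]


job = ToyJob(max_task_restarts=1)
job.process(10)
job.savepoint()      # savepoint covers position 10
job.process(5)
job.checkpoint()     # checkpoint covers position 15
job.process(3)       # position is now 18

job.on_failure()     # first failure: back to the checkpoint, position 15
job.on_failure()     # retry budget exhausted: back to the savepoint, position 10
```

In this model the savepoint path only runs after the checkpoint path has given up, which is why the two mechanisms complement each other rather than compete.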