surajpuvvada closed this issue 4 years ago
Recovery from checkpoints happens at a lower level, inside Flink itself, and is transparent to the operator. The savepoint-based recovery mechanism provided by the operator works at a higher level: it primarily handles failures that checkpoints cannot recover from and that cause the whole job to fail, e.g., transient errors caused by external dependencies.
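To make the two levels concrete, here is a toy model in Python. It is only a sketch, not the operator's or Flink's actual code: `ToyJob`, `max_restarts`, and `on_task_failure` are hypothetical names, and the assumption that the job fails as a whole once a fixed restart budget is used up is a simplification for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyJob:
    """Toy model of two-level recovery (hypothetical, not a real Flink API)."""

    max_restarts: int = 3  # assumed restart budget at the lower (Flink) level
    restarts_used: int = 0
    log: list = field(default_factory=list)

    def on_task_failure(self) -> str:
        """Return which level handles this failure: 'checkpoint' or 'savepoint'."""
        if self.restarts_used < self.max_restarts:
            # Lower level: the job restarts from its latest checkpoint.
            # The operator is not involved and does not see this.
            self.restarts_used += 1
            self.log.append("flink: restart from latest checkpoint")
            return "checkpoint"
        # Budget exhausted: the whole job has failed, so the higher level
        # (the operator) recreates the cluster and resumes from a savepoint.
        self.restarts_used = 0
        self.log.append("operator: recreate cluster, resume from savepoint")
        return "savepoint"


if __name__ == "__main__":
    job = ToyJob(max_restarts=2)
    for _ in range(4):
        print(job.on_task_failure())
```

In this model the more expensive savepoint path is only reached after the cheaper checkpoint path has given up, which is the division of labor described above.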
Hello
I have been reading online that Flink supports task/job failure recovery from checkpoints. I'm curious why that option doesn't exist in the operator. Currently it cancels the job, recreates the cluster, and resumes from a savepoint. Savepoints are said to be more expensive than checkpoints, so couldn't resuming the entire job from a savepoint be expensive?
Thanks