GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
Apache License 2.0

How to recover the job manager from Checkpoints #361

Open shravangit20 opened 3 years ago

shravangit20 commented 3 years ago

Hi,

How do we recover the job manager from checkpoints instead of savepoints? If there are instructions or steps to follow, please share them.

Thanks, Shravan

functicons commented 3 years ago

Recovering from checkpoints is transparent to the operator; it is handled by Flink itself, so you don't need to worry about it.

shravangit20 commented 3 years ago

@functicons I am documenting resiliency testing by disrupting task managers and job managers, and I would like to understand how recovery happens. Is there a way you can help with my testing? Would it be possible to connect with you offline? Also, I have set up a 3-node ZooKeeper ensemble along with the operator and the Flink cluster, but I am having issues setting up the high-availability configuration needed for the disruption testing. I just need some pointers on these two items.
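
For reference, ZooKeeper-based high availability in Flink generally comes down to a handful of configuration keys. The sketch below only assembles those keys into a dictionary and prints them as `key: value` lines. The helper name `zookeeper_ha_properties`, the host names, the storage path, and the cluster ID are all made up for illustration. Placing the resulting entries under `flinkProperties` in the FlinkCluster spec is an assumption about how the operator forwards Flink settings, so check it against the operator version in use.

```python
from typing import Dict, List


def zookeeper_ha_properties(
    quorum_hosts: List[str], storage_dir: str, cluster_id: str
) -> Dict[str, str]:
    """Build the Flink settings for ZooKeeper-based HA (hypothetical helper)."""
    if not quorum_hosts:
        raise ValueError("at least one ZooKeeper host:port is required")
    return {
        # Selects ZooKeeper as the HA backend.
        "high-availability": "zookeeper",
        # Comma-separated host:port list of the ZooKeeper ensemble.
        "high-availability.zookeeper.quorum": ",".join(quorum_hosts),
        # Durable storage for job manager metadata; it must be reachable
        # from every job manager pod.
        "high-availability.storageDir": storage_dir,
        # Keeps several Flink clusters apart when they share one ensemble.
        "high-availability.cluster-id": cluster_id,
    }


if __name__ == "__main__":
    # All values below are placeholders, not taken from the thread.
    props = zookeeper_ha_properties(
        ["zk-0:2181", "zk-1:2181", "zk-2:2181"],
        "gs://my-bucket/flink/ha",
        "/my-flink-cluster",
    )
    for key, value in props.items():
        print(f"{key}: {value}")
```

With settings like these in place, deleting the job manager pod is a reasonable disruption test: the replacement job manager should pick up the job from the latest completed checkpoint recorded in the HA store.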

benkusak commented 2 years ago

@shravangit20 I would be interested in how (or if) you ultimately solved this. At present, when the job fails, I locate the last available checkpoint and feed it back manually through the fromSavepoint parameter.
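
The manual lookup described above could be scripted roughly as follows. This is a minimal sketch, not the operator's own mechanism. It assumes retained (externalized) checkpoints written in Flink's usual layout, `<checkpoint-dir>/<job-id>/chk-<n>/_metadata`, on a locally mounted filesystem. The function name `latest_checkpoint` and the example path are hypothetical, and object stores such as GCS or S3 would need their own listing call instead of `pathlib`.

```python
import re
from pathlib import Path
from typing import Optional

# Flink names completed checkpoint directories chk-<number>.
_CHK_DIR = re.compile(r"^chk-(\d+)$")


def latest_checkpoint(job_checkpoint_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered chk-<n> directory that
    contains a _metadata file, or None if there is none."""
    best_number = -1
    best_path = None
    for entry in Path(job_checkpoint_dir).iterdir():
        match = _CHK_DIR.match(entry.name)
        if match is None or not entry.is_dir():
            continue
        # A directory without _metadata is incomplete and cannot be restored.
        if not (entry / "_metadata").is_file():
            continue
        number = int(match.group(1))  # compare numerically: chk-10 > chk-9
        if number > best_number:
            best_number, best_path = number, entry
    return str(best_path) if best_path is not None else None


if __name__ == "__main__":
    # Placeholder path; substitute the job's real checkpoint directory.
    root = "/mnt/flink/checkpoints/my-job-id"
    if Path(root).is_dir():
        print(latest_checkpoint(root))
```

The returned path is the value that would then be supplied as `fromSavepoint`, since Flink can restore from a retained checkpoint in the same way as from a savepoint.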