GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
Apache License 2.0
658 stars, 266 forks

Fix savepoint problems #392

Closed: shashken closed this 3 years ago

shashken commented 3 years ago

I found 2 problems related to savepoints:

  1. When upgrading a job, there was no option to take a savepoint before the upgrade and restore from it. I added a flag to cover this case.

  2. When a cluster starts, it tries to take a savepoint, but the savepoint status is only updated once the savepoint completes. As a result, a new savepoint gets triggered while the previous one is still running, and this repeats indefinitely if your savepoints do not finish quickly. I solved this by adding a value that holds the savepoint trigger time and by increasing the savepoint timeout, so that a new savepoint is not triggered while one is still running.
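
The guard described in item 2 can be sketched roughly as follows. This is only an illustration in Python, not the operator's actual Go code: the function name, its parameters, and the 10-minute timeout in the usage note are all hypothetical. The idea is that a recorded trigger time lets the controller treat a savepoint as still in flight until it either completes or the timeout elapses.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_trigger_savepoint(
    now: datetime,
    last_trigger_time: Optional[datetime],
    last_savepoint_completed: bool,
    timeout: timedelta,
) -> bool:
    """Decide whether a new savepoint may be triggered.

    All names here are hypothetical; they only mirror the idea in the PR
    description (a stored trigger time plus a savepoint timeout).
    """
    # No savepoint has ever been triggered, so nothing can be in flight.
    if last_trigger_time is None:
        return True
    # The previous savepoint finished, so a new one is safe to start.
    if last_savepoint_completed:
        return True
    # The previous savepoint is still pending: only retrigger once the
    # timeout has elapsed, on the assumption that it has been lost.
    return now - last_trigger_time >= timeout
```

For example, with a timeout of `timedelta(minutes=10)`, a reconcile pass that runs 5 minutes after the trigger while the savepoint is still pending would skip triggering, whereas a pass at 10 minutes or later would be allowed to trigger again.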

@functicons I'd love to get your feedback on this. We might want to create a more robust solution later on, but for now savepoints are impossible to use with this operator if they take some time.

functicons commented 3 years ago

/gcbrun

functicons commented 3 years ago

Thanks for the PR, will review as soon as I get a chance.

functicons commented 3 years ago

/gcbrun