GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
Apache License 2.0
658 stars, 266 forks

Support seamless job update, job history management and manual job rollback #183

Open elanv opened 4 years ago

elanv commented 4 years ago

Features

CR changes

Job update process summary

Job rollback policy

elanv commented 4 years ago

@functicons I have written up the details of the seamless job update. May I ask for a review? If you agree, I would like to be assigned to this issue and contribute this feature.

If you see these implementation details differently or have your own plan, please feel free to share it.

functicons commented 4 years ago

@elanv Thanks for the fantastic proposal! As I understand it, though, it might also increase the complexity of the operator. So first I would like to learn how you compare (a) adding this feature to the operator itself with (b) writing a script that automates the job upgrade: it first takes a savepoint, stops the existing job, and then creates a new job from the savepoint. If the feature could be implemented on top of the operator as a script, we might consider adding such a script instead; it would be easier to maintain and would keep the core operator simple. What do you think?
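
For illustration only, option (b) could look roughly like the Python sketch below. It is not part of the proposal or of the operator. It assumes the JobManager REST endpoint is reachable, uses Flink's REST savepoint endpoints to take a savepoint and cancel the job in one request, and then builds a patch for the new job. The helper names (`make_http`, `savepoint_and_cancel`, `restart_patch`) are hypothetical, and the `spec.job.fromSavepoint` field is an assumption to check against the FlinkCluster CRD version in use.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# (method, path, body) -> decoded JSON response
Http = Callable[[str, str, Optional[dict]], dict]


def make_http(base_url: str) -> Http:
    """Return a tiny JSON client bound to a JobManager REST endpoint."""
    def call(method: str, path: str, body: Optional[dict] = None) -> dict:
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(
            base_url.rstrip("/") + path, data=data, method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read() or b"{}")
    return call


def savepoint_and_cancel(http: Http, job_id: str, target_dir: str,
                         poll_seconds: float = 2.0,
                         timeout_seconds: float = 300.0) -> str:
    """Trigger a savepoint that also cancels the job; return its location."""
    resp = http("POST", f"/jobs/{job_id}/savepoints",
                {"target-directory": target_dir, "cancel-job": True})
    trigger_id = resp["request-id"]
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = http("GET", f"/jobs/{job_id}/savepoints/{trigger_id}", None)
        if status["status"]["id"] == "COMPLETED":
            operation = status.get("operation", {})
            if "location" in operation:
                return operation["location"]
            raise RuntimeError(f"savepoint failed: {operation.get('failure-cause')}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"savepoint {trigger_id} did not complete in time")


def restart_patch(savepoint_location: str) -> dict:
    """Merge patch that starts the new job from the savepoint.

    Assumption: the FlinkCluster CR exposes spec.job.fromSavepoint.
    """
    return {"spec": {"job": {"fromSavepoint": savepoint_location}}}
```

A caller would run `savepoint_and_cancel(make_http(<jobmanager-url>), <job-id>, <savepoint-dir>)` and apply the resulting patch, together with the new job spec, to a new or recreated FlinkCluster resource. The script has to handle polling, timeouts, and failure on its own, which is the part the proposal would move into the operator.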

elanv commented 4 years ago

I agree that it's important to keep it simple. That is why I excluded rollback handling for the case where an update fails.

If you have an environment where scripts can be run, I think automation scripts will be enough. However, I drive Flink job control from a Java application, and I think there may be demand for handling Flink job updates from applications written in other languages as well. Without this feature, each of those applications carries the burden of implementing the update process on its own; if it is included in the operator, it will help many more people.

functicons commented 4 years ago

Sounds good, go ahead! The design looks good to me at a high level. I am just curious why you use an annotation, flinkclusters.flinkoperator.k8s.io/revision, instead of a field in the job status?

elanv commented 4 years ago

Thanks for your review!

I left that explanation out. The revision is an annotation because, to go back to a specific job version, you have to write that version into the FlinkCluster CR. Since status holds observed information, it seems more appropriate to use an annotation for modifying the revision. It could also be added to the CR spec, but spec fields are declarative and the rollback action is imperative, so I thought an annotation would be the better fit.
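
As a rough illustration of that idea (a sketch, not the implemented behavior), a client could request a rollback by patching only the annotation and leaving spec and status alone. The annotation key comes from the discussion above. The helper name `rollback_patch` and the revision value `"3"` are hypothetical, since the thread does not specify the value format.

```python
import json

# Annotation key taken from the discussion above.
REVISION_ANNOTATION = "flinkclusters.flinkoperator.k8s.io/revision"


def rollback_patch(revision: str) -> dict:
    """Build a JSON merge patch that sets only the revision annotation.

    The revision string format is an assumption; the thread does not define it.
    """
    return {"metadata": {"annotations": {REVISION_ANNOTATION: revision}}}


# Hypothetical revision identifier, for illustration only.
patch_json = json.dumps(rollback_patch("3"))
```

The resulting JSON could be applied with `kubectl patch` using `--type merge` against the FlinkCluster resource. Because the patch carries no spec fields, the declared job spec stays untouched and the annotation acts as a one-off command.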