Open pmalek opened 1 year ago
I wasn't able to find this stated directly in the api-server documentation, but it appears from the code that there's garbage collection which prunes the oldest resource versions. It runs every 5 minutes and is not configurable. That makes storing ResourceVersion and relying on it to get the proper "live" spec not robust enough, as it would break if the persisted version was garbage collected.
There's also this reply under a k8s issue which confirms this behavior.
I think we have to make our own way to persist the "live" spec explicitly in DataPlane. I'll try to go with storing the whole spec as a JSON blob.
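A minimal sketch of what persisting the "live" spec as a JSON blob could look like. Everything named here is an assumption made for illustration: `DataPlaneSpec` is a simplified stand-in for the real type, `LiveSpecAnnotation` is a made-up key, and storing the blob in an annotation is only one possible place to keep it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LiveSpecAnnotation is a hypothetical annotation key, used only for illustration.
const LiveSpecAnnotation = "example.konghq.com/live-spec"

// DataPlaneSpec is a simplified, hypothetical stand-in for the real DataPlane spec.
type DataPlaneSpec struct {
	Image    string `json:"image"`
	Replicas int32  `json:"replicas"`
}

// EncodeLiveSpec serializes the spec into a JSON string that can be persisted.
func EncodeLiveSpec(spec DataPlaneSpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", fmt.Errorf("encoding live spec: %w", err)
	}
	return string(b), nil
}

// DecodeLiveSpec restores the spec from a previously persisted JSON string.
func DecodeLiveSpec(blob string) (DataPlaneSpec, error) {
	var spec DataPlaneSpec
	if err := json.Unmarshal([]byte(blob), &spec); err != nil {
		return DataPlaneSpec{}, fmt.Errorf("decoding live spec: %w", err)
	}
	return spec, nil
}

func main() {
	// The map stands in for the object's annotations.
	annotations := map[string]string{}

	blob, err := EncodeLiveSpec(DataPlaneSpec{Image: "kong:3.4", Replicas: 2})
	if err != nil {
		panic(err)
	}
	annotations[LiveSpecAnnotation] = blob

	restored, err := DecodeLiveSpec(annotations[LiveSpecAnnotation])
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", restored)
}
```

Unlike a stored ResourceVersion, the blob does not depend on the api-server still holding an old revision, so it survives the garbage collection described above.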
I'm not saying we do this now but the CRD to be implemented in Kong/gateway-operator#159 has the potential for serving the purpose of holding said spec.
It would also be easier for the user to reason about which spec was used in a particular rollout.
Yeah, definitely that's a good idea to try to mix those two to not repeat the same job twice. 👍 I'll see if I can make it a minimally viable solution that would just carry the spec for now, making it ready for extension.
Together with @pmalek we came to the following conclusions:
Since https://github.com/Kong/gateway-operator/issues/1048 will, for now, be a simpler solution to the problem of accidental removals of DataPlane-owned resources, I'm moving this one out of the Cloud Gateways Phase 0 milestone @pmalek.
Problem statement
Kong/gateway-operator-archive#91 introduced the "self-healing" concept: the operator not only updates the subresources it manages when a configuration drift happens, but also recreates them when they get deleted for some reason. So e.g. a DataPlane Deployment would get recreated whenever it got deleted.
After the introduction of the BlueGreen DataPlane controller this stopped being the case when said controller is enabled, because the execution of a reconciliation is only delegated to DataPlaneReconciler under concrete conditions: currently whenever a DataPlane doesn't have a BlueGreen rollout strategy defined and whenever it's "not ready": https://github.com/Kong/gateway-operator/blob/4ec986a1c32edbd0cab2a7817ab18d17d50b625d/controllers/dataplane_bluegreen_controller.go#L80-L89
What works now with regards to the above:
This issue tracks the effort of re-introducing the self-healing aspect to DataPlanes with BlueGreen rollout strategy for "live" resources.
Proposed solution(s)
Additional information
Blocked by https://github.com/Kong/gateway-operator/issues/1031.
Acceptance criteria
DataPlane subresources whenever they are deleted
DataPlane subresources whenever they are changed without promotion
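For illustration only, here is a rough sketch of the decision the two criteria above imply, assuming the "live" spec has been persisted as a JSON blob. `DeploymentSpec`, `Action` and `DecideHealing` are hypothetical names that do not come from the operator's code, and the real spec has many more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeploymentSpec is a simplified, hypothetical stand-in for a "live" subresource spec.
type DeploymentSpec struct {
	Image    string `json:"image"`
	Replicas int32  `json:"replicas"`
}

// Action describes what a self-healing step would do (hypothetical type).
type Action string

const (
	ActionNone     Action = "none"     // observed state matches the persisted live spec
	ActionRecreate Action = "recreate" // subresource is missing and should be recreated
	ActionRevert   Action = "revert"   // subresource drifted without a promotion
)

// DecideHealing compares the persisted live spec with the observed subresource.
// A nil observed value means the subresource was deleted.
func DecideHealing(persistedLive string, observed *DeploymentSpec) (Action, error) {
	var want DeploymentSpec
	if err := json.Unmarshal([]byte(persistedLive), &want); err != nil {
		return ActionNone, fmt.Errorf("decoding persisted live spec: %w", err)
	}
	if observed == nil {
		return ActionRecreate, nil
	}
	if *observed != want {
		return ActionRevert, nil
	}
	return ActionNone, nil
}

func main() {
	live := `{"image":"kong:3.4","replicas":2}`

	// The Deployment was deleted.
	action, err := DecideHealing(live, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(action)

	// The Deployment was changed without promotion.
	action, err = DecideHealing(live, &DeploymentSpec{Image: "kong:3.4", Replicas: 5})
	if err != nil {
		panic(err)
	}
	fmt.Println(action)
}
```

Keeping the decision separate from the code that applies it makes both cases easy to test without a cluster.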