Kong / gateway-operator

Kubernetes Operator for Kong Gateways
Apache License 2.0

`DataPlane`: add self healing for "live" resources when BG controller is enabled #160

Open pmalek opened 1 year ago

pmalek commented 1 year ago

Problem statement

Kong/gateway-operator-archive#91 introduced the "self-healing" concept: subresources managed by the operator are not only updated when a configuration drift happens, but also recreated when they get deleted for some reason. So e.g. a DataPlane's Deployment gets recreated whenever it is deleted.
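To make the "update on drift, recreate on deletion" behaviour concrete, here is a minimal sketch. It is not the operator's actual code: `Deployment`, `Cluster` and `ensureDeployment` are hypothetical names, the image value is only an example, and a plain in-memory map stands in for the API server.

```go
package main

import "fmt"

// Deployment is a hypothetical stand-in for an operator-managed subresource.
type Deployment struct {
	Name  string
	Image string
}

// Cluster is a hypothetical in-memory stand-in for the API server.
type Cluster map[string]Deployment

// ensureDeployment brings the cluster in line with the desired state and
// reports what it did: "created", "updated" or "unchanged".
func ensureDeployment(c Cluster, desired Deployment) string {
	current, found := c[desired.Name]
	switch {
	case !found:
		// Self-healing: the resource is missing (e.g. it was deleted), so recreate it.
		c[desired.Name] = desired
		return "created"
	case current != desired:
		// Configuration drift: overwrite with the desired state.
		c[desired.Name] = desired
		return "updated"
	default:
		return "unchanged"
	}
}

func main() {
	c := Cluster{}
	want := Deployment{Name: "dataplane", Image: "kong:3.4"} // example values only

	fmt.Println(ensureDeployment(c, want)) // created
	fmt.Println(ensureDeployment(c, want)) // unchanged

	delete(c, want.Name)                   // simulate an accidental deletion
	fmt.Println(ensureDeployment(c, want)) // created
}
```

The point of the sketch is that a missing resource and a drifted resource are handled by the same reconciliation pass, which is why skipping that pass loses both behaviours at once.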

After the introduction of the BlueGreen DataPlane controller, this stopped being the case when said controller is enabled, because reconciliation is only delegated to DataPlaneReconciler under specific conditions: currently, whenever a DataPlane doesn't have a BlueGreen rollout strategy defined and whenever it's "not ready": https://github.com/Kong/gateway-operator/blob/4ec986a1c32edbd0cab2a7817ab18d17d50b625d/controllers/dataplane_bluegreen_controller.go#L80-L89
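The delegation conditions can be sketched roughly as a predicate. This is a simplification based only on the description above, not a copy of the linked code: the `DataPlane` struct and its fields are hypothetical, and it assumes that either condition on its own is enough to delegate.

```go
package main

import "fmt"

// DataPlane is a hypothetical, heavily simplified stand-in for the real CRD type.
type DataPlane struct {
	HasBlueGreenStrategy bool // a BlueGreen rollout strategy is defined in the spec
	Ready                bool // the DataPlane is currently reported as ready
}

// delegatesToDataPlaneReconciler reports whether the BlueGreen controller would
// hand reconciliation over to DataPlaneReconciler.
// Assumption: either condition alone is sufficient.
func delegatesToDataPlaneReconciler(dp DataPlane) bool {
	return !dp.HasBlueGreenStrategy || !dp.Ready
}

func main() {
	noStrategy := DataPlane{HasBlueGreenStrategy: false, Ready: true}
	notReady := DataPlane{HasBlueGreenStrategy: true, Ready: false}
	readyBlueGreen := DataPlane{HasBlueGreenStrategy: true, Ready: true}

	fmt.Println(delegatesToDataPlaneReconciler(noStrategy))     // true
	fmt.Println(delegatesToDataPlaneReconciler(notReady))       // true
	fmt.Println(delegatesToDataPlaneReconciler(readyBlueGreen)) // false
}
```

Under this reading, the last case is the one this issue is about: a ready DataPlane with a BlueGreen strategy is never handed to DataPlaneReconciler, so nothing re-runs the self-healing logic for its "live" resources.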

What works now with regards to the above:

This issue tracks the effort of re-introducing the self-healing aspect to DataPlanes with BlueGreen rollout strategy for "live" resources.

Proposed solution(s)

Additional information

Blocked by https://github.com/Kong/gateway-operator/issues/1031.

Acceptance criteria

czeslavo commented 1 year ago

I wasn't able to find that stated directly in the api-server documentation, but it appears from the code that there's a garbage collection process which prunes the oldest resource versions. It runs every 5 minutes and is not configurable. That makes storing ResourceVersion and relying on it to get the proper "live" spec not robust enough, as it would break if the persisted version was garbage collected.

There's also this reply under a k8s issue which confirms this behavior.

I think we have to make our own way to persist the "live" spec explicitly in DataPlane. I'll try to go with storing the whole spec as a JSON blob.
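A minimal sketch of what persisting the spec as a JSON blob could look like. Everything named here is an assumption for illustration: `DataPlaneSpec` is a cut-down hypothetical type, and storing the blob under an annotation key (`example.com/live-spec`) is just one possible location; nothing above settles where in the DataPlane it would live.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DataPlaneSpec is a hypothetical, cut-down spec used only for this sketch.
type DataPlaneSpec struct {
	Image    string `json:"image"`
	Replicas int    `json:"replicas"`
}

// liveSpecKey is a hypothetical annotation key, not one the operator defines.
const liveSpecKey = "example.com/live-spec"

// persistLiveSpec serializes the spec and stores it in the annotations map.
func persistLiveSpec(annotations map[string]string, spec DataPlaneSpec) error {
	raw, err := json.Marshal(spec)
	if err != nil {
		return err
	}
	annotations[liveSpecKey] = string(raw)
	return nil
}

// loadLiveSpec returns the persisted spec; the bool reports whether one was stored.
func loadLiveSpec(annotations map[string]string) (DataPlaneSpec, bool, error) {
	raw, ok := annotations[liveSpecKey]
	if !ok {
		return DataPlaneSpec{}, false, nil
	}
	var spec DataPlaneSpec
	if err := json.Unmarshal([]byte(raw), &spec); err != nil {
		return DataPlaneSpec{}, true, err
	}
	return spec, true, nil
}

func main() {
	annotations := map[string]string{}
	live := DataPlaneSpec{Image: "kong:3.4", Replicas: 2} // example values only

	if err := persistLiveSpec(annotations, live); err != nil {
		panic(err)
	}
	fmt.Println(annotations[liveSpecKey]) // {"image":"kong:3.4","replicas":2}

	restored, found, err := loadLiveSpec(annotations)
	fmt.Println(restored == live, found, err) // true true <nil>
}
```

Unlike a stored ResourceVersion, the blob travels with the object itself, so recovering the "live" spec does not depend on an old version still being retrievable from the API server.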

pmalek commented 1 year ago

I'm not saying we do this now, but the CRD to be implemented in Kong/gateway-operator#159 could serve the purpose of holding said spec.

It would also make it easier for the user to reason about which spec was used in a particular rollout.

czeslavo commented 1 year ago

Yeah, definitely that's a good idea to try to mix those two to not repeat the same job twice. 👍 I'll see if I can make it a minimally viable solution that would just carry the spec for now, making it ready for extension.

czeslavo commented 1 year ago

Together with @pmalek we came to conclusions:

czeslavo commented 1 year ago

Since, for now, https://github.com/Kong/gateway-operator/issues/1048 will be a simpler solution to the problem of accidental removals of DataPlane-owned resources, I'm moving this one out of the Cloud Gateways Phase 0 milestone @pmalek.