Read service events during deletion

timuthy commented 3 years ago

How to categorize this PR?

/area ops-productivity /kind enhancement /priority normal

What this PR does / why we need it: This PR lets the GRM read Service events in case it is stuck in deletion. The output of the two last warnings is added to the ManagedResource condition:

- lastTransitionTime: "2021-01-29T15:41:41Z"
      lastUpdateTime: "2021-01-29T16:01:45Z"
      message: |-
        Could not clean all old resources: 1 error occurred: deletion of old resource "v1/Service/default/test-svc-123" is still pending
        -> Events:
        * ipallocator-repair-controller reported 11s ago (x7 over 18m): Cluster IP 10.251.238.105 is not allocated; repairing
        * portallocator-repair-controller reported 11s ago (x7 over 18m): Port 30915 is not allocated; repairing
      reason: DeletionPending
      status: Progressing
      type: ResourcesApplied

The motivation for this change is to reveal more information about the root cause of the problem.

Special notes for your reviewer: I was thinking about reading warnings in general and not only for Services but OTHO this would mean more read activities and the relevant use case we know of today is about Services. Please let me know if we should change it or extend the list by further kinds.

PR will remain in draft state as long as https://github.com/gardener/gardener-resource-manager/pull/105 is not merged.

Release note:

Gardener-Resource-Manager now adds latest warning events to a ManagedResource's `.status.conditions` in case a Kubernetes `Service` cannot be deleted. This allows to get more context about the underlying problem e.g., when Cloud-Controller-Manager cannot delete the backing load balancer.

If a ManagedResource refers to a `Service` object of type `LoadBalancer`, the Gardener Resource Manager now regularly checks if the `Service` has an `Ingress` status and contributes the result of this check to the `ResourcesHealthy` condition.

rfranzke commented 3 years ago

/rebase

rfranzke commented 3 years ago

/assign

timebertt commented 3 years ago

What do you guys think about enhancing the health check controller as follows:

Check if Services of type LoadBalancer are healthy, i.e., whether their .status contains an IP or hostname.

If not, read the Events and propagate them?

Sounds like a good idea to me.

timuthy commented 3 years ago

Looks good in general. /invite @timebertt for /second-opinion

What do you guys think about enhancing the health check controller as follows:

Check if Services of type LoadBalancer are healthy, i.e., whether their .status contains an IP or hostname.

If not, read the Events and propagate them?

Yes, sounds reasonable. Should we then also only read events during deletion when the service is of type LB? It probably makes sense to handle these cases the same way, doesn't it?

rfranzke commented 3 years ago

Hm, yeah, maybe it's fair to start with this approach @timuthy. In the future, we might want switching to simply reading events always if a certain resource "is stuck in deletion" for "too long". However, let's see how it goes and get more experience if this would even help us.

timuthy commented 3 years ago

Hm, yeah, maybe it's fair to start with this approach @timuthy. In the future, we might want switching to simply reading events always if a certain resource "is stuck in deletion" for "too long". However, let's see how it goes and get more experience if this would even help us.

When we only wanted to read events of type LB during deletion, we'd need to read the whole object again because we don't have the service spec at hand. I'm not sure if it's worth it or if we by default try to read events for all sorts of services, i.e. it doesn't hurt, imho. Please let me know if that is critical from your point of view. Otherwise, the health checks have been enhanced by the requested feature.

timebertt commented 3 years ago

@timuthy do you want to keep the commits or squash? Feel free to merge, however you like.

gardener-attic / gardener-resource-manager

Read service events during deletion #106