GoogleCloudPlatform / k8s-config-connector

GCP Config Connector, a Kubernetes add-on for managing GCP resources
https://cloud.google.com/config-connector/docs/overview
Apache License 2.0

Ready condition not stable even if desired and actual state are in sync #387

Open yhrn opened 3 years ago

yhrn commented 3 years ago

Describe the bug: KCC seems to set the Ready condition to false while reconciling a resource even if nothing has changed.

From a pure correctness perspective this seems wrong, as conditions should convey information about the resource and not about the controller of the resource; as a user I don't care that the reconcile loop happens to be running.

From a more practical perspective this is very problematic for us when managing larger groups of resources together. We currently use something like kpt live apply --reconcile-timeout=10m to apply ~1000 buckets/topics/subscriptions at a time. The second phase, where kpt waits for reconciliation, takes much longer than it should when nothing has changed because it always finds resources that are not ready. We are also considering moving this logic cluster-side using some grouping resource type whose status would be the rolled-up status of a number of grouped resources, but we would face a similar problem there: the rolled-up status would basically never be ready.

ConfigConnector Version 1.36.0

To Reproduce: Observe KCC resources over time without changing anything; the Ready condition will flip to false and then quickly back to true. I have observed this for GCS buckets and Pub/Sub subscriptions, but I suspect the behavior is the same for all KCC resource types.

xiaobaitusi commented 3 years ago

Hi @yhrn, thanks for reaching out.

Observe KCC resources over time without changing anything; the Ready condition will flip to false and then quickly back to true.

If you grab the events in that time window, do you see both an Updating event and an UpToDate event? It might be caused by Config Connector renewing the label lease if you have cnrm.cloud.google.com/management-conflict-prevention-policy: "resource" on your resources.

Normally Config Connector will only set the Ready condition to false if the underlying resource is being updated or it cannot be updated to the desired state.
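
For reference, the annotation lives in the resource's metadata. A minimal sketch (the bucket name, namespace and spec below are hypothetical, not taken from this issue); "resource" enables the label lease whose periodic renewal is described above, while "none" disables it:

# Hypothetical StorageBucket showing the conflict-prevention annotation.
# With "resource", Config Connector maintains a lease label on the underlying
# bucket and renews it periodically; "none" disables this mechanism.
apiVersion: storage.cnrm.cloud.google.com/v1beta1
kind: StorageBucket
metadata:
  name: example-bucket          # hypothetical name
  namespace: example-namespace  # hypothetical namespace
  annotations:
    cnrm.cloud.google.com/management-conflict-prevention-policy: "resource"
spec:
  uniformBucketLevelAccess: true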

yhrn commented 3 years ago

Hi @xiaobaitusi,

Well, I can't see the internals of Config Connector, but I can observe how the Ready condition flips to false and then back to true. We have a namespace with ~2000 buckets, and there we will almost always have some buckets that are not ready, even when I'm 100% certain that no updates have been made to the k8s resources or the underlying buckets.

milowe commented 3 years ago

@xiaobaitusi I can confirm that disabling conflict management by setting cnrm.cloud.google.com/management-conflict-prevention-policy: "none" significantly reduces the time it takes to apply the manifests with kpt live apply. I assume this improvement comes from removing the interference of the 20-minute periodic update that Config Connector performs on all resources to refresh the lease timestamp.

xiaobaitusi commented 3 years ago

Hi @milowe, thanks for confirming that cnrm.cloud.google.com/management-conflict-prevention-policy: "none" mitigates this issue. Do you still see the Ready condition flipping between false and true even after conflict management is disabled? If so, please provide concrete examples (for both StorageBucket and PubSubSubscription) with your settings, and we can investigate further.

daniellervik commented 3 years ago

Hi all,

To add a few thoughts from a config-connector end-user perspective: as a developer, it got me worried that our GCP resources were unstable or, worst case, being re-created when they flipped between Updating and UpToDate every now and then. Some IAMPolicyMember resources assigned to our PubSubSubscription resources would flip to DependencyNotFound and after a brief time go back to UpToDate (leaving me wondering whether the IAMPolicyMember was deactivated in between, or whether it was there and working as usual the whole time).

It took me some time to realize this is fine, and eventually I found out what was being updated in the Activity Details pane in the Cloud Console. There I noticed that the timing was pretty much exactly 20 minutes between each update call, and that the update field mask was "labels", most likely confirming what you mention: the lease label being updated is the only change (not really affecting the resource from my perspective).

It would be easier to understand either if this did not flip the state at all, or if there were a clearer reason/message for the change to the Updating state. Also, in the activity pane I only see that "labels" are updated, not exactly which one.

On the point of setting conflict management to 'none': wouldn't it be "wrong" to disable a feature that should be a good thing for us to have, just to remove the flickering in state? (Of course there's a performance impact of having the conflict policy disabled as well, but to me that's a slightly different discussion.)

Example output:

kubectl get pubsubsubscription -o=custom-columns='KIND:.kind,NAME:.metadata.name,READY:.status.conditions[*].status,REASON:.status.conditions[*].reason,MESSAGE:.status.conditions[*].message' -n <NAMESPACE> -w

KIND                 NAME            READY  REASON    MESSAGE
PubSubSubscription   subscription-1  True   UpToDate  The resource is up to date
PubSubSubscription   subscription-2  True   UpToDate  The resource is up to date
PubSubSubscription   subscription-1  False  Updating  Update in progress
PubSubSubscription   subscription-1  True   UpToDate  The resource is up to date
PubSubSubscription   subscription-2  False  Updating  Update in progress
PubSubSubscription   subscription-2  True   UpToDate  The resource is up to date

Getting the resources with -o json didn't, as far as I could tell, reveal anything more about what was updated.

Hope this is helpful and gives some additional perspective to the discussion!

yhrn commented 3 years ago

Just to clarify, we see setting cnrm.cloud.google.com/management-conflict-prevention-policy: "none" strictly as a workaround, not an acceptable long term solution.

toumorokoshi commented 3 years ago

I'm thinking a more apt title for this issue would be: "Label lease updates should not set Ready to false". Thoughts on that?

The argument that Ready should reflect the state of the underlying resource, rather than the operations that manage it, is a good one.

I presume the value of the current behavior is that it is technically correct: the resource is being updated by Config Connector. That said, I'm trying to think of use cases where you would need that type of behavior, and I'm not sure I can think of a good one.

@xiaobaitusi thoughts? It may be difficult for us to split reconciles specifically for lease updates and those for spec changes, but let's agree on the philosophy first.

yhrn commented 3 years ago

@toumorokoshi Since this is a public issue tracker, but KCC is closed source and pretty much a black box to those of us using it, I prefer the current title, which describes the observed, problematic behavior rather than the root cause (or one of the root causes).

I really think it is important that KCC only changes the Ready condition to false when it has detected a difference between the desired and actual state (regardless of why; whether the change was to the spec or to the underlying resource doesn't matter) and never because of the internal workings of the controller. In my opinion anything else breaks a well-established behavioral contract for k8s resources. Tools like Kpt (also built by Google) rely on this behavior to create aggregate statuses for groups of resources, etc. See cli-utils for more examples of how KCC resources should ideally behave (Reconciling and Stalled conditions, metadata.generation/status.observedGeneration, etc.)
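
For illustration, here is a rough sketch of the status shape those conventions expect (the values below are made up, not taken from an actual KCC resource). kstatus considers a resource "Current" when status.observedGeneration matches metadata.generation and no abnormal condition is true:

# Sketch of kstatus-friendly status reporting (hypothetical values).
metadata:
  generation: 3
status:
  observedGeneration: 3      # the controller has processed the latest spec
  conditions:
  - type: Ready              # KCC's existing condition; should stay True while in sync
    status: "True"
    reason: UpToDate
    message: The resource is up to date
  # Per the cli-utils conventions, transient work in progress would surface as a
  # Reconciling condition and terminal failures as a Stalled condition instead of
  # flipping Ready to False.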

toumorokoshi commented 3 years ago

Sorry, to clarify my stance, I agree with what you're saying. Just want to make sure that we don't impact some other workflow that needs to be notified when KCC is doing any state change on the resource, internal or otherwise.

Practically speaking, would it be sufficient to only modify the behavior of the ready field? It seems like Config Connector's status behavior needs to completely adhere to kstatus / kpt conventions to ensure that integration with kpt works as expected.

Tools like Kpt (also built by Google) rely on this behavior to create aggregate statuses for groups of resources, etc. See cli-utils for more examples of how KCC resources should ideally behave (Reconciling and Stalled conditions, metadata.generation/status.observedGeneration, etc.)

This was reported as an internal issue, and we are planning on tackling that around the Q1/Q2 boundary. I've filed #410 as a public tracking ticket.

yhrn commented 3 years ago

@toumorokoshi Yes, as a first step, and to solve our really urgent problems, it would be sufficient to fix the behavior of the Ready condition, i.e. make it stay "True" until it has been determined for sure that the actual and desired states differ.

yhrn commented 3 years ago

Are there any updates on this issue?

xiaobaitusi commented 3 years ago

Yes, as a first step, and to solve our really urgent problems, it would be sufficient to fix the behavior of the Ready condition, i.e. make it stay "True" until it has been determined for sure that the actual and desired states differ.

With the v1.43.0 release, the label leaser is turned off by default, so in theory the Ready condition will remain true unless the user's desired state is detected to differ from the underlying resource.

See cli-utils for more examples of how KCC resources should ideally behave (Reconciling and Stalled conditions, metadata.generation/status.observedGeneration, etc.)

With the v1.46.0 release, KCC added an observedGeneration field to resource statuses, enabling compatibility with kstatus.

yhrn commented 3 years ago

@xiaobaitusi We are on 1.49.1 and unless we have cnrm.cloud.google.com/management-conflict-prevention-policy: "none" we still see the ready condition flapping, at least for buckets.

toumorokoshi commented 3 years ago

@xiaobaitusi We are on 1.49.1 and unless we have cnrm.cloud.google.com/management-conflict-prevention-policy: "none" we still see the ready condition flapping, at least for buckets.

@yhrn just to clarify: is this happening with brand new resources, or with old ones?

The code adds a default annotation when an object is created. I verified that new resources have the annotation set to "none" by default, but existing objects will keep the previously applied annotation (presumably "resource") if they were created with KCC below 1.43.

yhrn commented 3 years ago

@toumorokoshi Ok, I interpreted "label leaser is turned off by default" as meaning it would be turned off in the absence of the annotation. I can confirm that the annotation is added to new resources. My inclination is to keep this issue open, though, since this is still a problem if one wants to use conflict prevention. We currently don't, so from our perspective it has lower priority.

toumorokoshi commented 3 years ago

@toumorokoshi Ok, I interpreted "label leaser is turned off by default" as meaning it would be turned off in the absence of the annotation. I can confirm that the annotation is added to new resources. My inclination is to keep this issue open, though, since this is still a problem if one wants to use conflict prevention. We currently don't, so from our perspective it has lower priority.

Yes, KRM isn't super flexible in this regard: once an annotation is added, it's hard to tell where it came from, so we can't do something like "the user didn't set this value, so use the new default". I'd have to dig, but managedFields might be usable for this purpose.
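
To sketch the managedFields idea (the manager names below are hypothetical, purely for illustration): each field, including individual annotations, records which field manager last set it, so in principle a controller could distinguish an annotation it defaulted itself from one the user applied.

# Hypothetical managedFields metadata on a KCC resource.
metadata:
  managedFields:
  - manager: cnrm-controller-manager      # hypothetical: the controller defaulted the annotation
    operation: Update
    apiVersion: storage.cnrm.cloud.google.com/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:cnrm.cloud.google.com/management-conflict-prevention-policy: {}
  - manager: kubectl-client-side-apply    # hypothetical: fields the user applied
    operation: Update
    apiVersion: storage.cnrm.cloud.google.com/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:uniformBucketLevelAccess: {}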

+1 to leave this open until we get this working with the label leaser as well. But good to hear confirmation that at least for new resources things are working as currently intended.