crossplane / crossplane-runtime

A set of libraries for writing Crossplane controllers.
https://crossplane.io
Apache License 2.0
158 stars 101 forks source link

Frequent reconciliations while waiting for external resource deletion #305

Open sergenyalcin opened 2 years ago

sergenyalcin commented 2 years ago

What happened?

While trying to delete a Managed Resource that I created on provider-jet-azure, I observed that the resource's controller reconciles very often. To see if this behavior is normal, I debug provider-jet-azure and observed that in the Reconcile function of crossplane-runtime, we set the Requeue value of the Result object to true after the deletion requested.

https://github.com/crossplane/crossplane-runtime/blob/21928d24c8b292a283794b1b0be1352e8b193b98/pkg/reconciler/managed/reconciler.go#L735

When I realized this situation, I expected to observe the same behavior in native providers and I did a debugging in provider-azure. I observed the same situation for native providers (frequent reconcile). However, after the second or third reconcile function, the sdk's get call no longer finds the resource in the cloud, so reconciling is not repeated too many times. Therefore, this is not a problem for native providers, as it does not reconcile too many times despite frequent reconciliations. However, since the deletion process in jet-providers takes a long time due to the terraform binary, we observe reconcile operation 5-10 times in a second for a few minutes.

2021-11-25T20:58:27.733+0300    DEBUG   provider-jet-azure      Reconciling     {"controller": "managed/postgresql.azure.jet.crossplane.io/v1alpha1, kind=configuration", "request": "/example-psql-configuration"}
2021-11-25T20:58:27.734+0300    DEBUG   provider-jet-azure      Successfully requested deletion of external resource    {"controller": "managed/postgresql.azure.jet.crossplane.io/v1alpha1, kind=configuration", "request": "/example-psql-configuration", "uid": "6854b1e0-cc0c-4020-b2a7-a80be394acf0", "version": "1295367", "external-name": "max_wal_senders", "deletion-timestamp": "2021-11-25 20:58:08 +0300 +03"}
2021-11-25T20:58:27.734+0300    DEBUG   controller-runtime.manager.events       Normal  {"object": {"kind":"Configuration","name":"example-psql-configuration","uid":"6854b1e0-cc0c-4020-b2a7-a80be394acf0","apiVersion":"postgresql.azure.jet.crossplane.io/v1alpha1","resourceVersion":"1295367"}, "reason": "DeletedExternalResource", "message": "Successfully requested deletion of external resource"}
2021-11-25T20:58:27.741+0300    DEBUG   provider-jet-azure      Reconciling     {"controller": "managed/postgresql.azure.jet.crossplane.io/v1alpha1, kind=configuration", "request": "/example-psql-configuration"}
2021-11-25T20:58:27.741+0300    DEBUG   provider-jet-azure      Successfully requested deletion of external resource    {"controller": "managed/postgresql.azure.jet.crossplane.io/v1alpha1, kind=configuration", "request": "/example-psql-configuration", "uid": "6854b1e0-cc0c-4020-b2a7-a80be394acf0", "version": "1295367", "external-name": "max_wal_senders", "deletion-timestamp": "2021-11-25 20:58:08 +0300 +03"}
2021-11-25T20:58:27.741+0300    DEBUG   controller-runtime.manager.events       Normal  {"object": {"kind":"Configuration","name":"example-psql-configuration","uid":"6854b1e0-cc0c-4020-b2a7-a80be394acf0","apiVersion":"postgresql.azure.jet.crossplane.io/v1alpha1","resourceVersion":"1295367"}, "reason": "DeletedExternalResource", "message": "Successfully requested deletion of external resource"}
2021-11-25T20:58:27.751+0300    DEBUG   provider-jet-azure      Reconciling     {"controller": "managed/postgresql.azure.jet.crossplane.io/v1alpha1, kind=configuration", "request": "/example-psql-configuration"}
2021-11-25T20:58:27.751+0300    DEBUG   provider-jet-azure      Successfully requested deletion of external resource    {"controller": "managed/postgresql.azure.jet.crossplane.io/v1alpha1, kind=configuration", "request": "/example-psql-configuration", "uid": "6854b1e0-cc0c-4020-b2a7-a80be394acf0", "version": "1295367", "external-name": "max_wal_senders", "deletion-timestamp": "2021-11-25 20:58:08 +0300 +03"}
2021-11-25T20:58:27.751+0300    DEBUG   controller-runtime.manager.events       Normal  {"object": {"kind":"Configuration","name":"example-psql-configuration","uid":"6854b1e0-cc0c-4020-b2a7-a80be394acf0","apiVersion":"postgresql.azure.jet.crossplane.io/v1alpha1","resourceVersion":"1295367"}, "reason": "DeletedExternalResource", "message": "Successfully requested deletion of external resource"}
2021-11-25T20:58:27.759+0300    DEBUG   provider-jet-azure      Reconciling     {"controller": "managed/postgresql.azure.jet.crossplane.io/v1alpha1, kind=configuration", "request": "/example-psql-configuration"}
2021-11-25T20:58:27.760+0300    DEBUG   provider-jet-azure      Successfully requested deletion of external resource    {"controller": "managed/postgresql.azure.jet.crossplane.io/v1alpha1, kind=configuration", "request": "/example-psql-configuration", "uid": "6854b1e0-cc0c-4020-b2a7-a80be394acf0", "version": "1295367", "external-name": "max_wal_senders", "deletion-timestamp": "2021-11-25 20:58:08 +0300 +03"}
2021-11-25T20:58:27.760+0300    DEBUG   controller-runtime.manager.events       Normal  {"object": {"kind":"Configuration","name":"example-psql-configuration","uid":"6854b1e0-cc0c-4020-b2a7-a80be394acf0","apiVersion":"postgresql.azure.jet.crossplane.io/v1alpha1","resourceVersion":"1295367"}, "reason": "DeletedExternalResource", "message": "Successfully requested deletion of external resource"}
2021-11-25T20:58:27.767+0300    DEBUG   provider-jet-azure      Reconciling     {"controller": "managed/postgresql.azure.jet.crossplane.io/v1alpha1, kind=configuration", "request": "/example-psql-configuration"}
2021-11-25T20:58:27.767+0300    DEBUG   provider-jet-azure      Successfully requested deletion of external resource    {"controller": "managed/postgresql.azure.jet.crossplane.io/v1alpha1, kind=configuration", "request": "/example-psql-configuration", "uid": "6854b1e0-cc0c-4020-b2a7-a80be394acf0", "version": "1295367", "external-name": "max_wal_senders", "deletion-timestamp": "2021-11-25 20:58:08 +0300 +03"}

How can we reproduce it?

Firstly, you should create a MR in provider-jet-azure. After the external resource provisioned and the state of MR is ready, you should delete the MR. Then you can check the provicer-jet-azure logs and see the frequent reconciliations.

What environment did it happen in?

Crossplane version: crossplane-runtime v0.15.1-0.20211004150827-579c1833b513

muvaf commented 2 years ago

We're seeing a similar behavior in AWS RDS Instance or EKS clusters as well where deletion takes ~3-10 minutes. In Jet providers, we have callback mechanisms for async calls, so even if we don't requeue after deletion call, we will get an event once it's finished but for native ones, there is no way to know if the deletion is completed unless you run Observe.

I think using RequeueAfter: time.Second * 1 could be a long way, at least saving us from ms-level intervals.

negz commented 2 years ago

I think in the jet providers this could be in part because we're missing a layer of rate limiting. Most Crossplane providers use a rate limiter that applies exponential backoff from 1 to 60 seconds per individual object, and is further constrained by a global token bucket rate limiter that tries to keep overall requeues at 1 per second (allowing bursts of up to 10).

You can see that at https://github.com/crossplane/provider-gcp/blob/cf9303bdf/pkg/controller/database/cloudsql.go#L77

It seems like the jet providers are only using the global rate limiter - not the managed resource scoped one, per:

Note that the global rate limiter is never wrapped with a NewDefaultManagedRateLimiter.

That said, I would expect the global rate limiter alone to level out to reconciling once per second (across all managed resources) with a maximum burst of 10 reconciles per second. Or put otherwise, seeing ~10 reconciles for roughly the first second makes sense to me but after that I'd expect it to happen less frequently.

I would expect that native providers - and jet providers if they started using the DefaultManagedRateLimiter - should be limited to one requeue per second per kind of managed resource.

Note though that I'm in the (slow) process of changing this all per https://github.com/crossplane/crossplane/issues/2595.

ulucinar commented 2 years ago

Then what about consuming crossplane-runtime v1.6 in terrajet-based providers and adding the --max-reconcile-rate command-line option to these providers? We should also consider the same for the https://github.com/crossplane-contrib/provider-jet-template repo.

negz commented 2 years ago

@ulucinar I agree we should update both. https://github.com/crossplane/crossplane-runtime/issues/285 might block that though - I'm currently looking into it.

muvaf commented 2 years ago

@negz How would https://github.com/crossplane/crossplane-runtime/issues/285 block? I don't think Terrajet is affected by it anymore.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 months ago

Crossplane does not currently have enough maintainers to address every issue and pull request. This issue has been automatically marked as stale because it has had no activity in the last 90 days. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.