gravufo opened this issue 4 months ago
More information: we experimented with `--max-reconcile-rate`, but ultimately could not find a value that worked properly. Setting it too low makes it inherently impossible to sync all objects (given the sheer number of resources it has to sync), while setting it too high just makes all resources fail faster.

The `context deadline exceeded` error we would see in the logs is related to `reconcileTimeout` and `reconcileGracePeriod`, as can be seen here: https://github.com/crossplane/crossplane-runtime/blob/1e7193e9c065f7f5ceef465a824e111174464687/pkg/reconciler/managed/reconciler.go#L47C2-L47C40
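For context, here is a minimal sketch of how that hard deadline interacts with SDK-level retries. This is not the verbatim crossplane-runtime code, and the constants reflect my reading of the linked file at that commit; treat them as illustrative:

```go
package sketch

import (
	"context"
	"time"
)

// Values as I read them in the linked reconciler.go; illustrative only.
const (
	reconcileGracePeriod = 30 * time.Second
	reconcileTimeout     = 1 * time.Minute
)

// reconcile binds the whole reconcile pass to a hard deadline. If the Azure
// SDK sleeps on a 429 Retry-After inside observe and the sleep outlives the
// deadline, observe returns context.DeadlineExceeded -- the exact error we
// see in the provider logs.
func reconcile(ctx context.Context, observe func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, reconcileTimeout+reconcileGracePeriod)
	defer cancel()
	return observe(ctx)
}
```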
The `azurerm` Terraform provider uses the official Azure SDK for Go, which handles API rate limiting by respecting the `Retry-After` header on 429 responses and retrying the call after the specified delay. This seems to push the reconcile past the hardcoded limits set by `reconcileTimeout` and `reconcileGracePeriod`, so a `context deadline exceeded` error bubbles up and flips the `Synced` condition to false.

On our side, we have created our own custom provider using the Azure SDK for Go directly and have implemented an optimisation (spec hash + saving the last external reconcile time) to cut external calls to a strict minimum, reducing the chance of hitting external rate limiting; a rough sketch of the idea follows.
We can see the results here:

MR `Synced` states (left: our custom provider; right: `provider-upjet-azure`):
We can see that the new provider reconciles everything from scratch quite fast, whereas the upjet provider drops the `Synced` state quite fast on a pod restart and then struggles to recover.
The first screenshot below is the work queue depth of `provider-upjet-azure`, and the second is the same for our custom provider. Our custom provider gets through its queue quite fast, while the upjet provider seems to constantly struggle to drain its queue, especially after the pod restart (around 12).
Overall, I think the main point here is to figure out how external rate limiting is handled in this provider and/or upjet, and whether there's a better way of handling it, e.g. shaping the request rate client-side as sketched below.
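One possible direction (a sketch of the idea, not an existing upjet feature): gate all external Azure calls behind a shared client-side limiter so the provider backs off before Azure starts returning 429s, rather than letting the SDK sleep on `Retry-After` inside a deadline-bound reconcile:

```go
package sketch

import (
	"context"

	"golang.org/x/time/rate"
)

// Shared by every controller in the provider process; the numbers are
// illustrative, not tuned values.
var azureLimiter = rate.NewLimiter(rate.Limit(10), 20) // ~10 req/s, burst 20

// callAzure gates an external call behind the limiter. Wait fails fast if
// the context's deadline would pass before a token becomes available, so the
// reconcile can requeue cleanly instead of timing out mid-call.
func callAzure(ctx context.Context, do func(context.Context) error) error {
	if err := azureLimiter.Wait(ctx); err != nil {
		return err
	}
	return do(ctx)
}
```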
Hope this helps pinpoint the issue a little more.
Related crossplane-runtime issue: https://github.com/crossplane/crossplane-runtime/issues/696
Thanks again for all this data and insight @gravufo! 🙇‍♂️
### Is there an existing issue for this?

### Affected Resource(s)

### Resource MRs required to reproduce the bug

### Steps to Reproduce
Apply >1000 `UserAssignedIdentities` in Observe mode and let them get synced and ready using version v0.42.0. Then, upgrade the provider to v1.0.0 (or later) and watch the objects start becoming unsynced.

### What happened?
We are getting a lot of `context deadline exceeded` errors. We can also see the `Synced` state of the objects dropping heavily and not being able to recover.

Note that the `FederatedIdentityCredentials` also seem to be affected. We did not see this behavior at small scale (<10 objects), but we see it consistently when the scale is in the thousands.
### Relevant Error Output Snippet

No response

### Crossplane Version

v1.15.2

### Provider Version

v1.1.0

### Kubernetes Version

v1.28.5

### Kubernetes Distribution

AKS
### Additional Info

I had created a thread in Slack here: https://crossplane.slack.com/archives/C019VE11LJJ/p1711905230102149 (it may disappear due to the workspace's retention policy).