gravufo opened this issue 4 months ago
More information: we experimented with `--max-reconcile-rate`, but ultimately could not find a value that worked properly. Setting it too low makes it inherently impossible to sync all objects (given the sheer number of resources it has to sync), while setting it too high just makes all resources fail faster.

The `context deadline exceeded` error we would see in the logs is related to `reconcileTimeout` and `reconcileGracePeriod`, as can be seen here: https://github.com/crossplane/crossplane-runtime/blob/1e7193e9c065f7f5ceef465a824e111174464687/pkg/reconciler/managed/reconciler.go#L47C2-L47C40
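For context, here is a minimal sketch of how that hard deadline interacts with SDK-level retries. This is not the verbatim crossplane-runtime code, and the constants reflect my reading of the linked file at that commit; treat them as illustrative:

```go
package sketch

import (
	"context"
	"time"
)

// Values as I read them in the linked reconciler.go; illustrative only.
const (
	reconcileGracePeriod = 30 * time.Second
	reconcileTimeout     = 1 * time.Minute
)

// reconcile binds the whole reconcile pass to a hard deadline. If the Azure
// SDK sleeps on a 429 Retry-After inside observe and the sleep outlives the
// deadline, observe returns context.DeadlineExceeded -- the exact error we
// see in the provider logs.
func reconcile(ctx context.Context, observe func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, reconcileTimeout+reconcileGracePeriod)
	defer cancel()
	return observe(ctx)
}
```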
The `azurerm` Terraform provider uses the official Azure SDK for Go, which handles API rate limiting by respecting the `Retry-After` header on 429 responses and retrying the call after the specified delay. This seems to push the reconcile past the hardcoded limits set by `reconcileTimeout` and `reconcileGracePeriod`, so a `context deadline exceeded` error bubbles up and flips the `Synced` condition to false.

On our side, we have created our own custom provider using the Azure SDK for Go directly and have implemented an optimisation (spec hash + saving the last external reconcile time) to cut external calls to a strict minimum, reducing the chance of hitting external rate limiting; a rough sketch of the idea follows.
We can see the results here:

MR `Synced` states (left: our custom provider; right: `provider-upjet-azure`):
We can see that the new provider reconciles everything from scratch quite fast, whereas the upjet provider drops the `Synced` state quite fast on a pod restart and then struggles to recover.
The first screenshot below is the work queue depth of `provider-upjet-azure`, and the second is the same for our custom provider. Our custom provider gets through its queue quite fast, while the upjet provider seems to constantly struggle to drain its queue, especially after the pod restart (around 12).
Overall, I think the main point here is to figure out how external rate limiting is handled in this provider and/or upjet, and whether there's a better way of handling it, e.g. shaping the request rate client-side as sketched below.
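One possible direction (a sketch of the idea, not an existing upjet feature): gate all external Azure calls behind a shared client-side limiter so the provider backs off before Azure starts returning 429s, rather than letting the SDK sleep on `Retry-After` inside a deadline-bound reconcile:

```go
package sketch

import (
	"context"

	"golang.org/x/time/rate"
)

// Shared by every controller in the provider process; the numbers are
// illustrative, not tuned values.
var azureLimiter = rate.NewLimiter(rate.Limit(10), 20) // ~10 req/s, burst 20

// callAzure gates an external call behind the limiter. Wait fails fast if
// the context's deadline would pass before a token becomes available, so the
// reconcile can requeue cleanly instead of timing out mid-call.
func callAzure(ctx context.Context, do func(context.Context) error) error {
	if err := azureLimiter.Wait(ctx); err != nil {
		return err
	}
	return do(ctx)
}
```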
Hope this helps pinpoint the issue a little more.
Related crossplane-runtime issue: https://github.com/crossplane/crossplane-runtime/issues/696
Thanks again for all this data and insight @gravufo! 🙇‍♂️
### Is there an existing issue for this?

### Affected Resource(s)

### Resource MRs required to reproduce the bug

### Steps to Reproduce
Apply >1000 `UserAssignedIdentities` in Observe mode and let them get synced and ready using version v0.42.0. Then, upgrade the provider to v1.0.0 (or later) and watch the objects start becoming unsynced.

### What happened?
We are getting a lot of `context deadline exceeded` errors. We can also see the `Synced` state of the objects dropping heavily and not being able to recover.

Note that the `FederatedIdentityCredentials` also seem to be affected. We did not see this behavior at small scale (<10 objects), but we see it consistently when the scale is in the thousands.
### Relevant Error Output Snippet

No response

### Crossplane Version

v1.15.2

### Provider Version

v1.1.0

### Kubernetes Version

v1.28.5

### Kubernetes Distribution

AKS
### Additional Info

I had created a thread in Slack here: https://crossplane.slack.com/archives/C019VE11LJJ/p1711905230102149 (it may disappear due to the workspace's retention policy).