crossplane-contrib / provider-upjet-azure

Official Azure Provider for Crossplane by Upbound.
Apache License 2.0

[Bug]: UserAssignedIdentities and FederatedIdentityCredentials are not able to sync since v1.0.0 #740

Open gravufo opened 4 months ago

gravufo commented 4 months ago

Is there an existing issue for this?

Affected Resource(s)

Resource MRs required to reproduce the bug

apiVersion: managedidentity.azure.upbound.io/v1beta1
kind: UserAssignedIdentity
metadata:
  annotations:
    crossplane.io/external-name: /subscriptions/<redacted>/resourceGroups/rg-dev/providers/Microsoft.ManagedIdentity/userAssignedIdentities/msi-dev
  name: msi-dev
spec:
  forProvider:
    location: eastus
    name: msi-dev
    resourceGroupName: rg-dev
  managementPolicies:
  - Observe
  providerConfigRef:
    name: default

Steps to Reproduce

Apply more than 1000 UserAssignedIdentities in Observe mode with provider version v0.42.0 and let them become synced and ready. Then upgrade the provider to v1.0.0 (or later) and watch the objects start to lose their Synced state.

What happened?

We are getting a lot of "context deadline exceeded" errors, such as this one: [image]

We can also see the number of objects in the Synced state dropping heavily and not recovering: [image]

Note that the FederatedIdentityCredentials also seem to be affected. We did not see this behavior at a small scale (<10 objects), but we see it consistently when the scale is in the thousands.

Relevant Error Output Snippet

No response

Crossplane Version

v1.15.2

Provider Version

v1.1.0

Kubernetes Version

v1.28.5

Kubernetes Distribution

AKS

Additional Info

I had created a thread in Slack here: https://crossplane.slack.com/archives/C019VE11LJJ/p1711905230102149 It may disappear if there is retention.

gravufo commented 1 month ago

More information:

On our side, we have created our own custom provider that uses the Azure SDK for Go directly. It implements an optimisation (a spec hash plus the saved time of the last external reconcile) that keeps the number of external calls to a strict minimum and so reduces the chance of hitting external rate limiting. We can see the results here:

MR states (left is our custom provider, right is provider-upjet-azure): our custom provider reconciles everything from scratch quite fast, whereas the upjet provider loses the Synced state quickly on a pod restart and then struggles to recover. [image]

The first screenshot below is the work queue depth of provider-upjet-azure and the second is the same metric for our custom provider. Our custom provider gets through its queue quite fast, while the upjet provider constantly has a hard time getting through its queue, especially after the pod restart (around 12). [image] [image]

Overall, I think the main point here would be to figure out how external rate limiting is handled in this provider and/or upjet, and to see whether there is a better way of handling it.
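To make the "spec hash + last external reconcile time" idea more concrete, here is a minimal sketch in Go. It is an illustration only, not the code of our custom provider and not anything that exists in upjet or crossplane-runtime: the Spec and SyncState types, the HashSpec and NeedsExternalCall functions, and the one-hour maxAge value are all hypothetical names and values chosen for the example. The idea is to skip the call to the external API when the desired spec has not changed and the last external check is recent enough.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

// Spec is a hypothetical stand-in for the spec.forProvider of a managed resource.
type Spec struct {
	Name              string `json:"name"`
	Location          string `json:"location"`
	ResourceGroupName string `json:"resourceGroupName"`
}

// SyncState is hypothetical bookkeeping that a provider could persist,
// for example in the resource status or in annotations.
type SyncState struct {
	SpecHash          string    // hash of the spec at the last external reconcile
	LastExternalCheck time.Time // when the external API was last called
}

// HashSpec returns a hex-encoded SHA-256 digest of the JSON form of the spec.
func HashSpec(s Spec) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// NeedsExternalCall reports whether the external API should be called:
// either the spec changed since the last external reconcile, or the last
// external check is at least maxAge old. It also returns the current hash
// so the caller can store it after a successful external reconcile.
func NeedsExternalCall(s Spec, st SyncState, now time.Time, maxAge time.Duration) (bool, string, error) {
	h, err := HashSpec(s)
	if err != nil {
		return false, "", err
	}
	if h != st.SpecHash {
		return true, h, nil // desired state changed
	}
	if now.Sub(st.LastExternalCheck) >= maxAge {
		return true, h, nil // cached observation is too old
	}
	return false, h, nil // unchanged and fresh: skip the external call
}

func main() {
	spec := Spec{Name: "msi-dev", Location: "eastus", ResourceGroupName: "rg-dev"}
	h, err := HashSpec(spec)
	if err != nil {
		panic(err)
	}

	last := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	st := SyncState{SpecHash: h, LastExternalCheck: last}
	maxAge := time.Hour // assumed refresh interval, purely illustrative

	for _, now := range []time.Time{last.Add(10 * time.Minute), last.Add(2 * time.Hour)} {
		call, _, err := NeedsExternalCall(spec, st, now, maxAge)
		if err != nil {
			panic(err)
		}
		fmt.Printf("elapsed=%s callExternal=%v\n", now.Sub(last), call)
	}
}
```

With a check like this, a pod restart that requeues thousands of unchanged objects would only trigger external calls for those whose last check is older than maxAge, which is the effect we are after when trying to stay under external rate limits.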

Hope this helps pinpoint the issue a little more.

jbw976 commented 1 month ago

Related crossplane-runtime issue: https://github.com/crossplane/crossplane-runtime/issues/696

Thanks again for all this data and insight @gravufo! 🙇‍♂️