crossplane / crossplane

The Cloud Native Control Plane
https://crossplane.io
Apache License 2.0

Installing packages with many CRDs causes reconciler to exceed context deadline #2564

Closed turkenh closed 2 years ago

turkenh commented 2 years ago

What happened?

Cannot install a provider package containing 750+ CRDs.

I am trying to install provider-tf-aws built from this commit, but the package is never installed.

I am observing context deadline errors in providerrevisions.pkg.crossplane.io events:

Spec:
  Desired State:                  Active
  Ignore Crossplane Constraints:  false
  Image:                          turkenh/provider-tf-aws:daf1e9f7-2
  Package Pull Policy:            IfNotPresent
  Revision:                       1
  Skip Dependency Resolution:     false
Status:
  Conditions:
    Last Transition Time:  2021-09-10T09:14:52Z
    Reason:                UnhealthyPackageRevision
    Status:                False
    Type:                  Healthy
  Controller Ref:
    Name:
Events:
  Type     Reason             Age                  From                                         Message
  ----     ------             ----                 ----                                         -------
  Warning  SyncPackage        76s                  packages/providerrevision.pkg.crossplane.io  cannot establish control of object: context deadline exceeded
  Normal   BindClusterRole    16s (x4 over 2m16s)  rbac/providerrevision.pkg.crossplane.io      Bound system ClusterRole to provider ServiceAccount(s)
  Normal   ApplyClusterRoles  16s (x3 over 2m15s)  rbac/providerrevision.pkg.crossplane.io      Applied RBAC ClusterRoles
  Warning  SyncPackage        16s                  packages/providerrevision.pkg.crossplane.io  cannot establish control of object: rate: Wait(n=1) would exceed context deadline

How can we reproduce it?

kubectl crossplane install provider turkenh/provider-tf-aws:daf1e9f7-2

What environment did it happen in?

Crossplane version: v1.3.1
Kubernetes Cluster: Kind - v1.21.1

Possibly related: https://github.com/crossplane-contrib/terrajet/issues/47

turkenh commented 2 years ago

I've updated the issue description, as I originally tried to install the provider with the controller image 🤦

jbw976 commented 2 years ago

This is still occurring for some folks (see #3061) - we should reconsider the hard-coded timeout value and expose it as a configurable value.

negz commented 2 years ago

Let's make the context deadline configurable via a flag and also bump the default up a little - maybe to 5 mins for now? Ideally we'd avoid having configurable deadlines for only a subset of reconcilers, so maybe we can set one value for all of them? That said, I could be convinced that package revisions are a special case.

We might also consider bumping the default Crossplane deployment's CPU resources up - we currently request 150m CPU by default (a small portion of one core), which is quite minuscule.

jbw976 commented 2 years ago

@haarchri brought this up today in the community meeting, as he and others in the community are seeing similar behavior in #2849 and feeling the pain around it.

negz commented 2 years ago

Thanks @haarchri and @jbw976! This is going to be a priority ASAP.

negz commented 2 years ago

I just chatted with @avalanche123, who is going to take a shot at this! Per my comment at https://github.com/crossplane/crossplane/issues/2564#issuecomment-1165007894 the quick fix here is likely going to be to make this deadline configurable. That said, I think it would be interesting to check in with @hasheddan to see whether he can think of any low-hanging fruit to speed up the reconcile process - perhaps parallelizing/batching CRD installs?

negz commented 2 years ago

We've had 1-2 reports of this still happening, so I'm reopening to dig into that.

negz commented 2 years ago

Specifically seeing this now (I'm using upbound/crossplane:v1.9.0-up.2, which should have https://github.com/crossplane/crossplane/pull/3176):

0s          Warning   SyncPackage               providerrevision/crossplane-provider-jet-azure-b7b0db0f74e0   cannot establish control of object: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline

negz commented 2 years ago

Just confirmed I'm seeing the same with upbound/crossplane:v1.9.0-up.2 too. Both tests using kind v1.24.0 clusters.

negz commented 2 years ago

And yet crossplane/crossplane:v1.10.0-rc.0.31.g781e909f works as I would expect. 🤔 I guess perhaps this issue is somehow isolated to UXP. Will close and move discussion there.

haarchri commented 2 years ago

@negz we see the warning in 1.9.0 as well; after a few loops it works without any manual intervention

Mitsuwa commented 1 year ago

I seem to be running into this again trying to do a fresh install of:

  Image:                          xpkg.upbound.io/upbound/provider-aws:v0.35.0
  Normal   BindClusterRole    21m (x46 over 20h)     rbac/providerrevision.pkg.crossplane.io      Bound system ClusterRole to provider ServiceAccount(s)
  Normal   ApplyClusterRoles  17m (x68 over 20h)     rbac/providerrevision.pkg.crossplane.io      Applied RBAC ClusterRoles
  Warning  SyncPackage        2m23s (x380 over 19h)  packages/providerrevision.pkg.crossplane.io  cannot establish control of object: Post "https://10.131.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions?dryRun=All": context deadline exceeded