crossplane / upjet

A code generation framework and runtime for Crossplane providers
Apache License 2.0

Upjet providers can't consume workqueue fast enough. Causes huge time-to-readiness delay #116

Closed: Kasama closed this issue 1 year ago

Kasama commented 1 year ago

Cross-posting crossplane/terrajet#300 here, because the exact same behavior happens with Upjet, and as far as I understand, Terrajet will be deprecated in favor of Upjet, so it makes sense to track this issue here.

This is especially relevant as Upjet now seems to be the "official" backend for provider implementations.

What happened?

The expected behaviour is that an Upjet resource's time-to-readiness wouldn't depend on the number of resources that already exist in the cluster.

In reality, because each reconciliation calls Terraform (which takes around 1 second in my tests), the provider controller is unable to drain the work queue. Every existing resource is re-enqueued once per poll interval, so once the number of resources exceeds what the workers can reconcile within that interval, the queue never empties. As a result, any new event (such as creating a new resource) takes very long to complete when many other resources already exist, since the controller adds it to the end of the queue.

There are more details in the original bug report.
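To make the effect concrete, here is a rough back-of-envelope sketch. The numbers are illustrative only, reusing the ~1 s-per-Terraform-call observation above and the 1-worker / 1m-poll defaults discussed later in this thread; they are not additional measurements:

```bash
# Illustrative arithmetic: with one worker and ~1s per terraform call, the
# queue grows every poll interval once more than ~60 resources exist.
RESOURCES=100              # managed resources already in the cluster
SECONDS_PER_RECONCILE=1    # rough cost of one terraform CLI invocation
WORKERS=1                  # default maxConcurrentReconciles
POLL_INTERVAL=60           # default pollInterval, in seconds

DRAINED=$(( WORKERS * POLL_INTERVAL / SECONDS_PER_RECONCILE ))
echo "items re-enqueued per poll interval: ${RESOURCES}"
echo "items drained per poll interval:     ${DRAINED}"
echo "net queue growth per poll interval:  $(( RESOURCES - DRAINED ))"
```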


How can we reproduce it?

The reproduction steps are basically the same as in the original issue, just replacing the Terrajet provider with the Upjet provider.

The last step will take a long time, which is the problem this bug report is about.

Open collapsible for reproducible commands

```bash
# Create kind cluster
kind create cluster --name upjet-load --image kindest/node:v1.23.10

# Install crossplane with helm
kubectl create namespace crossplane-system
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update
helm upgrade --install crossplane --namespace crossplane-system crossplane-stable/crossplane

# Install AWS upjet provider
kubectl apply -f - <
```
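As an optional extra step (not part of the original reproduction), the backlog can be observed directly through controller-runtime's workqueue metrics. This assumes the provider pod serves the default Prometheus metrics endpoint on :8080; if metrics are disabled in your build, skip it.

```bash
# Watch the controller workqueue depth via controller-runtime metrics
# (assumes the provider pod serves /metrics on the default :8080 port).
PROVIDER_POD=$(kubectl -n crossplane-system get pods -o name | grep provider-aws | head -n 1)
kubectl -n crossplane-system port-forward "${PROVIDER_POD}" 8080:8080 &
sleep 2
curl -s localhost:8080/metrics | grep '^workqueue_depth'
```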
jeanduplessis commented 1 year ago

Thank you for the detailed report @Kasama, we'll be taking a look at this in our next sprint starting next week.

Kasama commented 1 year ago

Great to hear that! Feel free to reach out either here or on the Crossplane Slack (@roberto.alegro) if I can help with more details or reproduction steps.

muvaf commented 1 year ago

Probably related to https://github.com/upbound/provider-aws/issues/86

turkenh commented 1 year ago

Thanks a lot for your detailed analysis here @Kasama.

I believe a low-hanging fruit here is to set reasonable defaults for the MaxConcurrentReconciles and PollInterval configurations to cover common cases, and then to ensure they are exposed as configuration parameters so that they can be tuned further for specific deployments.
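For illustration, deployment-level tuning could then look roughly like the sketch below. The flag names (--poll, --max-reconcile-rate) and the ControllerConfig wiring are assumptions based on how crossplane-runtime-based providers commonly expose these settings, not something this issue confirms; check your provider's --help before relying on them.

```bash
# Hypothetical sketch: tune poll interval and reconcile concurrency via a
# ControllerConfig, assuming the provider exposes matching CLI flags.
kubectl apply -f - <<EOF
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: tuned-reconcile
spec:
  args:
    - --poll=10m                # assumed flag name; verify with the provider's --help
    - --max-reconcile-rate=10   # assumed flag name; verify with the provider's --help
EOF

# Point the installed Provider (name depends on how it was installed) at it:
kubectl patch provider.pkg.crossplane.io provider-aws --type merge \
  -p '{"spec":{"controllerConfigRef":{"name":"tuned-reconcile"}}}'
```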

muvaf commented 1 year ago

FYI, https://github.com/upbound/upjet/issues/99 is another thing that may cause CPU saturation.

turkenh commented 1 year ago

On a GKE cluster with the e2-standard-4 machine type, I ran the following three experiments:

There are definitely some improvements between Exp#1 and Exp#2, but TBH I am a bit surprised that Exp#3 is not much different from Exp#2. I wonder whether this could be related to the CPU being throttled in both cases (a way to check this is sketched after the experiment results below). I am planning to repeat the two experiments on larger nodes to avoid throttling.

Experiment 1: With maxConcurrentReconciles=1 and pollInterval=1m (Current defaults)

Provisioned 100 ECR Repositories; it took ~19 min until all of them became Ready. Deleting one of them took 9 min until it was processed for the first time and ~18 min until it was deleted.


Experiment 2: With maxConcurrentReconciles=10 and pollInterval=1m (Community defaults)

Provisioned 100 ECR Repositories; it took ~12 min until all of them became Ready. Deleting one of them took ~1 min until it was processed for the first time and ~5 min until it was deleted.


Experiment 3: With maxConcurrentReconciles=10 and pollInterval=10m (Proposed defaults)

Provisioned 100 ECR Repositories; it took ~12 min until all of them became Ready. Deleting one of them took ~1 min until it was processed for the first time and ~4 min until it was deleted.

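Regarding the CPU-throttling suspicion above, one way to check it (an illustrative sketch, not something run as part of these experiments) is to read the CFS throttling counters that the kubelet's cAdvisor endpoint exports for the provider pod:

```bash
# If the container_cpu_cfs_throttled_* counters keep increasing over time,
# the provider container is being CPU-throttled.
NODE=$(kubectl -n crossplane-system get pods -o wide | grep provider-aws | awk '{print $7}' | head -n 1)
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/metrics/cadvisor" \
  | grep container_cpu_cfs_throttled | grep provider-aws
```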

Kasama commented 1 year ago

Yeah, during my testing I walked a similar path: I had changed pollInterval, but assumed it didn't have an impact. Maybe an even bigger interval, like 30m or 1h, would yield different results, but that starts to become a bit unreasonable IMO.

Indeed, there are some improvements when bumping the concurrency, but sadly the core problem remains: the time it takes for new resources to become ready still depends heavily on the number of already existing resources.

turkenh commented 1 year ago

I repeated the last experiment on a bigger node (e2-standard-32) to eliminate the effect of CPU throttling, and this time it looks much better (except for the resource consumption).

Experiment 4: With maxConcurrentReconciles=10 and pollInterval=10m (Proposed defaults) (on e2-standard-32)

Provisioned 100 ECR Repositories; it took ~2 min until all of them became Ready. Deleting one of them took ~5 sec until it was processed for the first time and ~10 sec until it was deleted.


I believe improving resource usage is orthogonal to the settings here, and I feel good about the above defaults while still exposing them as configurable params.

I'll open PRs with proposed defaults.

Kasama commented 1 year ago

I was finally able to do some more tests using a bigger instance (an m6i.8xlarge on AWS in my case, similar to the e2-standard-32 you used), and I can confirm that running on a bigger instance does make the queue clear faster with the new defaults.

But when trying with ~5000 concurrent resources, a similar problem still appeared: the queue held ~700 items at all times. That can again be mitigated by increasing the reconciliation interval, but it would be much better to have a way to scale these controllers horizontally.