crossplane-contrib / provider-upjet-gcp

GCP Provider for Crossplane.
https://marketplace.upbound.io/providers/upbound/provider-family-gcp/
Apache License 2.0

Improve TTR and performance when large number of MRs #427

Closed arfevrier closed 7 months ago

arfevrier commented 11 months ago

What problem are you facing?

Hello, I'm working on a project that requires managing thousands of objects in GCP. I wanted to test Crossplane's behaviour with this number of objects, so I benchmarked the GCP provider: benchmark_gcp_18-10-2023. The project contains installation and configuration code. There is a noticeable performance problem when deploying a large number of resources (more than 1000): deployment can take several hours and is resource-hungry in terms of CPU usage.

Details of the infrastructure:

On the run I have called Run N°1 (4000 RecordSets: create then delete), the deployment of 4000 RecordSets took 1 hour 30 minutes. As in many of my runs, the provider never consumes more than 10 vCPU; there seems to be a software limit. I should point out that the provider pod has no limit or request defined. The behavior appears both in a GKE Standard cluster and in a K3s cluster (Run N°2).

Top diagram: CPU usage of the worker node. Bottom diagram: number of API calls received by GCP for the DNS API endpoints.

After deployment completion, CPU consumption is still at 60% (= 10 vCPU) for 6 API calls/s.

On the run I have called Run N°2 (9500 RecordSets: create only), the deployment of 9500 RecordSets took 4 hours to complete. Here I'm using a K3s cluster hosted on an external public cloud provider. The behavior is the same: locked at 10 vCPU usage for 6 req/s to GCP.
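For reference, a load like the one in these runs can be generated with a small script. The sketch below writes N RecordSet manifests; the zone name, record names and IP are hypothetical placeholders, and the exact `forProvider` schema should be checked against the provider's CRDs:

```shell
#!/bin/sh
# Sketch: generate N RecordSet manifests similar to the benchmark load.
N="${N:-5}"
OUT="${OUT:-recordsets.yaml}"
: > "$OUT"

i=1
while [ "$i" -le "$N" ]; do
cat >> "$OUT" <<EOF
---
apiVersion: dns.gcp.upbound.io/v1beta1
kind: RecordSet
metadata:
  name: benchmark-rs-$i
spec:
  forProvider:
    managedZone: benchmark-zone   # hypothetical zone name
    name: "rs-$i.example.com."
    type: A
    ttl: 300
    rrdatas:
      - "10.0.0.1"
EOF
i=$((i + 1))
done

echo "wrote $N manifests to $OUT"
# The resulting file can then be applied at a controlled rate,
# e.g. with kubectl apply -f "$OUT" or fed to kube-burner.
```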

Top diagram: number of API calls received by GCP for the DNS API endpoints. Bottom diagram: CPU usage of the worker node.

In this case, likely due to lack of CPU, the observed time to reconcile (TTR) for MRs reaches 20 minutes.

Issue:

  1. Pod CPU usage never goes higher than 10 vCPU.
  2. CPU consumption is disproportionate to the number of objects to be managed. This makes it difficult to imagine using the provider in a real use case.

How could Official GCP Provider help solve your problem?

The latest version of upjet, v1.0.0, appears to change how the provider architecture uses Terraform. The performance results provided with the new release of the AWS provider (https://github.com/upbound/provider-aws/releases/tag/v0.44.0) are close to the performance that could be obtained with a native Crossplane provider.

If the GCP provider could implement the new version of upjet v1.0.0, I could redo the benchmarks to compare.

jeanduplessis commented 11 months ago

@arfevrier thanks for reporting these findings. As you noted, upjet v1.0.0 brought with it significant performance improvements. We are currently working on implementing it in provider-gcp, see https://github.com/upbound/provider-gcp/pull/424, and expect to see similar levels of performance improvement.

We're validating the resources currently against the new architecture and fixing issues we pick up.

While there's no clear delivery date for it yet, I would expect it would be ready in ~2-3 weeks.

If you're willing to run your benchmark on an early release of the provider we can make some images available to you sooner.

ulucinar commented 11 months ago

Hi @arfevrier, Thank you for conducting these scale tests and reporting back, very much appreciated.

As you and @jeanduplessis mentioned above, we would like to roll out the new upjet runtime architecture to provider-gcp in this PR. The PR is still in-progress as we are validating the provider's MRs under the new architecture but I've given a try to the RecordSet.dns resource using that PR and it seems to be working as expected. I used these manifests to provision a Google Cloud DNS zone and a type A DNS record in that zone with success. I also checked whether there are any alterations in the external-name annotation that I mentioned here and I did not observe any, so looks like the external-name annotation for the RecordSet.dns resource is stable. But please keep in mind that I did these observations under the new upjet architecture. I also tried deleting the DNS record with success and I've also observed the corresponding external resource is successfully deleted (in reference to the Terraform documentation I shared in that comment).

If you'd like to give these resource providers a try, here are the packages built from that PR:

index.docker.io/ulucinar/provider-gcp-storage:v0.39.0-6984467fd36117d2b73e7fe78ae4cadc9e23f236
index.docker.io/ulucinar/provider-family-gcp:v0.39.0-6984467fd36117d2b73e7fe78ae4cadc9e23f236
index.docker.io/ulucinar/provider-gcp-dns:v0.39.0-6984467fd36117d2b73e7fe78ae4cadc9e23f236
index.docker.io/ulucinar/provider-gcp-compute:v0.39.0-6984467fd36117d2b73e7fe78ae4cadc9e23f236
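For anyone else wanting to try these, the packages can be installed like any other Crossplane provider package; a minimal sketch (the metadata.name is arbitrary, and the family package provider-family-gcp should be installed first, as it provides the shared ProviderConfig machinery):

```yaml
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-gcp-dns
spec:
  # digest-tagged dev package from the list above
  package: index.docker.io/ulucinar/provider-gcp-dns:v0.39.0-6984467fd36117d2b73e7fe78ae4cadc9e23f236
```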

If you need any other GCP packages, we can provide them. Here's the command I used to build those packages for your reference:

make SUBPACKAGES="config compute dns storage" XPKG_REG_ORGS=index.docker.io/ulucinar BRANCH_NAME="main" VERSION="v0.39.0-6984467fd36117d2b73e7fe78ae4cadc9e23f236" build.all publish

Please note that this command has been run in the feature branch of the PR and 6984467fd36117d2b73e7fe78ae4cadc9e23f236 is currently the head of the feature branch ulucinar/no-fork-v4.77.0.

arfevrier commented 11 months ago

Hello, thank you for publishing this build! I've run a new benchmark with this version: benchmark_gcp_04-12-2023. I'm not really satisfied with the availability of the cluster provided by GKE. The benchmark results are only moderately interpretable, because the GKE cluster API server can't handle the number of calls and/or the number of Kubernetes objects I'm trying to deploy (50,000 RecordSets). The API server becomes unavailable for several minutes every hour; each spike in the graph corresponds to the API server becoming unavailable. I already had this behaviour when deploying providers with a lot of CRDs. This is GCP specific: the API server is managed by GCP, and we can't customise any of its parameters. I use kube-burner to create 10 GCP RecordSets per second. This creates too many calls for the GCP DNS API, and 50% of the requests the provider makes receive 429 Too Many Requests.

But what we can see on the performance graph:

In terms of memory, the provider-gcp-dns provider uses 4 GB for 50,000 external resources.

Next steps: I'm going to try to do a clean benchmark on a dedicated K8s cluster where I install the API server myself. It would also be interesting to deploy Buckets along with RecordSets to see how provider-family-gcp behaves.

arfevrier commented 11 months ago

Hello @ulucinar, I performed a new benchmark on a working K3s cluster. Performance has been significantly improved. You can find the results here: benchmark_gcp_06-12-2023.

I've detected three problems with the provider:

  1. There is a memory leak in provider-gcp-dns. After multiple retries of the benchmark, the provider consumes more and more RAM (example with three runs of the 10,000 RecordSets benchmark).
  2. I cannot deploy the GCP Bucket external resource. This is the error:
    2023-12-06T13:03:26Z    DEBUG   provider-gcp    Calling the inner handler for Create event.     {"gvk": "storage.gcp.upbound.io/v1beta1, Kind=Bucket", "name": "test-bucket-e48s7z56a81t45", "queueLength": 0}
    2023-12-06T13:03:26Z    DEBUG   provider-gcp    Reconciling     {"controller": "managed/storage.gcp.upbound.io/v1beta1, kind=bucket", "request": {"name":"test-bucket-e48s7z56a81t45"}}
    2023-12-06T13:03:26Z    DEBUG   provider-gcp    Connecting to the service provider      {"uid": "5db7e8d2-a5bd-45ed-ae80-71e29afb8961", "name": "test-bucket-e48s7z56a81t45", "gvk": "storage.gcp.upbound.io/v1beta1, Kind=Bucket"}
    2023-12-06T13:03:26Z    DEBUG   provider-gcp    Calling the inner handler for Update event.     {"gvk": "storage.gcp.upbound.io/v1beta1, Kind=Bucket", "name": "test-bucket-e48s7z56a81t45", "queueLength": 0}
    2023/12/06 13:03:27 [INFO] Authenticating using configured Google JSON 'credentials'...
    2023/12/06 13:03:27 [INFO]   -- Scopes: [https://www.googleapis.com/auth/cloud-platform https://www.googleapis.com/auth/userinfo.email]
    2023/12/06 13:03:27 [INFO] Authenticating using configured Google JSON 'credentials'...
    2023/12/06 13:03:27 [INFO]   -- Scopes: [https://www.googleapis.com/auth/cloud-platform https://www.googleapis.com/auth/userinfo.email]
    2023/12/06 13:03:27 [DEBUG] Waiting for state to become: [success]
    2023/12/06 13:03:27 [INFO] Terraform is using this identity: test-terraform@sbx-31371-hxctg7ma6kten29zme1l.iam.gserviceaccount.com
    2023-12-06T13:03:27Z    DEBUG   provider-gcp    Instance state not found in cache, reconstructing...    {"uid": "5db7e8d2-a5bd-45ed-ae80-71e29afb8961", "name": "test-bucket-e48s7z56a81t45", "gvk": "storage.gcp.upbound.io/v1beta1, Kind=Bucket"}
    2023-12-06T13:03:27Z    DEBUG   provider-gcp    Observing the external resource {"uid": "5db7e8d2-a5bd-45ed-ae80-71e29afb8961", "name": "test-bucket-e48s7z56a81t45", "gvk": "storage.gcp.upbound.io/v1beta1, Kind=Bucket"}
    2023/12/06 13:03:27 [DEBUG] Waiting for state to become: [success]
    2023/12/06 13:03:27 [INFO] Instantiating Google Storage client for path https://storage.googleapis.com/storage/v1/
    2023/12/06 13:03:27 [DEBUG] Retry Transport: starting RoundTrip retry loop
    2023/12/06 13:03:27 [DEBUG] Retry Transport: request attempt 0
    2023/12/06 13:03:27 [DEBUG] Retry Transport: Stopping retries, last request failed with non-retryable error: googleapi: got HTTP response code 404 with body: HTTP/2.0 404 Not Found
    [...]
    {"error":{"code":404,"message":"The specified bucket does not exist.","errors":[{"message":"The specified bucket does not exist.","domain":"global","reason":"notFound"}]}}
    2023/12/06 13:03:27 [DEBUG] Retry Transport: Returning after 1 attempts
    2023/12/06 13:03:27 [DEBUG] Dismissed an error as retryable. Retry 404s for bucket read - googleapi: Error 404: The specified bucket does not exist., notFound
    2023/12/06 13:03:27 [TRACE] Waiting 500ms before next try
    2023/12/06 13:03:28 [INFO] Instantiating Google Storage client for path https://storage.googleapis.com/storage/v1/
    2023/12/06 13:03:28 [DEBUG] Retry Transport: starting RoundTrip retry loop
    2023/12/06 13:03:28 [DEBUG] Retry Transport: request attempt 0
    2023/12/06 13:03:28 [DEBUG] Retry Transport: Stopping retries, last request failed with non-retryable error: googleapi: got HTTP response code 404 with body: HTTP/2.0 404 Not Found
    [...]
    {"error":{"code":404,"message":"The specified bucket does not exist.","errors":[{"message":"The specified bucket does not exist.","domain":"global","reason":"notFound"}]}}

This is the deployed resource:

apiVersion: storage.gcp.upbound.io/v1beta1
kind: Bucket
metadata:
  name: test-bucket-e48s7z56a81t45
  labels:
    source: kube-burner
spec:
  forProvider:
    forceDestroy: true
    location: EU
    publicAccessPrevention: enforced
    uniformBucketLevelAccess: true
  3. I have another issue with orphaned external resources. I think this is linked to issue upjet#304. During the external resource creation of the GCP RecordSet benchmarkthree-aruma3-2075, the creation of the external resource was requested, but in "Async create ended" the tfID is empty. Afterwards, the provider tries in a loop to create a resource which already exists. In benchmarkthree-aruma2-2075 this tfID contains the correct ID.

    Why is the tfID empty?
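For such orphaned externals, one possible workaround (a sketch; the resource name is taken from the benchmark above, and the external ID is a placeholder) is to inspect and, if needed, set the `crossplane.io/external-name` annotation so the provider adopts the already-created resource instead of re-creating it:

```shell
# Inspect the external-name annotation recorded on the stuck RecordSet MR
kubectl get recordset.dns.gcp.upbound.io benchmarkthree-aruma3-2075 \
  -o jsonpath='{.metadata.annotations.crossplane\.io/external-name}'

# If it is empty, setting it to the ID of the already-created external
# resource lets the provider observe and adopt the existing resource
kubectl annotate recordset.dns.gcp.upbound.io benchmarkthree-aruma3-2075 \
  crossplane.io/external-name='<external-id>' --overwrite
```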

Thank you for your feedback :)

github-actions[bot] commented 7 months ago

This provider repo does not have enough maintainers to address every issue. Since there has been no activity in the last 90 days it is now marked as stale. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

github-actions[bot] commented 7 months ago

This issue is being closed since there has been no activity for 14 days since marking it as stale. If you still need help, feel free to comment or reopen the issue!