crossplane / terrajet

Generate Crossplane Providers from any Terraform Provider
https://crossplane.io
Apache License 2.0
289 stars 38 forks source link

Terrajet providers take a long time on cluster with many resources #300

Open Kasama opened 1 year ago

Kasama commented 1 year ago

Hello team!

Me and my team are developing a solution for developers to be able to manage infrastructure resources through an API. This solution relies on crossplane to manage cloud resources as kubernetes CRs.

We've started using terrajet for AWS, but changed to the native provider because of this issue. New we also need to manage other resources for which there are no native providers (namely vault, boundary and datadog) and it would be great to use terrajet.

Existing Context

initial slack thread with problem description

What happened?

The expected behaviour is that terrajet resources time-to-readiness wouldn't depend on the amount of resources that already exist in the cluster.

In reality, the terrajet controller takes roughly 3 * (number of existing resources) * (time it takes for one resource to be reconciled) seconds to consider a new resource ready.

The examples in this issue are using provider-jet-aws, but all terrajet providers have the same behaviour, from what I can see.

Below are the debug logs related to one recently created policy (terrajetloadtestslow). It was created on a provider that is already managing 200 other aws policies.

crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630252994777613e+09       DEBUG   provider-jet-aws        Reconciling     {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630253028353422e+09       DEBUG   provider-jet-aws        refresh ended   {"workspace": "/tmp/636bbd3e-94bb-40ae-acde-f866e095b0e6", "out": "{\"@level\":\"info\",\"@message\":\"Terraform 1.0.5\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:20.558331Z\",\"terraform\":\"1.0.5\",\"type\":\"version\",\"ui\":\"0.1.0\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refreshing state...\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.822179Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"id_key\":\"id\"},\"type\":\"refresh_start\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refresh complete\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.822992Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null}},\"type\":\"refresh_complete\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Drift detected (delete)\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.823374Z\",\"change\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"action\":\"delete\"},\"type\":\"resource_drift\"}\n{\"@level\":\"info\",\"@message\":\"Plan: 0 to add, 0 to change, 0 to destroy.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.823487Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"plan\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Apply complete! Resources: 0 added, 0 changed, 0 destroyed.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.829034Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"apply\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Outputs: 0\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.829081Z\",\"outputs\":{},\"type\":\"outputs\"}\n"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630253059385006e+09       DEBUG   provider-jet-aws        apply ended     {"workspace": "/tmp/636bbd3e-94bb-40ae-acde-f866e095b0e6", "out": "{\"@level\":\"info\",\"@message\":\"Terraform 1.0.5\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:22.897931Z\",\"terraform\":\"1.0.5\",\"type\":\"version\",\"ui\":\"0.1.0\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Plan to create\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.127798Z\",\"change\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"action\":\"create\"},\"type\":\"planned_change\"}\n{\"@level\":\"info\",\"@message\":\"Plan: 1 to add, 0 to change, 0 to destroy.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.127975Z\",\"changes\":{\"add\":1,\"change\":0,\"remove\":0,\"operation\":\"plan\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Creating...\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.630013Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"action\":\"create\"},\"type\":\"apply_start\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Creation complete after 0s [id=arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow]\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.924980Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"action\":\"create\",\"id_key\":\"id\",\"id_value\":\"arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow\",\"elapsed_seconds\":0},\"type\":\"apply_complete\"}\n{\"@level\":\"info\",\"@message\":\"Apply complete! Resources: 1 added, 0 changed, 0 destroyed.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.932091Z\",\"changes\":{\"add\":1,\"change\":0,\"remove\":0,\"operation\":\"apply\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Outputs: 0\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:28:25.932195Z\",\"outputs\":{},\"type\":\"outputs\"}\n"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.663025305947192e+09        DEBUG   provider-jet-aws        Successfully requested creation of external resource {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow", "uid": "636bbd3e-94bb-40ae-acde-f866e095b0e6", "version": "573953565", "external-name": "", "external-name": "arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630253059475644e+09       DEBUG   events  Normal  {"object": {"kind":"Policy","name":"terrajetloadtestslow","uid":"636bbd3e-94bb-40ae-acde-f866e095b0e6","apiVersion":"iam.aws.jet.crossplane.io/v1alpha2","resourceVersion":"573965831"}, "reason": "CreatedExternalResource", "message": "Successfully requested creation of external resource"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.663025826864625e+09        DEBUG   provider-jet-aws        Reconciling     {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630258282078466e+09       DEBUG   provider-jet-aws        refresh ended   {"workspace": "/tmp/636bbd3e-94bb-40ae-acde-f866e095b0e6", "out": "{\"@level\":\"info\",\"@message\":\"Terraform 1.0.5\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:06.895864Z\",\"terraform\":\"1.0.5\",\"type\":\"version\",\"ui\":\"0.1.0\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refreshing state... [id=arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow]\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:08.112383Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"id_key\":\"id\",\"id_value\":\"arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow\"},\"type\":\"refresh_start\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refresh complete [id=arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow]\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:08.197666Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"id_key\":\"id\",\"id_value\":\"arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow\"},\"type\":\"refresh_complete\"}\n{\"@level\":\"info\",\"@message\":\"Plan: 0 to add, 0 to change, 0 to destroy.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:08.198356Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"plan\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Apply complete! Resources: 0 added, 0 changed, 0 destroyed.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:08.203956Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"apply\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Outputs: 0\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:37:08.203989Z\",\"outputs\":{},\"type\":\"outputs\"}\n"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630258282080562e+09       DEBUG   provider-jet-aws        External resource is up to date {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow", "uid": "636bbd3e-94bb-40ae-acde-f866e095b0e6", "version": "573965832", "external-name": "arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow", "requeue-after": 1663025888.2080543}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630263118990378e+09       DEBUG   provider-jet-aws        Reconciling     {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630263132372632e+09       DEBUG   provider-jet-aws        refresh ended   {"workspace": "/tmp/636bbd3e-94bb-40ae-acde-f866e095b0e6", "out": "{\"@level\":\"info\",\"@message\":\"Terraform 1.0.5\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:11.952075Z\",\"terraform\":\"1.0.5\",\"type\":\"version\",\"ui\":\"0.1.0\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refreshing state... [id=arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow]\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:13.135934Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"id_key\":\"id\",\"id_value\":\"arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow\"},\"type\":\"refresh_start\"}\n{\"@level\":\"info\",\"@message\":\"aws_iam_policy.terrajetloadtestslow: Refresh complete [id=arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow]\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:13.226661Z\",\"hook\":{\"resource\":{\"addr\":\"aws_iam_policy.terrajetloadtestslow\",\"module\":\"\",\"resource\":\"aws_iam_policy.terrajetloadtestslow\",\"implied_provider\":\"aws\",\"resource_type\":\"aws_iam_policy\",\"resource_name\":\"terrajetloadtestslow\",\"resource_key\":null},\"id_key\":\"id\",\"id_value\":\"arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow\"},\"type\":\"refresh_complete\"}\n{\"@level\":\"info\",\"@message\":\"Plan: 0 to add, 0 to change, 0 to destroy.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:13.227599Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"plan\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Apply complete! Resources: 0 added, 0 changed, 0 destroyed.\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:13.232695Z\",\"changes\":{\"add\":0,\"change\":0,\"remove\":0,\"operation\":\"apply\"},\"type\":\"change_summary\"}\n{\"@level\":\"info\",\"@message\":\"Outputs: 0\",\"@module\":\"terraform.ui\",\"@timestamp\":\"2022-09-12T23:45:13.232734Z\",\"outputs\":{},\"type\":\"outputs\"}\n"}
crossplane-provider-jet-aws-599c4ea494ca-55b9dd885f-zr2pb provider-jet-aws 1.6630263132466455e+09       DEBUG   provider-jet-aws        External resource is up to date {"controller": "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "request": "/terrajetloadtestslow", "uid": "636bbd3e-94bb-40ae-acde-f866e095b0e6", "version": "573976982", "external-name": "arn:aws:iam::ACCOUNT_NUMBER:policy/terrajetloadtestslow", "requeue-after": 1663026373.2466433}

From these logs we can also see the time it took for the resource to be Ready. From 23:28:19 to 23:45:13, ~17 minutes. (using GMT time to make it easier)

It should also be noted that the resource is created in the cloud provider at 23:28:25 (12 seconds), when the logs say Successfully requested creation of external resource, but it takes a long time until the resource is marked as Ready in kubernetes

The reason why it takes so long seems to be that resource change events are not given any priority over the normal reconciliation loop. The controller stays in an endless loop reconciling every resource it knows about, which is necessary to override external changes in the providers, but any resource that is added or changed in kubernetes doesn't get immediately picked up. Rather it seems to be added to the queue and only seen after all existing reconcile loops are done, then the resource is picked up, and a creation flow starts, but the result will only be seen on the next reconciliation loop. Once that comes, terrajet will pick up the resource's status from the terraform state and requeue once more, only after this last requeue the status will be updated to Ready.

It's also worth noting that changing the MaxConcurrentReconciles configuration to something bigger, like 5 or 10 reduces the time by that factor, but the underlying behavior persists and all resources are reconciled before considering new resource or updates Ready.

What I'm having trouble understanding is how this terrajet workflow differs from the native provider-aws one, as both seem to use the same Reconcile function from crossplane-runtime.

Also, the UseAsync configuration for a resource doesn't seem to change anything. I've tried both for policies.iam and for repositories.ecr with the same behavior.


How can we reproduce it?

The last step will take a long time, which is the problem this bug report is about.

Open collapsible for reproducible commands ```bash # Create kind cluster kind create cluster --name terrajet-load --image kindest/node:v1.23.10 # Install crossplane with helm kubectl create namespace crossplane-system helm repo add crossplane-stable https://charts.crossplane.io/stable helm repo update helm upgrade --install crossplane --namespace crossplane-system crossplane-stable/crossplane # Install AWS terrajet provider kubectl apply -f - <

Possibly related discussions


Please let me know if I can help figure out what's happening.

Kasama commented 1 year ago

Hello team. I want to update this issue with my latest findings.

After a while debugging and understanding a bit better how everything works, I figured the main issue seems to be that terrajet providers, in general, don't work fast enough to be able to consume the controller's workqueue faster than it's produced. Which causes the delays we're seeing as, for any change, the controller needs to go through all the workqueue before it can get to the recent change.

I've setup two side-by-side AWS providers, one using terrajet and another using provider-aws. Then I've setup a prometheus and grafana to collect the metrics exposed by the controller manager.

Open for more details on the experiment setup ### Experiment environment - I ran the experiments in my local machine with a 12 threads CPU (Ryzen 5600X) and 16GB of ram. None were topped during the tests. - I created a kind cluster to run the experiments on (similar to the main message setup). - I ran each provider directly on my machine (from the `./_output/bin/linux_amd64/provider` build artifact) - to allow for both providers to run at the same time, I've added the `MetricsBindAddress` option to the controller manager so both could expose metrics. I've used port `8080` (default) for the jet provider and `8081` for the native one - Unless specified, the providers were executed without additional flags or code changes - Relevant environment variables for jet provider: - `TF_PLUGIN_CACHE_DIR=/tmp/terraform/.terraform` to improve provider download performance - `TERRAFORM_VERSION=1.0.5` - `TERRAFORM_PROVIDER_SOURCE=hashicorp/aws` - `TERRAFORM_PROVIDER_VERSION=3.56.0` ### Metrics Setup The grafana dashboard was created pointing to the prometheus instance
Configuration files Prometheus.yaml ```yaml # my global config global: scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute. evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute. external_labels: monitor: 'monitor' rule_files: [] scrape_configs: - job_name: 'prometheus' static_configs: - targets: ['localhost:9090'] - job_name: 'jet-provider' static_configs: - targets: ['localhost:8080'] - job_name: 'native-provider' static_configs: - targets: ['localhost:8081'] ``` Grafana dashboard: ```json { "annotations": { "list": [ { "builtIn": 1, "datasource": { "type": "grafana", "uid": "-- Grafana --" }, "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "target": { "limit": 100, "matchAny": false, "tags": [], "type": "dashboard" }, "type": "dashboard" }, { "datasource": { "type": "datasource", "uid": "grafana" }, "enable": true, "iconColor": "blue", "name": "Annotations", "target": { "limit": 1000, "refId": "Anno", "tags": [ "annotation" ], "type": "tags" } } ] }, "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 0, "id": 2, "links": [], "liveNow": false, "panels": [ { "collapsed": false, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 }, "id": 10, "panels": [], "title": "Controller", "type": "row" }, { "datasource": { "type": "prometheus", "uid": "cdbCIg4Vk" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "s" }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 0, "y": 1 }, "id": 6, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "single", "sort": "none" } }, "targets": [ { "datasource": { "type": "prometheus", "uid": "cdbCIg4Vk" }, "editorMode": "builder", "expr": "rate(controller_runtime_reconcile_time_seconds_sum{controller=~\"$controllers\"}[$__rate_interval])", "hide": true, "legendFormat": "__auto", "range": true, "refId": "A" }, { "datasource": { "type": "prometheus", "uid": "cdbCIg4Vk" }, "editorMode": "builder", "expr": "controller_runtime_max_concurrent_reconciles{controller=~\"$controllers\"}", "hide": true, "legendFormat": "__auto", "range": true, "refId": "B" }, { "datasource": { "name": "Expression", "type": "__expr__", "uid": "__expr__" }, "expression": "$A/$B", "hide": false, "refId": "C", "type": "math" } ], "title": "Controller Reconcile Time", "type": "timeseries" }, { "datasource": { "type": "prometheus", "uid": "cdbCIg4Vk" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "max": 10, "min": -1, "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 12, "y": 1 }, "id": 12, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "single", "sort": "none" } }, "targets": [ { "datasource": { "type": "prometheus", "uid": "cdbCIg4Vk" }, "editorMode": "builder", "expr": "controller_runtime_reconcile_errors_total{controller=~\"$controllers\"}", "legendFormat": "__auto", "range": true, "refId": "A" } ], "title": "Reconcile Errors", "type": "timeseries" }, { "datasource": { "type": "prometheus", "uid": "cdbCIg4Vk" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "s" }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 0, "y": 9 }, "id": 16, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "single", "sort": "none" } }, "targets": [ { "datasource": { "type": "prometheus", "uid": "cdbCIg4Vk" }, "editorMode": "builder", "expr": "histogram_quantile(0.95, sum by(le, controller) (rate(controller_runtime_reconcile_time_seconds_bucket{controller=~\"$controllers\"}[$__rate_interval])))", "hide": true, "legendFormat": "__auto", "range": true, "refId": "A" }, { "datasource": { "type": "prometheus", "uid": "cdbCIg4Vk" }, "editorMode": "builder", "expr": "controller_runtime_max_concurrent_reconciles{controller=~\"$controllers\"}", "hide": true, "legendFormat": "__auto", "range": true, "refId": "B" }, { "datasource": { "name": "Expression", "type": "__expr__", "uid": "__expr__" }, "expression": "$A/$B", "hide": false, "refId": "C", "type": "math" } ], "title": "95th percentile reconcile time", "type": "timeseries" }, { "gridPos": { "h": 1, "w": 24, "x": 0, "y": 17 }, "id": 8, "title": "Queue", "type": "row" }, { "datasource": { "type": "prometheus", "uid": "cdbCIg4Vk" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "ms" }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 0, "y": 18 }, "id": 4, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "single", "sort": "none" } }, "targets": [ { "datasource": { "type": "prometheus", "uid": "cdbCIg4Vk" }, "editorMode": "builder", "expr": "rate(workqueue_queue_duration_seconds_sum{name=~\"$controllers\"}[$__rate_interval])", "legendFormat": "__auto", "range": true, "refId": "A" } ], "title": "Workqueue queue duration", "type": "timeseries" }, { "datasource": { "type": "prometheus", "uid": "cdbCIg4Vk" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 12, "y": 18 }, "id": 2, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "single", "sort": "none" } }, "targets": [ { "datasource": { "type": "prometheus", "uid": "cdbCIg4Vk" }, "editorMode": "builder", "expr": "workqueue_depth{name=~\"$controllers\"}", "legendFormat": "__auto", "range": true, "refId": "A" } ], "title": "Workqueue depth", "type": "timeseries" }, { "datasource": { "type": "prometheus", "uid": "cdbCIg4Vk" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 0, "y": 26 }, "id": 14, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "single", "sort": "none" } }, "targets": [ { "datasource": { "type": "prometheus", "uid": "cdbCIg4Vk" }, "editorMode": "builder", "expr": "rate(workqueue_adds_total{name=~\"$controllers\"}[$__rate_interval])", "legendFormat": "__auto", "range": true, "refId": "A" } ], "title": "Rate of Queue Adds", "type": "timeseries" } ], "refresh": false, "schemaVersion": 37, "style": "dark", "tags": [], "templating": { "list": [ { "allValue": "managed/.*", "current": { "selected": false, "text": [ "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "managed/policy.iam.aws.crossplane.io" ], "value": [ "managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy", "managed/policy.iam.aws.crossplane.io" ] }, "datasource": { "type": "prometheus", "uid": "cdbCIg4Vk" }, "definition": "controller_runtime_active_workers", "description": "Controllers to watch. Regex", "hide": 0, "includeAll": true, "label": "controller", "multi": true, "name": "controllers", "options": [], "query": { "query": "controller_runtime_active_workers", "refId": "StandardVariableQuery" }, "refresh": 1, "regex": ".*controller=\"(managed/.*?)\".*", "skipUrlSync": false, "sort": 0, "type": "query" } ] }, "time": { "from": "2022-10-03T04:52:47.659Z", "to": "2022-10-03T05:09:38.612Z" }, "timepicker": {}, "timezone": "", "title": "Crossplane Controllers AWS", "uid": "-WIVl0VVk", "version": 2, "weekStart": "" } ```

In the following graphs, the green lines represent the jet provider, while the yellow line represents the native one.

Initially I've created 100 resources managed by each provider (first vertical blue line). There, we can see that the reconciliation time for the terrajet provider is much higher than the native provider. After some time I've changed the jet provider to have a MaxConcurrentReconciles: 10 in the configs (second blue line).

reconciliation time on the first section (after the first vertical blue line), the reconciliation time is about ~14.8s for the jet provider and ~84ms for the native provider.

on the second section (after the second vertical blue line), the reconcilicaiton time is about ~997ms for the jet provider and ~84ms for the native provider. workqueue depth on the first section (after the first vertical blue line), the workqueue depth is about 100 items for the jet provider and 0 items for the native provider.

on the second section (after the second vertical blue line), the workqueue depth stabilizes at about 25 items for the jet provider and 0 for the native provider.


This left the question of "What if we bump the number of MaxConcurrentReconciles further". Bumping it to 50 we see that the reconcile time didn't change, but the number of items in the queue were reduced to 0 on the terrajet provider as well. reconcile time with 50 workers depth with 50 reconcilers


This, however, is not maintained when we create even more resources. When jumping to 500 policies on each provider, we see that the reconcile time is, again, not affected. But the workqueue size is significantly affected and cannot go below 500 for the terrajet provider, while the native provider is momentarily affected by the change, but is able to catch up and clear the queue. reconcile time 500 resources workqueue


To make sure that the reconcile time was the main culprit here, I've also locally changed the Observe method of the native provider to include a time.Sleep(1 * time.Second) and it had the same queue behavior as the terrajet provider. Which reflected in the same long time to reconcile new resources we saw in terrajet-based providers.


I'm curious to see which solutions we can brew up to this.

For now it seems that terrajet-based providers are not suitable for environments that need to manage many resources of the same type.

Kasama commented 1 year ago

I cross-posted this issue in upjet as it will (as far as I understand) replace terrajet in the future.