I noticed that after setting the crossplane `--max-reconcile-rate=1` container flag, the amount of logs goes back to its original throughput.
Might be related to https://github.com/crossplane/crossplane/issues/2595
Hi, looks like we got this issue too after upgrading to 1.10.1.
Can you explain how exactly you configured `--max-reconcile-rate=1`? By container flag, did you mean applying the following configuration to the Crossplane Deployment resource?
```yaml
name: crossplane-pod
namespace: crossplane
spec:
  containers:
  - image: crossplane/crossplane:v1.10.1
    name: crossplane
    ...
    args:
    - core
    - start
    - --max-reconcile-rate=1
```
Is there any option to configure it via the helm chart?
Thanks :)
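For what it's worth, a minimal sketch of the Helm route, assuming the upstream `crossplane-stable/crossplane` chart (which exposes an `args` value appended to the core container's arguments) and the default release name and namespace; adjust for your install:

```shell
helm upgrade crossplane crossplane-stable/crossplane \
  --namespace crossplane-system \
  --reuse-values \
  --set 'args={--max-reconcile-rate=1}'
```

The same thing can be expressed in a values file as `args: ["--max-reconcile-rate=1"]` if the inline `--set` list syntax gets awkward.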
Hi
We found this issue while investigating a huge increase in our CloudWatch Logs bill, which we tracked back to a recent Crossplane upgrade from 1.8.1 to 1.10.1.
I used CloudWatch Logs Insights to sum Kubernetes audit log traffic for `system:serviceaccount:crossplane-system:crossplane` over a few 1-hour sample periods on the same cluster, with more or less the same number of Crossplane AWS resources defined:
This is a >14x increase!
I have done a bit more digging and found a few key points:
We have 4 clusters running Crossplane, each in a different environment and each with a different number of resources under management (all are AWS, with no XRDs), a spread of 9x from smallest to largest.
However, in all clusters the audit logs have increased to about the same level, which suggests that the audit log size is not proportional to the number of resources under management; it's more or less fixed.
Audit log API calls over a 1-hour sample before the upgrade (v1.8.1):

rank | action | count |
---|---|---|
1 | update apiextensions.k8s.io/v1/customresourcedefinitions | 16480 |
2 | update coordination.k8s.io/v1/leases | 1783 |
3 | get coordination.k8s.io/v1/leases | 1783 |
4 | update pkg.crossplane.io/v1/providerrevisions | 104 |
5 | patch /v1/serviceaccounts | 49 |
6 | patch apps/v1/deployments | 49 |
7 | watch apps/v1/deployments | 18 |
8 | watch pkg.crossplane.io/v1beta1/locks | 18 |
9 | watch apiextensions.crossplane.io/v1/compositeresourcedefinitions | 18 |
10 | watch pkg.crossplane.io/v1alpha1/controllerconfigs | 16 |
11 | watch /v1/serviceaccounts | 16 |
12 | watch pkg.crossplane.io/v1/configurationrevisions | 16 |
13 | watch pkg.crossplane.io/v1/providerrevisions | 16 |
14 | watch pkg.crossplane.io/v1/configurations | 16 |
15 | watch apiextensions.k8s.io/v1/customresourcedefinitions | 14 |
16 | watch pkg.crossplane.io/v1/providers | 14 |
17 | update pkg.crossplane.io/v1beta1/locks | 12 |
18 | patch pkg.crossplane.io/v1/providerrevisions | 3 |
19 | update pkg.crossplane.io/v1/providers | 3 |
20 | delete apps/v1/deployments | 3 |
21 | delete /v1/services | 3 |
22 | delete /v1/serviceaccounts | 3 |
Audit log API calls over a 1-hour sample after the upgrade (v1.10.1):

rank | action | count |
---|---|---|
1 | update apiextensions.k8s.io/v1/customresourcedefinitions | 179688 |
2 | get coordination.k8s.io/v1/leases | 1771 |
3 | update coordination.k8s.io/v1/leases | 1771 |
4 | update pkg.crossplane.io/v1/providerrevisions | 1234 |
5 | update pkg.crossplane.io/v1/providers | 557 |
6 | patch pkg.crossplane.io/v1/providerrevisions | 557 |
7 | delete apps/v1/deployments | 555 |
8 | delete /v1/services | 555 |
9 | delete /v1/serviceaccounts | 555 |
10 | watch apiextensions.k8s.io/v1/customresourcedefinitions | 18 |
11 | watch pkg.crossplane.io/v1/configurationrevisions | 16 |
12 | watch apps/v1/deployments | 16 |
13 | watch pkg.crossplane.io/v1alpha1/controllerconfigs | 16 |
14 | watch /v1/serviceaccounts | 16 |
15 | watch pkg.crossplane.io/v1/providerrevisions | 16 |
16 | watch pkg.crossplane.io/v1beta1/locks | 16 |
17 | watch apiextensions.crossplane.io/v1/compositeresourcedefinitions | 14 |
18 | watch pkg.crossplane.io/v1/configurations | 14 |
19 | watch pkg.crossplane.io/v1/providers | 14 |
20 | patch /v1/serviceaccounts | 3 |
21 | patch apps/v1/deployments | 3 |
These are only the calls from the service account used by Crossplane core, which is the main source of audit log bytes. The only provider we have installed (AWS) uses a separate service account which generates 119x fewer audit log bytes, and that volume is almost the same (only 4% different) between v1.8.1 and v1.10.1.
Crossplane core sends ~180,000 CRD update calls per hour ... why would it possibly need to do that?
CRD update calls are really expensive in audit log bytes because each entry contains the whole CRD.
(In case anybody else wants to try, here's the AWS Cloudwatch Logs Insights query I used to calculate the above)
```
fields @timestamp, @message, @logStream, strlen(@message) as size, concat(verb, ' ', objectRef.apiGroup, '/', objectRef.apiVersion, '/', objectRef.resource) as action
| filter @logStream like 'kube-apiserver-audit'
| filter user.username='system:serviceaccount:crossplane-system:crossplane'
| stats count(*) as count by action
| sort by count desc
```
For reference, this is on EKS 1.22.
I found a solution to my problem, without resorting to globally throttling the reconcile frequency ...
After reading https://github.com/crossplane/crossplane/issues/3407, which mentions continuous provider reconciliations and recommends enabling debug logging on Crossplane core, I did that.
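For anyone wanting to reproduce this: the `--debug` flag can be passed to core Crossplane the same way as `--max-reconcile-rate` above, and then it's a matter of tailing the core logs. A sketch, assuming the default `crossplane` Deployment in `crossplane-system`:

```shell
# Follow core Crossplane's logs and watch for repeating package reconciles
kubectl -n crossplane-system logs -f deployment/crossplane \
  | grep -E 'providerrevision\.pkg|provider\.pkg'
```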
The debug logs show the AWS provider being reconciled every 6-7 seconds:

```
17:56:00 DEBUG crossplane.events Normal {"object": {"kind":"ProviderRevision","name":"aws-808bafe360d1","uid":"36a15843-20c0-4479-9455-bb5a44d86866","apiVersion":"pkg.crossplane.io/v1","resourceVersion":"2676278682"}, "reason": "SyncPackage", "message": "Successfully configured package revision"}
17:56:00 DEBUG crossplane Reconciling {"controller": "packages/providerrevision.pkg.crossplane.io", "request": "/aws-808bafe360d1"}
17:56:00 DEBUG crossplane Reconciling {"controller": "packages/provider.pkg.crossplane.io", "request": "/aws"}
17:56:00 DEBUG crossplane.events Normal {"object": {"kind":"Provider","name":"aws","uid":"77383a9e-fefc-48ab-b1c8-70f93029116c","apiVersion":"pkg.crossplane.io/v1","resourceVersion":"2562063122"}, "reason": "InstallPackageRevision", "message": "Successfully installed package revision"}
17:56:06 DEBUG crossplane.events Normal {"object": {"kind":"ProviderRevision","name":"aws-808bafe360d1","uid":"36a15843-20c0-4479-9455-bb5a44d86866","apiVersion":"pkg.crossplane.io/v1","resourceVersion":"2676278977"}, "reason": "SyncPackage", "message": "Successfully configured package revision"}
17:56:06 DEBUG crossplane Reconciling {"controller": "packages/providerrevision.pkg.crossplane.io", "request": "/aws-808bafe360d1"}
17:56:06 DEBUG crossplane Reconciling {"controller": "packages/provider.pkg.crossplane.io", "request": "/aws"}
17:56:06 DEBUG crossplane.events Normal {"object": {"kind":"Provider","name":"aws","uid":"77383a9e-fefc-48ab-b1c8-70f93029116c","apiVersion":"pkg.crossplane.io/v1","resourceVersion":"2562063122"}, "reason": "InstallPackageRevision", "message": "Successfully installed package revision"}
17:56:13 DEBUG crossplane.events Normal {"object": {"kind":"ProviderRevision","name":"aws-808bafe360d1","uid":"36a15843-20c0-4479-9455-bb5a44d86866","apiVersion":"pkg.crossplane.io/v1","resourceVersion":"2676279303"}, "reason": "SyncPackage", "message": "Successfully configured package revision"}
17:56:13 DEBUG crossplane Reconciling {"controller": "packages/providerrevision.pkg.crossplane.io", "request": "/aws-808bafe360d1"}
17:56:13 DEBUG crossplane Reconciling {"controller": "packages/provider.pkg.crossplane.io", "request": "/aws"}
17:56:13 DEBUG crossplane.events Normal {"object": {"kind":"Provider","name":"aws","uid":"77383a9e-fefc-48ab-b1c8-70f93029116c","apiVersion":"pkg.crossplane.io/v1","resourceVersion":"2562063122"}, "reason": "InstallPackageRevision", "message": "Successfully installed package revision"}
```
However, I noticed that this was not the ID of our current provider, and that I had two ProviderRevision objects for the AWS provider when I only expected to have one:
```
$ kubectl get providerrevisions -n crossplane-system
NAME               HEALTHY   REVISION   IMAGE                                     STATE      DEP-FOUND   DEP-INSTALLED   AGE
aws-6707d06fe75f   True      6          crossplanecontrib/provider-aws:v0.34.0    Active                                 27d
aws-808bafe360d1   True      5          crossplane/provider-aws:v0.27.1           Inactive                               204d
```
Once I deleted the older of the two ProviderRevision objects (and restarted the Crossplane controller to be sure), my problem completely disappeared.
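For reference, roughly the commands involved (a sketch using the names from the output above, and assuming the core Deployment is `crossplane` in `crossplane-system`):

```shell
# ProviderRevisions are cluster-scoped, so no namespace is needed here
kubectl delete providerrevision aws-808bafe360d1

# Restart core Crossplane so the package manager drops any cached state
kubectl -n crossplane-system rollout restart deployment/crossplane
```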
I don't know if this is relevant or applicable to the original problem mentioned, but it certainly might be worth checking?
(Also, I think I need to add a graph to monitor CloudWatch log volume like @mateusz-lubanski-sinch, as this was an expensive thing for us not to notice...)
Thank you very much @andyspiers for sharing those findings and evidence! good debugging work! 💪
I'm definitely surprised to see ProviderRevision `aws-808bafe360d1` doing any reconciliation work considering it is in the Inactive state. I don't think that is expected.
One thing to note here is that the older "inactive" provider revision is for image `crossplane/provider-aws` and the newer active one is `crossplanecontrib/provider-aws`. Perhaps some confusion was introduced to the package manager's reconciliation of providers because of that image location change?
Curious to see what @hasheddan thinks of these findings 🤔
Ah, I see now that the reconciliation work is regarding the packages/providers themselves - that reinforces the evidence of the top API endpoint (by 2 orders of magnitude) being `update apiextensions.k8s.io/v1/customresourcedefinitions`.
I previously assumed the log statements were for the old/inactive provider-aws doing reconciliation work on managed resources in AWS. Since it's the CRDs and contents of the packages, it does seem like the package manager is in a confused state between the two provider revisions and possibly switching image location.
> I'm definitely surprised to see ProviderRevision `aws-808bafe360d1` doing any reconciliation work considering it is in the Inactive state. I don't think that is expected.
It is expected that we still reconcile old ProviderRevisions, as we ensure that their CRDs are still present and that they are still an owner of those CRDs. Because CRDs are never directly deleted by the package manager, which defers to garbage collection to clean them up, we try to avoid a state in which an inactive package revision has some but not all of its CRDs installed, which could happen if we didn't continue reconciling and some CRDs were manually deleted. However, in the happy path this behavior should not cause the ongoing reconciliation described here, even if the newer revision and the older one were fighting over CRD structure (we don't enqueue on modification to CRDs).
Something I notice here is that we are also constantly reconciling the parent Provider, which appears to be triggering the old ProviderRevision reconcile. I'm wondering if perhaps the upgrade in Crossplane versions led to a constant diff on a ProviderRevision that was installed by a previous version of Crossplane, which could get us into an ongoing loop like this. I'll do some investigating to see if I can reproduce.
Thank you for the depth of investigation here @mateusz-lubanski-sinch and @andyspiers, and the helpful analysis @jbw976!
Thanks @hasheddan and @jbw976 for your insight and analysis. I've had to fix the issue (by removing the inactive provider revision) in the affected clusters to save money so I can't test directly any more, but I'm happy to share more details of what we would have had installed from our GitOps history if it helps.
Thank you all for sharing.
> Ah, I see now that the reconciliation work is regarding the packages/providers themselves - that reinforces the evidence of the top API endpoint (by 2 orders of magnitude) being `update apiextensions.k8s.io/v1/customresourcedefinitions`. I previously assumed the log statements were for the old/inactive provider-aws doing reconciliation work on managed resources in AWS. Since it's the CRDs and contents of the packages, it does seem like the package manager is in a confused state between the two provider revisions and possibly switching image location.
I don't think it is related to the image location (though it might make the situation worse), since this happens to me with this setup:
```
> k get providerrevisions
NAME                      HEALTHY   REVISION   IMAGE                                                   STATE      DEP-FOUND   DEP-INSTALLED   AGE
aws-2eb57c36ee2c          True      2          crossplane/provider-aws:v0.33.0-rc.0.12.ga1a214fa       Active                                 7d
aws-cdcc7b10da13          True      1          crossplane/provider-aws:v0.33.0-rc.0.4.g1d993448        Inactive                               46d
kafka-e97633b5a9d1        True      1          crossplane/provider-kafka:v0.1.0-10.gd7a3355            Active                                 46d
kubernetes-f935b3d8b7ec   True      1          crossplane/provider-kubernetes:v0.5.0-rc.13.g7d48dad    Active                                 46d
sql-16c18954b33d          True      2          crossplane/provider-sql:v0.5.0-10.g031c0bd              Active                                 7d
sql-18fa0d6354d0          True      1          crossplane/provider-sql:v0.4.1-30.ged6c55c              Inactive                               46d
terraform-a6475e3a5f6f    True      1          crossplane/provider-terraform:v0.4.0-rc.0.2.gf3bd7e0    Active                                 34d
```
To anyone with many providers, clusters, or both:

```shell
kubectl get providerrevision -o jsonpath='{.items[?(@.spec.desiredState=="Inactive")].metadata.name}' | xargs kubectl delete providerrevision
```
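(If you'd rather preview what that would delete before piping into `kubectl delete`, the selector on its own is read-only:)

```shell
kubectl get providerrevisions \
  -o jsonpath='{.items[?(@.spec.desiredState=="Inactive")].metadata.name}'
```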
Hey folks, I was able to investigate and determine a fix here, which we will backport and include in a v1.10 patch release. You can read the full write-up in https://github.com/crossplane/crossplane/pull/3582. Thank you again for providing such in-depth information on the behavior you were observing!
What happened?
After upgrading Crossplane from version 1.9.1 to 1.10.1 we observed a huge increase in audit logs:

What environment did it happen in?
Cluster: EKS 1.21.x
Crossplane: 1.10.1
AWS Provider: v0.34.0
Provisioned Claims: 2
Provisioned ManagedResources: 6
How can Crossplane help solve this problem?
This is a blocker for us as it increases our costs: not only for CloudWatch but also for our centralised logging system. As of today, EKS does not support the below functionalities which could mitigate that problem:
Is there any Crossplane configuration option which will reduce the amount of audit logs produced by Crossplane?