crossplane / crossplane

The Cloud Native Control Plane
https://crossplane.io
Apache License 2.0

crossplane 1.10.1 drastically increases amount of kubernetes audit logs #3540

Closed mateusz-lubanski-sinch closed 1 year ago

mateusz-lubanski-sinch commented 1 year ago

What happened?

After upgrading Crossplane from version 1.9.1 to 1.10.1 we observed a huge increase in audit logs:

(screenshot: graph showing the jump in audit log volume after the upgrade)

What environment did it happen in?

Cluster: EKS 1.21.x
Crossplane: 1.10.1
AWS Provider: v0.34.0
Provisioned Claims: 2
Provisioned ManagedResources: 6

How can Crossplane help solve your problem?

This is a blocker for us, as it increases our costs: not only for CloudWatch but also for our centralised logging system. As of today, EKS does not support functionality that could mitigate this problem.

Is there any Crossplane configuration option that would reduce the amount of audit logs produced by Crossplane?

mateusz-lubanski-sinch commented 1 year ago

I noticed that after setting the --max-reconcile-rate=1 container flag on the crossplane container, the amount of logs went back to its original throughput:

(screenshot: graph showing audit log volume returning to its previous level after setting --max-reconcile-rate=1)

Might be related to https://github.com/crossplane/crossplane/issues/2595

rezonick commented 1 year ago

Hi, it looks like we hit this issue too after upgrading to 1.10.1. Can you explain how exactly you configured --max-reconcile-rate=1? By container flag, do you mean applying the following configuration to the Crossplane Deployment resource?

  name: crossplane-pod
  namespace: crossplane
spec:
  containers:
  - name: crossplane
    image: crossplane/crossplane:v1.10.1
    ...
    args:
    - core
    - start
    - --max-reconcile-rate=1

Is there any option to configure it via the helm chart?

Thanks :)
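For what it's worth, the Crossplane Helm chart exposes a list for extra container arguments, so a values fragment along these lines should work (a sketch: the `args` key name is an assumption here, so verify it against the values.yaml shipped with your chart version):

```yaml
# Hypothetical Helm values fragment -- the `args` key is an assumption;
# check the values.yaml of your chart version before applying.
args:
  - --max-reconcile-rate=1
```

This could then be applied with `helm upgrade` and `-f values.yaml`, or the equivalent `--set` flag.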

andyspiers commented 1 year ago

Hi

We found this issue while investigating a huge increase in our Cloudwatch Logs bill which we tracked back to a recent Crossplane upgrade from 1.8.1 to 1.10.1

I used CloudWatch Logs Insights to sum Kubernetes audit log traffic for system:serviceaccount:crossplane-system:crossplane over a few 1-hour sample periods on the same cluster, with more or less the same number of Crossplane AWS resources defined:

This is a >14x increase!

andyspiers commented 1 year ago

I have done a bit more digging and found a few key points:

Audit log volume is independent of resources under management

We have 4 clusters running Crossplane, each in a different environment and each with a different number of resources under management (all AWS, no XRDs), with a spread of 9x from smallest to largest:

However, in all clusters the audit logs have increased to about the same level, which suggests that audit log volume is not proportional to the number of resources under management; it is more or less fixed.

Nature of audit log calls

1 hr period on v1.8.1 and AWS provider v0.27.1

| rank | action | count |
| ---: | --- | ---: |
| 1 | update apiextensions.k8s.io/v1/customresourcedefinitions | 16480 |
| 2 | update coordination.k8s.io/v1/leases | 1783 |
| 3 | get coordination.k8s.io/v1/leases | 1783 |
| 4 | update pkg.crossplane.io/v1/providerrevisions | 104 |
| 5 | patch /v1/serviceaccounts | 49 |
| 6 | patch apps/v1/deployments | 49 |
| 7 | watch apps/v1/deployments | 18 |
| 8 | watch pkg.crossplane.io/v1beta1/locks | 18 |
| 9 | watch apiextensions.crossplane.io/v1/compositeresourcedefinitions | 18 |
| 10 | watch pkg.crossplane.io/v1alpha1/controllerconfigs | 16 |
| 11 | watch /v1/serviceaccounts | 16 |
| 12 | watch pkg.crossplane.io/v1/configurationrevisions | 16 |
| 13 | watch pkg.crossplane.io/v1/providerrevisions | 16 |
| 14 | watch pkg.crossplane.io/v1/configurations | 16 |
| 15 | watch apiextensions.k8s.io/v1/customresourcedefinitions | 14 |
| 16 | watch pkg.crossplane.io/v1/providers | 14 |
| 17 | update pkg.crossplane.io/v1beta1/locks | 12 |
| 18 | patch pkg.crossplane.io/v1/providerrevisions | 3 |
| 19 | update pkg.crossplane.io/v1/providers | 3 |
| 20 | delete apps/v1/deployments | 3 |
| 21 | delete /v1/services | 3 |
| 22 | delete /v1/serviceaccounts | 3 |

1 hr period on v1.10.1 and AWS provider v0.34.0

| rank | action | count |
| ---: | --- | ---: |
| 1 | update apiextensions.k8s.io/v1/customresourcedefinitions | 179688 |
| 2 | get coordination.k8s.io/v1/leases | 1771 |
| 3 | update coordination.k8s.io/v1/leases | 1771 |
| 4 | update pkg.crossplane.io/v1/providerrevisions | 1234 |
| 5 | update pkg.crossplane.io/v1/providers | 557 |
| 6 | patch pkg.crossplane.io/v1/providerrevisions | 557 |
| 7 | delete apps/v1/deployments | 555 |
| 8 | delete /v1/services | 555 |
| 9 | delete /v1/serviceaccounts | 555 |
| 10 | watch apiextensions.k8s.io/v1/customresourcedefinitions | 18 |
| 11 | watch pkg.crossplane.io/v1/configurationrevisions | 16 |
| 12 | watch apps/v1/deployments | 16 |
| 13 | watch pkg.crossplane.io/v1alpha1/controllerconfigs | 16 |
| 14 | watch /v1/serviceaccounts | 16 |
| 15 | watch pkg.crossplane.io/v1/providerrevisions | 16 |
| 16 | watch pkg.crossplane.io/v1beta1/locks | 16 |
| 17 | watch apiextensions.crossplane.io/v1/compositeresourcedefinitions | 14 |
| 18 | watch pkg.crossplane.io/v1/configurations | 14 |
| 19 | watch pkg.crossplane.io/v1/providers | 14 |
| 20 | patch /v1/serviceaccounts | 3 |
| 21 | patch apps/v1/deployments | 3 |

Notes

(In case anybody else wants to try, here's the AWS Cloudwatch Logs Insights query I used to calculate the above)

fields @timestamp, @message, @logStream, strlen(@message) as size, concat(verb, ' ', objectRef.apiGroup, '/', objectRef.apiVersion, '/', objectRef.resource) as action
| filter @logStream like 'kube-apiserver-audit'
| filter user.username='system:serviceaccount:crossplane-system:crossplane'
| stats count(*) as count by action
| sort by count desc

For reference this is on EKS 1.22
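As a rough sanity check on the two tables above (a sketch; the numbers are simply copied from the samples), the per-endpoint and overall request-count ratios can be computed:

```python
# Request counts copied from the two 1-hour audit-log samples above.
v181 = [16480, 1783, 1783, 104, 49, 49, 18, 18, 18, 16, 16,
        16, 16, 16, 14, 14, 12, 3, 3, 3, 3, 3]          # v1.8.1 sample
v1101 = [179688, 1771, 1771, 1234, 557, 557, 555, 555, 555, 18, 16,
         16, 16, 16, 16, 16, 14, 14, 14, 3, 3]          # v1.10.1 sample

crd_ratio = 179688 / 16480            # top endpoint: CRD updates
total_ratio = sum(v1101) / sum(v181)  # all audited requests

print(f"CRD update calls grew ~{crd_ratio:.1f}x")       # ~10.9x
print(f"Total audit events grew ~{total_ratio:.1f}x")   # ~9.2x
```

The event-count increase (~9x) is smaller than the >14x byte-volume increase reported above, presumably because CRD update audit entries carry large request bodies.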

andyspiers commented 1 year ago

I found a solution to my problem, without resorting to globally throttling the reconcile frequency ...

After reading https://github.com/crossplane/crossplane/issues/3407 which mentions continuous provider reconciliations and recommends setting debug logs on crossplane core, I did that.

The debug logs show the AWS provider being reconciled every 6-7 seconds:

17:56:00 DEBUG    crossplane.events    Normal    {"object": {"kind":"ProviderRevision","name":"aws-808bafe360d1","uid":"36a15843-20c0-4479-9455-bb5a44d86866","apiVersion":"pkg.crossplane.io/v1","resourceVersion":"2676278682"}, "reason": "SyncPackage", "message": "Successfully configured package revision"}
17:56:00 DEBUG    crossplane    Reconciling    {"controller": "packages/providerrevision.pkg.crossplane.io", "request": "/aws-808bafe360d1"}
17:56:00 DEBUG    crossplane    Reconciling    {"controller": "packages/provider.pkg.crossplane.io", "request": "/aws"}
17:56:00 DEBUG    crossplane.events    Normal    {"object": {"kind":"Provider","name":"aws","uid":"77383a9e-fefc-48ab-b1c8-70f93029116c","apiVersion":"pkg.crossplane.io/v1","resourceVersion":"2562063122"}, "reason": "InstallPackageRevision", "message": "Successfully installed package revision"}
17:56:06 DEBUG    crossplane.events    Normal    {"object": {"kind":"ProviderRevision","name":"aws-808bafe360d1","uid":"36a15843-20c0-4479-9455-bb5a44d86866","apiVersion":"pkg.crossplane.io/v1","resourceVersion":"2676278977"}, "reason": "SyncPackage", "message": "Successfully configured package revision"}
17:56:06 DEBUG    crossplane    Reconciling    {"controller": "packages/providerrevision.pkg.crossplane.io", "request": "/aws-808bafe360d1"}
17:56:06 DEBUG    crossplane    Reconciling    {"controller": "packages/provider.pkg.crossplane.io", "request": "/aws"}
17:56:06 DEBUG    crossplane.events    Normal    {"object": {"kind":"Provider","name":"aws","uid":"77383a9e-fefc-48ab-b1c8-70f93029116c","apiVersion":"pkg.crossplane.io/v1","resourceVersion":"2562063122"}, "reason": "InstallPackageRevision", "message": "Successfully installed package revision"}
17:56:13 DEBUG    crossplane.events    Normal    {"object": {"kind":"ProviderRevision","name":"aws-808bafe360d1","uid":"36a15843-20c0-4479-9455-bb5a44d86866","apiVersion":"pkg.crossplane.io/v1","resourceVersion":"2676279303"}, "reason": "SyncPackage", "message": "Successfully configured package revision"}
17:56:13 DEBUG    crossplane    Reconciling    {"controller": "packages/providerrevision.pkg.crossplane.io", "request": "/aws-808bafe360d1"}
17:56:13 DEBUG    crossplane    Reconciling    {"controller": "packages/provider.pkg.crossplane.io", "request": "/aws"}
17:56:13 DEBUG    crossplane.events    Normal    {"object": {"kind":"Provider","name":"aws","uid":"77383a9e-fefc-48ab-b1c8-70f93029116c","apiVersion":"pkg.crossplane.io/v1","resourceVersion":"2562063122"}, "reason": "InstallPackageRevision", "message": "Successfully installed package revision"}

I noticed, however, that this was not the ID of our current provider, and then saw that I had 2 providerrevision objects for the AWS provider when I expected only one:

$ kubectl get providerrevisions -n crossplane-system
NAME               HEALTHY   REVISION   IMAGE                                    STATE      DEP-FOUND   DEP-INSTALLED   AGE
aws-6707d06fe75f   True      6          crossplanecontrib/provider-aws:v0.34.0   Active                                 27d
aws-808bafe360d1   True      5          crossplane/provider-aws:v0.27.1          Inactive                               204d

Once I deleted the older of the 2 providerrevision objects (and restarted the crossplane controller to be sure), my problem completely disappeared:

I don't know if this is relevant or applicable to the original problem, but it is certainly worth checking.

(Also, I think I need to add a graph to monitor CloudWatch log volume like @mateusz-lubanski-sinch did, as this was an expensive thing for us not to notice...)
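A byte-volume variant of the earlier Logs Insights query could back such a graph. This is a sketch that assumes the same log group and stream naming as the query above:

```
fields strlen(@message) as size
| filter @logStream like 'kube-apiserver-audit'
| filter user.username = 'system:serviceaccount:crossplane-system:crossplane'
| stats sum(size) as bytes by bin(1h)
| sort bytes desc
```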

jbw976 commented 1 year ago

Thank you very much @andyspiers for sharing those findings and evidence! good debugging work! 💪

I'm definitely surprised to see ProviderRevision aws-808bafe360d1 doing any reconciliation work considering it is in the Inactive state. I don't think that is expected.

One thing to note here is that the older "inactive" provider revision is for image crossplane/provider-aws and the newer active one is crossplanecontrib/provider-aws. Perhaps some confusion was introduced to the package manager's reconciliation of providers because of that image location change?

Curious to see what @hasheddan thinks of these findings 🤔

jbw976 commented 1 year ago

ah, i see now that the reconciliation work is regarding the packages/providers themselves - that reinforces the evidence of the top API endpoint (by 2 orders of magnitude) being update apiextensions.k8s.io/v1/customresourcedefinitions.

I previously assumed the log statements were for the old/inactive provider-aws doing reconciliation work on managed resources in AWS. Since it's the CRDs and contents of the packages, it does seem like the package manager is in a confused state between the two provider revisions and possibly switching image location.

hasheddan commented 1 year ago

I'm definitely surprised to see ProviderRevision aws-808bafe360d1 doing any reconciliation work considering it is in the Inactive state. I don't think that is expected.

It is expected that we still reconcile old ProviderRevisions, as we ensure that their CRDs are still present and that they are an owner of each CRD. Because CRDs are never deleted directly by the package manager, which defers to garbage collection to clean them up, we try to avoid a state in which an inactive package revision has some but not all of its CRDs installed, which could happen if we didn't continue reconciling and some CRDs were manually deleted. However, on the happy path this behavior should not cause the ongoing reconciliation described here, even if the newer revision and the older one were fighting over CRD structure (we don't enqueue on modification to CRDs).

Something I notice here is that we are also constantly reconciling the parent Provider, which appears to be triggering the old ProviderRevision reconcile. I'm wondering if perhaps the upgrade in Crossplane versions led to a constant diff on a ProviderRevision that was installed by a previous version of Crossplane, which could get us into an ongoing loop like this. I'll do some investigating to see if I can reproduce.

Thank you for the depth of investigation here @mateusz-lubanski-sinch and @andyspiers, and the helpful analysis @jbw976!

andyspiers commented 1 year ago

Thanks @hasheddan and @jbw976 for your insight and analysis. I've had to fix the issue (by removing the inactive provider revision) in the affected clusters to save money so I can't test directly any more, but I'm happy to share more details of what we would have had installed from our GitOps history if it helps.

gchait commented 1 year ago

Thank you all for sharing.

> ah, i see now that the reconciliation work is regarding the packages/providers themselves - that reinforces the evidence of the top API endpoint (by 2 orders of magnitude) being update apiextensions.k8s.io/v1/customresourcedefinitions.
>
> I previously assumed the log statements were for the old/inactive provider-aws doing reconciliation work on managed resources in AWS. Since it's the CRDs and contents of the packages, it does seem like the package manager is in a confused state between the two provider revisions and possibly switching image location.

I don't think it is related to the image location (though it might worsen the situation), since this happens to me with this setup:

> k get providerrevisions
NAME                      HEALTHY   REVISION   IMAGE                                                  STATE      DEP-FOUND   DEP-INSTALLED   AGE
aws-2eb57c36ee2c          True      2          crossplane/provider-aws:v0.33.0-rc.0.12.ga1a214fa      Active                                 7d
aws-cdcc7b10da13          True      1          crossplane/provider-aws:v0.33.0-rc.0.4.g1d993448       Inactive                               46d
kafka-e97633b5a9d1        True      1          crossplane/provider-kafka:v0.1.0-10.gd7a3355           Active                                 46d
kubernetes-f935b3d8b7ec   True      1          crossplane/provider-kubernetes:v0.5.0-rc.13.g7d48dad   Active                                 46d
sql-16c18954b33d          True      2          crossplane/provider-sql:v0.5.0-10.g031c0bd             Active                                 7d
sql-18fa0d6354d0          True      1          crossplane/provider-sql:v0.4.1-30.ged6c55c             Inactive                               46d
terraform-a6475e3a5f6f    True      1          crossplane/provider-terraform:v0.4.0-rc.0.2.gf3bd7e0   Active                                 34d

gchait commented 1 year ago

To anyone with many providers/clusters/both:

kubectl get providerrevision -o jsonpath='{.items[?(@.spec.desiredState=="Inactive")].metadata.name}' | xargs kubectl delete providerrevision

hasheddan commented 1 year ago

Hey folks, I was able to investigate and determine a fix here, which we will backport and include in a v1.10 patch release. You can read the full write-up in https://github.com/crossplane/crossplane/pull/3582. Thank you again for providing such in-depth information on the behavior you were observing!