Pionerd opened this issue 7 months ago
@Pionerd have you tried setting resources for the operator?
Yes, resources are set. CPU usage is constantly close to whatever limit I set, up to 4 CPUs. Memory does not seem a problem.
I don't mind if the app needs to use CPU (usually we don't set CPU limits at all), but from the logs it is not clear to me what the operator is doing that needs so much.
I do observe all `ClusterRbacAssessmentReports` being recreated constantly; none of them has an age of over 5 minutes.
@Pionerd reports are created when a resource is created/updated, when `reportTTL` is exceeded, or when a report is deleted. Is any of these related to your case?
No, none of those. It applies to literally all `ClusterRbacAssessmentReports`. `reportTTL` is the chart's default of 24h. I don't see this behaviour e.g. for `ConfigAuditReports`.
As mentioned above, the logs (in DEBUG mode) are also flooded with messages related to this:

```
2024-04-02T12:42:51Z DEBUG resourcecontroller Checking whether configuration audit report exists {"kind": "ClusterRole", "name": {"name":"system:node-problem-detector"}}
2024-04-02T12:42:52Z DEBUG resourcecontroller Checking whether configuration audit report exists {"kind": "ClusterRole", "name": {"name":"system:controller:replication-controller"}}
2024-04-02T12:42:53Z DEBUG resourcecontroller Checking whether configuration audit report exists {"kind": "ClusterRole", "name": {"name":"system:controller:resourcequota-controller"}}
```

etc.
@Pionerd the check for the report is part of processing. Do you see reports get generated every few minutes?
Yes
@Pionerd a few questions:
Yes, they are exactly the same. The reports are deleted and then it takes up to a minute for them to reappear.
> Yes, they are exactly the same. The reports are deleted and then it takes up to a minute for them to reappear.

That makes sense, as deletion of a `ClusterRbacAssessmentReport` will trigger a new `ClusterRole` scan. The only question is what is causing the deletion of the report. It can't be TTL, as TTL does not reconcile `ClusterRbacAssessmentReport`, only `RbacAssessmentReport`. Could you think of some outside process which deletes the `ClusterRbacAssessmentReport` or the `ClusterRole`?
If I scale down the deployment of the operator to 0 replicas, the `ClusterRbacAssessmentReports` remain constant (deletion stops). This makes the operator itself still a suspect to me.
This happens to all ClusterRoles, including all system ones, so I can't imagine how they could be temporarily deleted without causing all other kinds of symptoms in the system.
For your info: this is the same cluster as https://github.com/aquasecurity/trivy-operator/issues/1970 and even though InfraAssessments are enabled, they are not generated. So a few weird things seem to be going on.
@Pionerd Thanks for this update. I'll have a look regarding replicas > 0.
Regarding `InfraAssessment` reports: on AKS they will probably not be supported, as the cloud provider does not allow access to the api-server, controller-manager, scheduler, etc.
@chen-keinan I think I found the culprit. Based on https://github.com/aquasecurity/trivy-operator/issues/1742 I wrote a pretty extensive exception set. Is it possible that the operator does not handle this in a very CPU-friendly way? Are there any possibilities to improve this (or the exception definitions in general)?
I'd rather not share the whole exception definition here (it shows our full stack), but I can share it with you if it helps. It contains 26 `policy.ksvXXX_exclude_resources.rego` blocks, each with one or more `exception[rules]` blocks.
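For readers hitting the same thing, an exception file of the shape being discussed typically looks like the sketch below. The check ID, package namespace, and workload name are illustrative assumptions; the `package` must match the namespace of the built-in check being excluded, so verify it against your policy set before copying this shape:

```rego
# Hypothetical example: exclude one workload from a single built-in check.
# KSV014 and the package path are illustrative, not taken from the issue.
package builtin.kubernetes.KSV014

exception[rules] {
    # Only exempt this specific resource from the check.
    input.metadata.name == "my-legacy-workload"
    rules := [""]
}
```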
sure, but you'll have to put some real example here so I can try to reproduce it
Please see your gmail.
ok, got it now, I will take a look at it later
Unfortunately the issue reappeared in an environment without those exclusions. It would be really nice if we could make some progress on this issue; please let me know if and how I may assist.
@Pionerd can you try disabling scanners one by one to isolate the problem and see which one could be causing the performance issue?
`configAuditScannerEnabled: false` removes the symptoms.
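For anyone reproducing this, that toggle is a value on the trivy-operator Helm chart. A sketch of the relevant `values.yaml` fragment (key layout as in recent chart versions; verify against the chart version you are running):

```yaml
# values.yaml (sketch): disable the config-audit scanner
operator:
  configAuditScannerEnabled: false
```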
Based on the above I did an additional test with the config-audit scanner enabled, but with the exceptions sent to you earlier disabled. Although the CPU usage is also high for the first few minutes, after that the usage seems to return to normal. Maybe you can check whether you can reproduce the issue using those exceptions.
@Pionerd thanks for input, I'll have a look at it.
Hi @chen-keinan, I appreciate all your efforts for this project and suspect you have a lot on your plate. It'd be great if you could take a look at this issue, combined with the Rego exceptions I sent you earlier. In our experience this is the "best" way to document exceptions, so it's pretty vital to us (unless you have a better way to document exceptions, now or coming soon). TY!
@Pionerd the combine flag is used to combine multiple resources for the same policy. I'm not sure it can be achieved with the operator pattern, as it reconciles resource by resource and one is not aware of the other.
The exceptions are working as intended, only at the cost of very high CPU usage. Do you suggest I should split them up?
@Pionerd I'll have to investigate it, it might take a while
I'm experiencing the same.
I've installed trivy via Flux as described here: https://aquasecurity.github.io/trivy/v0.54/tutorials/kubernetes/gitops/#fluxcd.
As soon as it was installed, the cluster was almost unreachable via kubectl. See this dashboard showing a very high load average, disk IO, and network (nearly 100 Mb/s for several minutes):
After about 15 minutes things stabilised, but this was still unexpected.
What steps did you take and what happened:
Trivy Operator is using near 100% of CPU all the time. In debug mode the logs are flooded with messages like the ones above (not sure if the operator is supposed to check this often), but beyond that I do not see anything significant happening.
The average workload spike recurs every ~3-4 seconds.
What did you expect to happen: No high CPU usage in an idle situation
Environment:
- trivy-operator version (use `trivy-operator version`): 0.19.1
- Kubernetes version (use `kubectl version`): 1.28 AKS