aquasecurity / trivy-operator

Kubernetes-native security toolkit
https://aquasecurity.github.io/trivy-operator/latest
Apache License 2.0

Missing workloads to be analysed when the cluster has a considerable number of apps running #1710

Closed borja-rivera closed 5 months ago

borja-rivera commented 9 months ago

What steps did you take and what happened:

I have a lab cluster with 71 active ReplicaSets and 38 DaemonSets, and the most I have managed to cover is 58 resources counting both types. I suspect the operator is not able to keep up with a cluster this size. I would like to know what the limitations are in terms of cluster size, because in my case I am trying to test the tool on a lab cluster that does not have many workloads compared to prod.

Here's an example of logs:

# 1. Get StatefulSet workload from cache
2023-12-14T07:08:12Z    DEBUG   reconciler.vulnerabilityreport  Getting workload from cache {"kind": "StatefulSet", "name": {"name":"alertmanager-prom-op-istio-gateways-alertmanager","namespace":"istio-gateways"}}

# 2. Submit scan for the workload selected
2023-12-14T07:08:12Z    DEBUG   reconciler.vulnerabilityreport  Submitting a scan for the workload  {"kind": "StatefulSet", "name": {"name":"alertmanager-prom-op-istio-gateways-alertmanager","namespace":"istio-gateways"}, "podSpecHash": "6bc8c5bbcb"}

# 3. Checks the scan jobs limit. OK in this case: only 1 scan job running out of a limit of 20
2023-12-14T07:08:12Z    DEBUG   reconciler.vulnerabilityreport  Checking scan jobs limit    {"pre scan job processing for workload:": "alertmanager-prom-op-istio-gateways-alertmanager", "count": 1, "limit": 20, "controllerKind": "StatefulSet", "controllerName": "alertmanager-prom-op-istio-gateways-alertmanager"}

# 4. Creates scan job
2023-12-14T07:08:12Z    DEBUG   reconciler.vulnerabilityreport  Creating scan job for the workload  {"kind": "StatefulSet", "name": "alertmanager-prom-op-istio-gateways-alertmanager", "namespace": "istio-gateways", "podSpecHash": "6bc8c5bbcb"}

# 5. Repeats step 1 (?)
2023-12-14T07:08:12Z    DEBUG   reconciler.vulnerabilityreport  Getting workload from cache {"kind": "StatefulSet", "name": {"name":"alertmanager-prom-op-istio-gateways-alertmanager","namespace":"istio-gateways"}}

# 6. Detects that a scan job already exists for the workload
2023-12-14T07:08:12Z    DEBUG   reconciler.vulnerabilityreport  Scan job already exists {"kind": "StatefulSet", "name": {"name":"alertmanager-prom-op-istio-gateways-alertmanager","namespace":"istio-gateways"}, "podSpecHash": "6bc8c5bbcb", "job": "trivy-system/scan-vulnerabilityreport-65c99bbcfd"}

# 7. Deletes the job because the scan failed (why did it fail??)
2023-12-14T07:08:13Z    DEBUG   reconciler.scan job Deleting failed scan job    {"job": "trivy-system/scan-vulnerabilityreport-65c99bbcfd"}

I have installed trivy-operator with the latest helm chart. Here is my values.yaml:

excludeNamespaces: "default"
operator:
  exposedSecretScannerEnabled: false
  configAuditScannerEnabled: false
  rbacAssessmentScannerEnabled: false
  infraAssessmentScannerEnabled: false
  metricsVulnIdEnabled: true
  logDevMode: true
  scanJobTTL: "24h"
  scanJobsConcurrentLimit: 20
service:
  metricsPort: 8080
serviceMonitor:
  enabled: true
  namespace: trivy-system
  labels:
    monitoredBy: monitoring-prometheus
trivyOperator:
  scanJobCompressLogs: false
trivy:
  command: filesystem
  additionalVulnerabilityReportFields: "Target,Class,PackagePath,PackageType"
  slow: false
  resources:
    requests:
      cpu: 300m
      memory: 300M
    limits:
      cpu: 700m
      memory: 700M
  securityContext:
    runAsUser: 0

What did you expect to happen: I expected vulnerability reports for all workloads in the cluster, but it isn't working as expected.

Environment:

chen-keinan commented 9 months ago

@borja-rivera do you have any logs for the failed pod?

chen-keinan commented 8 months ago

@borja-rivera any update?

borja-rivera commented 8 months ago

@chen-keinan There are still workloads waiting to be analysed. I disabled OPERATOR_SBOM_GENERATION_ENABLED, but it still doesn't generate all the reports it should. I also excluded the kube-system namespace, but there are still apps missing.

chen-keinan commented 8 months ago

@borja-rivera disabling OPERATOR_SBOM_GENERATION_ENABLED will not help in your case, as you have errors. Could you please share more log data containing the errors so I can track down the error context?

borja-rivera commented 8 months ago

@chen-keinan Another recurring error is:

2023-12-27T08:47:33Z    ERROR   reconciler.scan job Scan job container  {"job": "trivy-system/scan-vulnerabilityreport-684c445888", "container": "manager", "status.reason": "Error", "status.message": "2023-12-27T08:47:30.159Z\t\u001b[31mFATAL\u001b[0m\tinit error: cache error: unable to initialize the cache: unable to initialize fs cache: failed to create cache dir: mkdir /var/trivyoperator/trivy-db/fanal: permission denied\n"}
github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).processFailedScanJob
    /home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:320
github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).SetupWithManager.(*ScanJobController).reconcileJobs.func1
    /home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:81
sigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/reconcile/reconcile.go:111
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227

2023-12-27T08:47:33Z    DEBUG   reconciler.scan job Deleting failed scan job    {"job": "trivy-system/scan-vulnerabilityreport-684c445888"}

chen-keinan commented 8 months ago

@borja-rivera I see a basic permission error when creating the cache folder. Are you running a special config? Can you share your ConfigMaps and the operator deployment's EnvVars?

2023-12-27T08:47:33Z    ERROR   reconciler.scan job Scan job container  {"job": "trivy-system/scan-vulnerabilityreport-684c445888", "container": "manager", "status.reason": "Error", "status.message": "2023-12-27T08:47:30.159Z\t\u001b[31mFATAL\u001b[0m\tinit error: cache error: unable to initialize the cache: unable to initialize fs cache: failed to create cache dir: mkdir /var/trivyoperator/trivy-db/fanal: permission denied\n"}

borja-rivera commented 8 months ago

@chen-keinan Here's the trivy-operator-trivy-config config map:

apiVersion: v1
data:
  trivy.additionalVulnerabilityReportFields: Target,Class,PackagePath,PackageType
  trivy.command: filesystem
  trivy.dbRepository: ghcr.io/aquasecurity/trivy-db
  trivy.dbRepositoryInsecure: "false"
  trivy.filesystemScanCacheDir: /var/trivyoperator/trivy-db
  trivy.imagePullPolicy: IfNotPresent
  trivy.imageScanCacheDir: /tmp/trivy/.cache
  trivy.javaDbRepository: ghcr.io/aquasecurity/trivy-java-db
  trivy.mode: Standalone
  trivy.repository: ghcr.io/aquasecurity/trivy
  trivy.resources.limits.cpu: 700m
  trivy.resources.limits.memory: 700M
  trivy.resources.requests.cpu: 300m
  trivy.resources.requests.memory: 300M
  trivy.severity: UNKNOWN,LOW,MEDIUM,HIGH,CRITICAL
  trivy.skipJavaDBUpdate: "false"
  trivy.slow: "false"
  trivy.supportedConfigAuditKinds: Workload,Service,Role,ClusterRole,NetworkPolicy,Ingress,LimitRange,ResourceQuota
  trivy.tag: 0.47.0
  trivy.timeout: 5m0s
  trivy.useBuiltinRegoPolicies: "true"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: trivy-operator
    meta.helm.sh/release-namespace: trivy-system
  creationTimestamp: "2023-12-27T08:46:22Z"
  labels:
    app.kubernetes.io/instance: trivy-operator
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: trivy-operator
    app.kubernetes.io/version: 0.17.1
    helm.sh/chart: trivy-operator-0.19.1
  name: trivy-operator-trivy-config
  namespace: trivy-system
  resourceVersion: "579337566"
  uid: 0146087f-48f8-45ae-8993-1e4e311f17ca

And trivy-operator config map:

apiVersion: v1
data:
  compliance.failEntriesLimit: "10"
  node.collector.imageRef: ghcr.io/aquasecurity/node-collector:0.0.9
  nodeCollector.volumeMounts: '[{"mountPath":"/var/lib/etcd","name":"var-lib-etcd","readOnly":true},{"mountPath":"/var/lib/kubelet","name":"var-lib-kubelet","readOnly":true},{"mountPath":"/var/lib/kube-scheduler","name":"var-lib-kube-scheduler","readOnly":true},{"mountPath":"/var/lib/kube-controller-manager","name":"var-lib-kube-controller-manager","readOnly":true},{"mountPath":"/etc/systemd","name":"etc-systemd","readOnly":true},{"mountPath":"/lib/systemd/","name":"lib-systemd","readOnly":true},{"mountPath":"/etc/kubernetes","name":"etc-kubernetes","readOnly":true},{"mountPath":"/etc/cni/net.d/","name":"etc-cni-netd","readOnly":true}]'
  nodeCollector.volumes: '[{"hostPath":{"path":"/var/lib/etcd"},"name":"var-lib-etcd"},{"hostPath":{"path":"/var/lib/kubelet"},"name":"var-lib-kubelet"},{"hostPath":{"path":"/var/lib/kube-scheduler"},"name":"var-lib-kube-scheduler"},{"hostPath":{"path":"/var/lib/kube-controller-manager"},"name":"var-lib-kube-controller-manager"},{"hostPath":{"path":"/etc/systemd"},"name":"etc-systemd"},{"hostPath":{"path":"/lib/systemd"},"name":"lib-systemd"},{"hostPath":{"path":"/etc/kubernetes"},"name":"etc-kubernetes"},{"hostPath":{"path":"/etc/cni/net.d/"},"name":"etc-cni-netd"}]'
  report.recordFailedChecksOnly: "true"
  scanJob.podTemplateContainerSecurityContext: '{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]},"privileged":false,"readOnlyRootFilesystem":true}'
  vulnerabilityReports.scanner: Trivy
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: trivy-operator
    meta.helm.sh/release-namespace: trivy-system
  creationTimestamp: "2023-12-27T08:46:22Z"
  labels:
    app.kubernetes.io/instance: trivy-operator
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: trivy-operator
    app.kubernetes.io/version: 0.17.1
    helm.sh/chart: trivy-operator-0.19.1
  name: trivy-operator
  namespace: trivy-system
  resourceVersion: "579337565"
  uid: 457a6483-6f94-43ad-97ed-19fa8140c03b

borja-rivera commented 8 months ago

Env:

  - env:
    - name: OPERATOR_NAMESPACE
      value: trivy-system
    - name: OPERATOR_TARGET_NAMESPACES
    - name: OPERATOR_EXCLUDE_NAMESPACES
      value: default,kube-system
    - name: OPERATOR_TARGET_WORKLOADS
      value: pod,replicaset,replicationcontroller,statefulset,daemonset,cronjob,job
    - name: OPERATOR_SERVICE_ACCOUNT
      value: trivy-operator
    - name: OPERATOR_LOG_DEV_MODE
      value: "true"
    - name: OPERATOR_SCAN_JOB_TTL
      value: 24h
    - name: OPERATOR_SCAN_JOB_TIMEOUT
      value: 5m
    - name: OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT
      value: "20"
    - name: OPERATOR_CONCURRENT_NODE_COLLECTOR_LIMIT
      value: "1"
    - name: OPERATOR_SCAN_JOB_RETRY_AFTER
      value: 30s
    - name: OPERATOR_BATCH_DELETE_LIMIT
      value: "10"
    - name: OPERATOR_BATCH_DELETE_DELAY
      value: 10s
    - name: OPERATOR_METRICS_BIND_ADDRESS
      value: :8080
    - name: OPERATOR_METRICS_FINDINGS_ENABLED
      value: "true"
    - name: OPERATOR_METRICS_VULN_ID_ENABLED
      value: "true"
    - name: OPERATOR_HEALTH_PROBE_BIND_ADDRESS
      value: :9090
    - name: OPERATOR_VULNERABILITY_SCANNER_ENABLED
      value: "true"
    - name: OPERATOR_SBOM_GENERATION_ENABLED
      value: "false"
    - name: OPERATOR_VULNERABILITY_SCANNER_SCAN_ONLY_CURRENT_REVISIONS
      value: "true"
    - name: OPERATOR_SCANNER_REPORT_TTL
      value: 24h
    - name: OPERATOR_CACHE_REPORT_TTL
      value: 120h
    - name: CONTROLLER_CACHE_SYNC_TIMEOUT
      value: 5m
    - name: OPERATOR_CONFIG_AUDIT_SCANNER_ENABLED
      value: "false"
    - name: OPERATOR_RBAC_ASSESSMENT_SCANNER_ENABLED
      value: "false"
    - name: OPERATOR_INFRA_ASSESSMENT_SCANNER_ENABLED
      value: "false"
    - name: OPERATOR_CONFIG_AUDIT_SCANNER_SCAN_ONLY_CURRENT_REVISIONS
      value: "true"
    - name: OPERATOR_EXPOSED_SECRET_SCANNER_ENABLED
      value: "false"
    - name: OPERATOR_METRICS_EXPOSED_SECRET_INFO_ENABLED
      value: "false"
    - name: OPERATOR_METRICS_CONFIG_AUDIT_INFO_ENABLED
      value: "false"
    - name: OPERATOR_METRICS_RBAC_ASSESSMENT_INFO_ENABLED
      value: "false"
    - name: OPERATOR_METRICS_INFRA_ASSESSMENT_INFO_ENABLED
      value: "false"
    - name: OPERATOR_METRICS_IMAGE_INFO_ENABLED
      value: "false"
    - name: OPERATOR_METRICS_CLUSTER_COMPLIANCE_INFO_ENABLED
      value: "false"
    - name: OPERATOR_WEBHOOK_BROADCAST_URL
    - name: OPERATOR_WEBHOOK_BROADCAST_TIMEOUT
      value: 30s
    - name: OPERATOR_SEND_DELETED_REPORTS
      value: "false"
    - name: OPERATOR_PRIVATE_REGISTRY_SCAN_SECRETS_NAMES
      value: '{}'
    - name: OPERATOR_ACCESS_GLOBAL_SECRETS_SERVICE_ACCOUNTS
      value: "true"
    - name: OPERATOR_BUILT_IN_TRIVY_SERVER
      value: "false"
    - name: TRIVY_SERVER_HEALTH_CHECK_CACHE_EXPIRATION
      value: 10h
    - name: OPERATOR_MERGE_RBAC_FINDING_WITH_CONFIG_AUDIT
      value: "false"
    - name: OPERATOR_CLUSTER_COMPLIANCE_ENABLED
      value: "true"
chen-keinan commented 8 months ago

@borja-rivera since you are running in filesystem mode, you'll have to set scanJobPodTemplateContainerSecurityContext to runAsUser: 0

You need to uncomment this flag; see also in our docs: First Option: Filesystem Scanning
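As a minimal sketch, the suggested Helm values change could look like the following. The key name comes from the comment above; the nesting and surrounding fields are assumptions based on the security context shown in the ConfigMap earlier in this thread, so verify them against your chart version:

```yaml
# Sketch only: nesting is an assumption; check your trivy-operator chart's values.
trivyOperator:
  scanJobPodTemplateContainerSecurityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
    privileged: false
    readOnlyRootFilesystem: true
    # Run as root so Trivy's filesystem-scan cache dir
    # (/var/trivyoperator/trivy-db/fanal) can be created:
    runAsUser: 0
```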

borja-rivera commented 8 months ago

@chen-keinan After a few days with this option activated, there are workloads that it now analyses that it did not before. However, it still doesn't analyse all the workloads: I made a parallel script that analyses all the pods with the trivy CLI, and it finds 101 different images, while trivy-operator is analysing only 23.
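For reference, a comparison along these lines could be sketched as follows (hypothetical script, not the one used in this thread; it assumes a configured kubectl and the standard VulnerabilityReport schema):

```python
import json
import subprocess


def kubectl_json(*args: str) -> dict:
    """Run kubectl with -o json and parse its output (assumes kubectl is configured)."""
    out = subprocess.run(["kubectl", *args, "-o", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)


def unique_images(pods: dict) -> set:
    """Collect distinct container images from `kubectl get pods -A -o json` output."""
    images = set()
    for pod in pods.get("items", []):
        for section in ("containers", "initContainers"):
            for container in pod["spec"].get(section, []):
                images.add(container["image"])
    return images


# Usage against a live cluster:
#   running = unique_images(kubectl_json("get", "pods", "-A"))
#   reports = kubectl_json("get", "vulnerabilityreports", "-A")
#   scanned = {r["report"]["artifact"]["repository"] for r in reports["items"]}
#   print(len(running), "images running;", len(scanned), "repositories with reports")
```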

borja-rivera commented 8 months ago

I'm excluding the kube-system namespace, which reports a lot of vulns but isn't relevant for us at the moment.

chen-keinan commented 8 months ago

@borja-rivera try this workaround for the issue: disable this helm flag and restart the operator. Note: this bug has been fixed in the latest v0.18.0-rc

borja-rivera commented 8 months ago

@chen-keinan I'm currently testing it. Although there are some more reports than before, it's still not scanning all of them. I'll let it scan for a few more days and then check for updates.

chen-keinan commented 8 months ago

@borja-rivera thanks, please do let me know if you get any errors in the logs

chen-keinan commented 7 months ago

@borja-rivera the issue should be fixed with the latest trivy-operator v0.18.2. Can you please check and confirm?

chen-keinan commented 5 months ago

Closing the issue, as it should be fixed with the latest versions. @borja-rivera feel free to reopen it if you think it has not been resolved.