@borja-rivera do you have any logs for the failed pod?
@borja-rivera any update?
@chen-keinan There are still workloads to be analysed. I disabled OPERATOR_SBOM_GENERATION_ENABLED, but it still doesn't generate all the reports it should. I also excluded the Kubernetes-specific kube-system namespace, but there are still apps missing.
@borja-rivera disabling OPERATOR_SBOM_GENERATION_ENABLED will not help in your case, as you have errors. Could you please share more log data with the error so I can track the error context?
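For example, something along these lines would capture the relevant context (a sketch; it assumes the operator runs as the trivy-system/trivy-operator deployment, as in this thread):

```sh
# operator-side errors (deployment name and namespace taken from this thread)
kubectl -n trivy-system logs deploy/trivy-operator --tail=500 | grep -i error
# scan jobs run in the operator namespace; grab a failed one's logs too
kubectl -n trivy-system get jobs
kubectl -n trivy-system logs job/<scan-job-name> --all-containers
```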
@chen-keinan Another recurring error is:
```
2023-12-27T08:47:33Z ERROR reconciler.scan job Scan job container {"job": "trivy-system/scan-vulnerabilityreport-684c445888", "container": "manager", "status.reason": "Error", "status.message": "2023-12-27T08:47:30.159Z\t\u001b[31mFATAL\u001b[0m\tinit error: cache error: unable to initialize the cache: unable to initialize fs cache: failed to create cache dir: mkdir /var/trivyoperator/trivy-db/fanal: permission denied\n"}
github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).processFailedScanJob
/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:320
github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).SetupWithManager.(*ScanJobController).reconcileJobs.func1
/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:81
sigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/reconcile/reconcile.go:111
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
2023-12-27T08:47:33Z DEBUG reconciler.scan job Deleting failed scan job {"job": "trivy-system/scan-vulnerabilityreport-684c445888"}
```
@borja-rivera I see a basic permission error when creating the cache folder. Are you running a special config? Can you share your config maps and the operator deployment EnvVars?
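For reference, something like this would dump all three (a sketch; the resource names are taken from this thread):

```sh
# dump both config maps and the operator deployment env
kubectl -n trivy-system get configmap trivy-operator trivy-operator-trivy-config -o yaml
kubectl -n trivy-system get deployment trivy-operator \
  -o jsonpath='{.spec.template.spec.containers[0].env}'
```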
@chen-keinan Here's the trivy-operator-trivy-config config map:
```yaml
apiVersion: v1
data:
  trivy.additionalVulnerabilityReportFields: Target,Class,PackagePath,PackageType
  trivy.command: filesystem
  trivy.dbRepository: ghcr.io/aquasecurity/trivy-db
  trivy.dbRepositoryInsecure: "false"
  trivy.filesystemScanCacheDir: /var/trivyoperator/trivy-db
  trivy.imagePullPolicy: IfNotPresent
  trivy.imageScanCacheDir: /tmp/trivy/.cache
  trivy.javaDbRepository: ghcr.io/aquasecurity/trivy-java-db
  trivy.mode: Standalone
  trivy.repository: ghcr.io/aquasecurity/trivy
  trivy.resources.limits.cpu: 700m
  trivy.resources.limits.memory: 700M
  trivy.resources.requests.cpu: 300m
  trivy.resources.requests.memory: 300M
  trivy.severity: UNKNOWN,LOW,MEDIUM,HIGH,CRITICAL
  trivy.skipJavaDBUpdate: "false"
  trivy.slow: "false"
  trivy.supportedConfigAuditKinds: Workload,Service,Role,ClusterRole,NetworkPolicy,Ingress,LimitRange,ResourceQuota
  trivy.tag: 0.47.0
  trivy.timeout: 5m0s
  trivy.useBuiltinRegoPolicies: "true"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: trivy-operator
    meta.helm.sh/release-namespace: trivy-system
  creationTimestamp: "2023-12-27T08:46:22Z"
  labels:
    app.kubernetes.io/instance: trivy-operator
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: trivy-operator
    app.kubernetes.io/version: 0.17.1
    helm.sh/chart: trivy-operator-0.19.1
  name: trivy-operator-trivy-config
  namespace: trivy-system
  resourceVersion: "579337566"
  uid: 0146087f-48f8-45ae-8993-1e4e311f17ca
```
And the trivy-operator config map:
```yaml
apiVersion: v1
data:
  compliance.failEntriesLimit: "10"
  node.collector.imageRef: ghcr.io/aquasecurity/node-collector:0.0.9
  nodeCollector.volumeMounts: '[{"mountPath":"/var/lib/etcd","name":"var-lib-etcd","readOnly":true},{"mountPath":"/var/lib/kubelet","name":"var-lib-kubelet","readOnly":true},{"mountPath":"/var/lib/kube-scheduler","name":"var-lib-kube-scheduler","readOnly":true},{"mountPath":"/var/lib/kube-controller-manager","name":"var-lib-kube-controller-manager","readOnly":true},{"mountPath":"/etc/systemd","name":"etc-systemd","readOnly":true},{"mountPath":"/lib/systemd/","name":"lib-systemd","readOnly":true},{"mountPath":"/etc/kubernetes","name":"etc-kubernetes","readOnly":true},{"mountPath":"/etc/cni/net.d/","name":"etc-cni-netd","readOnly":true}]'
  nodeCollector.volumes: '[{"hostPath":{"path":"/var/lib/etcd"},"name":"var-lib-etcd"},{"hostPath":{"path":"/var/lib/kubelet"},"name":"var-lib-kubelet"},{"hostPath":{"path":"/var/lib/kube-scheduler"},"name":"var-lib-kube-scheduler"},{"hostPath":{"path":"/var/lib/kube-controller-manager"},"name":"var-lib-kube-controller-manager"},{"hostPath":{"path":"/etc/systemd"},"name":"etc-systemd"},{"hostPath":{"path":"/lib/systemd"},"name":"lib-systemd"},{"hostPath":{"path":"/etc/kubernetes"},"name":"etc-kubernetes"},{"hostPath":{"path":"/etc/cni/net.d/"},"name":"etc-cni-netd"}]'
  report.recordFailedChecksOnly: "true"
  scanJob.podTemplateContainerSecurityContext: '{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]},"privileged":false,"readOnlyRootFilesystem":true}'
  vulnerabilityReports.scanner: Trivy
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: trivy-operator
    meta.helm.sh/release-namespace: trivy-system
  creationTimestamp: "2023-12-27T08:46:22Z"
  labels:
    app.kubernetes.io/instance: trivy-operator
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: trivy-operator
    app.kubernetes.io/version: 0.17.1
    helm.sh/chart: trivy-operator-0.19.1
  name: trivy-operator
  namespace: trivy-system
  resourceVersion: "579337565"
  uid: 457a6483-6f94-43ad-97ed-19fa8140c03b
```
Env:
```yaml
- env:
    - name: OPERATOR_NAMESPACE
      value: trivy-system
    - name: OPERATOR_TARGET_NAMESPACES
    - name: OPERATOR_EXCLUDE_NAMESPACES
      value: default,kube-system
    - name: OPERATOR_TARGET_WORKLOADS
      value: pod,replicaset,replicationcontroller,statefulset,daemonset,cronjob,job
    - name: OPERATOR_SERVICE_ACCOUNT
      value: trivy-operator
    - name: OPERATOR_LOG_DEV_MODE
      value: "true"
    - name: OPERATOR_SCAN_JOB_TTL
      value: 24h
    - name: OPERATOR_SCAN_JOB_TIMEOUT
      value: 5m
    - name: OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT
      value: "20"
    - name: OPERATOR_CONCURRENT_NODE_COLLECTOR_LIMIT
      value: "1"
    - name: OPERATOR_SCAN_JOB_RETRY_AFTER
      value: 30s
    - name: OPERATOR_BATCH_DELETE_LIMIT
      value: "10"
    - name: OPERATOR_BATCH_DELETE_DELAY
      value: 10s
    - name: OPERATOR_METRICS_BIND_ADDRESS
      value: :8080
    - name: OPERATOR_METRICS_FINDINGS_ENABLED
      value: "true"
    - name: OPERATOR_METRICS_VULN_ID_ENABLED
      value: "true"
    - name: OPERATOR_HEALTH_PROBE_BIND_ADDRESS
      value: :9090
    - name: OPERATOR_VULNERABILITY_SCANNER_ENABLED
      value: "true"
    - name: OPERATOR_SBOM_GENERATION_ENABLED
      value: "false"
    - name: OPERATOR_VULNERABILITY_SCANNER_SCAN_ONLY_CURRENT_REVISIONS
      value: "true"
    - name: OPERATOR_SCANNER_REPORT_TTL
      value: 24h
    - name: OPERATOR_CACHE_REPORT_TTL
      value: 120h
    - name: CONTROLLER_CACHE_SYNC_TIMEOUT
      value: 5m
    - name: OPERATOR_CONFIG_AUDIT_SCANNER_ENABLED
      value: "false"
    - name: OPERATOR_RBAC_ASSESSMENT_SCANNER_ENABLED
      value: "false"
    - name: OPERATOR_INFRA_ASSESSMENT_SCANNER_ENABLED
      value: "false"
    - name: OPERATOR_CONFIG_AUDIT_SCANNER_SCAN_ONLY_CURRENT_REVISIONS
      value: "true"
    - name: OPERATOR_EXPOSED_SECRET_SCANNER_ENABLED
      value: "false"
    - name: OPERATOR_METRICS_EXPOSED_SECRET_INFO_ENABLED
      value: "false"
    - name: OPERATOR_METRICS_CONFIG_AUDIT_INFO_ENABLED
      value: "false"
    - name: OPERATOR_METRICS_RBAC_ASSESSMENT_INFO_ENABLED
      value: "false"
    - name: OPERATOR_METRICS_INFRA_ASSESSMENT_INFO_ENABLED
      value: "false"
    - name: OPERATOR_METRICS_IMAGE_INFO_ENABLED
      value: "false"
    - name: OPERATOR_METRICS_CLUSTER_COMPLIANCE_INFO_ENABLED
      value: "false"
    - name: OPERATOR_WEBHOOK_BROADCAST_URL
    - name: OPERATOR_WEBHOOK_BROADCAST_TIMEOUT
      value: 30s
    - name: OPERATOR_SEND_DELETED_REPORTS
      value: "false"
    - name: OPERATOR_PRIVATE_REGISTRY_SCAN_SECRETS_NAMES
      value: '{}'
    - name: OPERATOR_ACCESS_GLOBAL_SECRETS_SERVICE_ACCOUNTS
      value: "true"
    - name: OPERATOR_BUILT_IN_TRIVY_SERVER
      value: "false"
    - name: TRIVY_SERVER_HEALTH_CHECK_CACHE_EXPIRATION
      value: 10h
    - name: OPERATOR_MERGE_RBAC_FINDING_WITH_CONFIG_AUDIT
      value: "false"
    - name: OPERATOR_CLUSTER_COMPLIANCE_ENABLED
      value: "true"
```
@chen-keinan After a few days with this option activated, there are workloads that it now analyses that it did not before. However, it still doesn't analyse all of them: I made a parallel script that analyses all the pods with the trivy tool, and it is analysing 101 different images, while trivy-operator is analysing only 23.
This is with the kube-system namespace excluded, since it reports a lot of vulnerabilities but isn't relevant for us at the moment.
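For reproducibility, the comparison can be sketched roughly like this (a hypothetical script; it assumes kubectl access and the trivy CLI on PATH, and images.txt is just a scratch file):

```sh
# list the distinct container images running in the cluster
kubectl get pods -A -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' \
  | sort -u > images.txt
wc -l < images.txt                                        # 101 in this case
# scan each image directly with the trivy CLI
while read -r img; do trivy image "$img"; done < images.txt
# versus the number of reports the operator produced
kubectl get vulnerabilityreports -A --no-headers | wc -l  # only 23 here
```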
@borja-rivera try this workaround for the issue: disable this helm flag and restart the operator.
Note: this bug has been fixed in the latest v0.18.0-rc.
@chen-keinan I'm currently testing it. Although there are some more reports than before, it's still not scanning all of them. I'll let it scan for a few more days and then check for updates.
@borja-rivera thanks, please do let me know if you get any errors in the logs.
@borja-rivera the issue should be fixed with the latest trivy-operator v0.18.2.
Can you please check and confirm?
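An upgrade along these lines should be enough to verify (a sketch; it assumes the chart was installed from the Aqua Helm repo under the release name trivy-operator, as the labels above suggest):

```sh
# pull the latest chart and upgrade in place, keeping the current values
helm repo update
helm upgrade trivy-operator aqua/trivy-operator -n trivy-system --reuse-values
kubectl -n trivy-system rollout status deployment/trivy-operator
```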
Closing the issue as it should be fixed with the latest versions. @borja-rivera feel free to reopen it if you think it has not been resolved.
What steps did you take and what happened:
I have a lab cluster with 71 active replicasets and 38 daemonsets, and the most coverage I have managed to get is 58 resources counting both types. I suspect the operator is not able to keep up with a cluster this size. I would like to know what the limitations are in terms of cluster size, because I am trying to test the tool on a lab cluster that does not have too many workloads compared to prod.
Here's an example of logs:
I have installed trivy-operator with the latest helm chart. Here is my values.yaml :
```yaml
excludeNamespaces: "default"
operator:
  exposedSecretScannerEnabled: false
  configAuditScannerEnabled: false
  rbacAssessmentScannerEnabled: false
  infraAssessmentScannerEnabled: false
  metricsVulnIdEnabled: true
  logDevMode: true
  scanJobTTL: "24h"
  scanJobsConcurrentLimit: 20
service:
  metricsPort: 8080
serviceMonitor:
  enabled: true
  namespace: trivy-system
  labels:
    monitoredBy: monitoring-prometheus
trivyOperator:
  scanJobCompressLogs: false
trivy:
  command: filesystem
  additionalVulnerabilityReportFields: "Target,Class,PackagePath,PackageType"
  slow: false
  resources:
    requests:
      cpu: 300m
      memory: 300M
    limits:
      cpu: 700m
      memory: 700M
  securityContext:
    runAsUser: 0
```
What did you expect to happen: I expected vulnerability reports for all workloads on the cluster, but it isn't working as expected.
Environment: