aquasecurity / trivy-operator

Kubernetes-native security toolkit
https://aquasecurity.github.io/trivy-operator/latest
Apache License 2.0

Jobs Complete But Pods are Left Behind #1518

Open jicunningham opened 10 months ago

jicunningham commented 10 months ago

What steps did you take and what happened:

Often our vulnerability scanning pods complete successfully but just stay in the cluster instead of removing themselves as they are supposed to. These leftover scanner pods sometimes prevent new scans from happening: we limit concurrent scans to 3 so as not to put too much pressure on the cluster, and because the operator sees the old pods as still existing, they count against that limit and block further scanning.
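For context, the concurrency cap described above can be set through the Helm chart; a minimal values sketch, assuming the current chart's parameter name:

```yaml
# values.yaml excerpt (illustrative) -- caps how many scan jobs the
# operator runs in parallel; leftover completed pods count against
# this limit until they are cleaned up.
operator:
  scanJobsConcurrentLimit: 3
```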

What did you expect to happen: We expected the pods to run and then terminate.

Anything else you would like to add:

Here are some logs:

> kubectl describe po scan-vulnerabilityreport-75b7686948-jhbsl -n trivy-system
Name:         scan-vulnerabilityreport-75b7686948-jhbsl
Namespace:    trivy-system
Priority:     0
Node:         <node>/10.132.0.11
Start Time:   Fri, 15 Sep 2023 09:40:25 -0400
Labels:       app.kubernetes.io/managed-by=trivy-operator
              controller-uid=dc0aa08a-56b7-4a25-b691-ee783d258229
              job-name=scan-vulnerabilityreport-75b7686948
              resource-spec-hash=59f6d457db
              trivy-operator.resource.kind=ReplicaSet
              trivy-operator.resource.name=ingressgateway-7575b4cc64
              trivy-operator.resource.namespace=istio-system
              vulnerabilityReport.scanner=Trivy
Annotations:  cni.projectcalico.org/containerID: 5af2302ca2e01ba365953261bf4509a1b027d7e0b1eac57c2d21412c1eae4e83
              cni.projectcalico.org/podIP: 
              cni.projectcalico.org/podIPs: 
Status:       Succeeded
IP:           10.97.15.186
IPs:
  IP:           10.97.15.186
Controlled By:  Job/scan-vulnerabilityreport-75b7686948
Init Containers:
  f0d8315b-2717-40f2-9b50-742e0e833269:
    Container ID:  containerd://4a5051cb5dba683881595573cd01e0fe52c5dfa70c1a37ebc0b84053fbbd26b0
    Image:         ghcr.io/aquasecurity/trivy:0.42.0
    Image ID:      ghcr.io/aquasecurity/trivy@sha256:b75725d2c11ff54a5fe23f6e8b9a8c6177b8bf5221f08697cf0eed43442b1bfa
    Port:          <none>
    Host Port:     <none>
    Command:
      trivy
    Args:
      --cache-dir
      /tmp/trivy/.cache
      image
      --download-db-only
      --db-repository
      ghcr.io/aquasecurity/trivy-db
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 15 Sep 2023 09:40:26 -0400
      Finished:     Fri, 15 Sep 2023 09:40:31 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     750m
      memory:  750M
    Requests:
      cpu:     256m
      memory:  256M
    Environment:
      HTTP_PROXY:    <set to the key 'trivy.httpProxy' of config map 'trivy-operator-trivy-config'>   Optional: true
      HTTPS_PROXY:   <set to the key 'trivy.httpsProxy' of config map 'trivy-operator-trivy-config'>  Optional: true
      NO_PROXY:      <set to the key 'trivy.noProxy' of config map 'trivy-operator-trivy-config'>     Optional: true
      GITHUB_TOKEN:  <set to the key 'trivy.githubToken' in secret 'trivy-operator-trivy-config'>     Optional: true
    Mounts:
      /tmp from tmp (rw)
Containers:
  istio-proxy:
    Container ID:  containerd://cc59454fe9f3c4b56a6b7692a7ca148498179331e2563efeca9e27b39eee6563
    Image:         ghcr.io/aquasecurity/trivy:0.42.0
    Image ID:      ghcr.io/aquasecurity/trivy@sha256:b75725d2c11ff54a5fe23f6e8b9a8c6177b8bf5221f08697cf0eed43442b1bfa
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      trivy image --slow 'docker.io/istio/proxyv2:1.16.2' --scanners vuln   --skip-db-update --cache-dir /tmp/trivy/.cache --quiet --list-all-pkgs --format json > /tmp/scan/result_istio-proxy.json &&  bzip2 -c /tmp/scan/result_istio-proxy.json | base64
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 15 Sep 2023 09:40:34 -0400
      Finished:     Fri, 15 Sep 2023 09:40:46 -0400
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     750m
      memory:  750M
    Requests:
      cpu:     256m
      memory:  256M
    Environment:
      TRIVY_SEVERITY:            <set to the key 'trivy.severity' of config map 'trivy-operator-trivy-config'>          Optional: true
      TRIVY_IGNORE_UNFIXED:      <set to the key 'trivy.ignoreUnfixed' of config map 'trivy-operator-trivy-config'>     Optional: true
      TRIVY_OFFLINE_SCAN:        <set to the key 'trivy.offlineScan' of config map 'trivy-operator-trivy-config'>       Optional: true
      TRIVY_JAVA_DB_REPOSITORY:  <set to the key 'trivy.javaDbRepository' of config map 'trivy-operator-trivy-config'>  Optional: true
      TRIVY_TIMEOUT:             <set to the key 'trivy.timeout' of config map 'trivy-operator-trivy-config'>           Optional: true
      TRIVY_SKIP_FILES:          <set to the key 'trivy.skipFiles' of config map 'trivy-operator-trivy-config'>         Optional: true
      TRIVY_SKIP_DIRS:           <set to the key 'trivy.skipDirs' of config map 'trivy-operator-trivy-config'>          Optional: true
      HTTP_PROXY:                <set to the key 'trivy.httpProxy' of config map 'trivy-operator-trivy-config'>         Optional: true
      HTTPS_PROXY:               <set to the key 'trivy.httpsProxy' of config map 'trivy-operator-trivy-config'>        Optional: true
      NO_PROXY:                  <set to the key 'trivy.noProxy' of config map 'trivy-operator-trivy-config'>           Optional: true
    Mounts:
      /tmp from tmp (rw)
      /tmp/scan from scanresult (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  scanresult:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

Environment:

chen-keinan commented 9 months ago

@jicunningham can you please share the trivy-operator logs? In addition, you can add a TTL for the scan jobs to clear them up.
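For example, the TTL can be set through the Helm values; a minimal sketch, with the duration purely illustrative:

```yaml
# values.yaml excerpt (illustrative) -- completed scan jobs (and
# their pods) are deleted automatically once older than this.
operator:
  scanJobTTL: "1h"
```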

jicunningham commented 9 months ago

@chen-keinan Here is what I am seeing from the trivy container that did the scanning:

2023-09-30T02:56:46.151Z    INFO    Need to update DB
2023-09-30T02:56:46.151Z    INFO    DB Repository: ghcr.io/aquasecurity/trivy-db
2023-09-30T02:56:46.151Z    INFO    Downloading DB...
40.07 MiB / 40.07 MiB [-------------------------------------------------] 100.00% 14.25 MiB p/s 3.0s

As for the logs of the other containers (the ones scanned) all there exists is what looks like a large public/private/cert text block.

Currently, the TTL is set to 24 hours. Here is the status of some of them: (screenshot)

chen-keinan commented 9 months ago

> Here is what I am seeing from the trivy container that did the scanning: [...]

Just to clarify, the TTL flag I mention above is for the jobs, not the reports.

github-actions[bot] commented 7 months ago

This issue is stale because it has been labeled with inactivity.

jicunningham commented 4 months ago

@chen-keinan can this be reopened? Was there a solution?

chen-keinan commented 4 months ago

> @chen-keinan can this be reopened? Was there a solution?

try setting scanJobTTL param

github-actions[bot] commented 2 months ago

This issue is stale because it has been labeled with inactivity.

FranAguiar commented 1 month ago

Is there something similar for the node-collector job? The node-collector job finishes successfully, but the pod and the job remain in the cluster in the Complete state.

I'm using the helm chart, version 0.23.3

chen-keinan commented 1 month ago

@FranAguiar do you see any errors in trivy-operator log ?

FranAguiar commented 1 month ago

Hello @chen-keinan, yes, there is an error in the operator. I did not check it before because the job was complete and I thought everything was OK. This is the error:

{"level":"error","ts":"2024-06-13T09:08:41Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-6f7db594d8","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-6f7db594d8","reconcileID":"1ae4b06b-6299-461c-adfc-1e4c44ccbb85","error":"unexpected end of JSON input","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}

chen-keinan commented 1 month ago

@FranAguiar can you please delete the node-collector job and restart trivy-operator

FranAguiar commented 1 month ago

Done. There is another error, specific to the node collector:

{"level":"error","ts":"2024-06-13T09:26:41Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"node-collector-765bcb57b","namespace":"trivy-system"},"namespace":"trivy-system","name":"node-collector-765bcb57b","reconcileID":"0f9aec59-4e0b-45bf-bef3-ca37e608a961","error":"failed to evaluate policies on Node : failed to run policy checks on resources","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}

failed to evaluate policies on Node : failed to run policy checks on resources

chen-keinan commented 1 month ago

@FranAguiar can you please share your configmaps ?

FranAguiar commented 1 month ago

Sure:

chen-keinan commented 1 month ago

can you do a quick test and switch the following params value to:

  trivy.useBuiltinRegoPolicies: "false"
  trivy.useEmbeddedRegoPolicies: "true"

and let me know if you get an error ?

FranAguiar commented 1 month ago

Same error

{"level":"error","ts":"2024-06-13T10:06:18Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"node-collector-6496488658","namespace":"trivy-system"},"namespace":"trivy-system","name":"node-collector-6496488658","reconcileID":"65cb6c3c-3df5-419f-a5c7-b9444bdf5b3c","error":"failed to evaluate policies on Node : failed to run policy checks on resources","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}

chen-keinan commented 1 month ago

Strange, the config looks OK and should work either way.

FranAguiar commented 1 month ago

I use the chart, version 0.23.3, installed with Terraform. My settings are below:

namespace     = "trivy-system"
chart_name    = "trivy-operator"
chart_version = "0.23.3"
repository    = "https://aquasecurity.github.io/helm-charts"
chart_values = [
  {
    name  = "serviceMonitor.enabled",
    value = true
  },
  {
    name  = "trivy.ignoreUnfixed",
    value = true
  },
  {
    name  = "operator.scanJobTTL",
    value = "1m"
  },
  {
    name = "operator.metricsVulnIdEnabled"
    value = true
  },
  {
    name = "operator.metricsExposedSecretInfo"
    value = true
  },
  {
    name = "operator.metricsConfigAuditInfo"
    value = true
  },
  {
    name = "operator.metricsRbacAssessmentInfo"
    value = true
  },
]

Somehow it is working; I have the reports in Prometheus/Grafana. The only issue is with the node-collector pod.

chen-keinan commented 1 month ago

Sure, but it is not unrelated: you are missing clusterInfraAssessment reports and compliance reports because of it.

FranAguiar commented 1 month ago

I tried both, but I will start with those for now. Maybe the pod/job issue is specific to this version; do you think it's worth trying an older one?

chen-keinan commented 1 month ago

I do not think so, as we have tests that check this and the release passed.

One thing I could suggest is to completely delete trivy-operator, including the CRDs, and reinstall it:

helm uninstall trivy-operator -n trivy-system
kubectl delete crd vulnerabilityreports.aquasecurity.github.io
kubectl delete crd exposedsecretreports.aquasecurity.github.io
kubectl delete crd configauditreports.aquasecurity.github.io
kubectl delete crd clusterconfigauditreports.aquasecurity.github.io
kubectl delete crd rbacassessmentreports.aquasecurity.github.io
kubectl delete crd infraassessmentreports.aquasecurity.github.io
kubectl delete crd clusterrbacassessmentreports.aquasecurity.github.io
kubectl delete crd clustercompliancereports.aquasecurity.github.io
kubectl delete crd clusterinfraassessmentreports.aquasecurity.github.io
kubectl delete crd sbomreports.aquasecurity.github.io
kubectl delete crd clustersbomreports.aquasecurity.github.io
kubectl delete crd clustervulnerabilityreports.aquasecurity.github.io

FranAguiar commented 1 month ago

Tried that, same behaviour.

I saw that the node collector can be disabled. What does it do? Is it optional?

(screenshot: Screenshot 2024-06-13 at 13 29 50)

chen-keinan commented 1 month ago

It's just a way to assign it to a specific node.

If you want to disable the node-collector, you can configure this.

FranAguiar commented 3 weeks ago

I configured that, but the node collector still runs and stays forever after completing.

chen-keinan commented 3 weeks ago

@FranAguiar you'll have to delete all the node-collector jobs, set the flag, and restart trivy-operator.

FranAguiar commented 3 weeks ago

Already tried that; the job reappears, runs, and completes. This is what I see in the trivy-operator logs:

{"level":"error","ts":"2024-06-19T11:33:22Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-7c49c8f64f","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-7c49c8f64f","reconcileID":"e2ee6e6e-4640-4cff-a810-2dc4ec56911a","error":"unexpected end of JSON input","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}

Helm chart 0.23.3 and GKE Version: v1.30.1-gke.1261000

chen-keinan commented 3 weeks ago

@FranAguiar strange, it should not reconcile nodes if InfraAssessmentScannerEnabled is disabled.

FranAguiar commented 3 weeks ago

InfraAssessmentScannerEnabled is enabled by default; I disabled it, and now the pod and the job do not stay.

chen-keinan commented 3 weeks ago

> InfraAssessmentScannerEnabled is enabled by default, I disabled it and now the pod and the job do not stay.

Thanks for the update. This is a workaround; we still need to investigate the root cause in your env.
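In Helm values terms, the workaround above looks roughly like this (parameter name assumed from the chart; note that disabling it also disables the infra-assessment and compliance reports):

```yaml
# values.yaml excerpt (workaround, not a fix) -- disables the infra
# assessment scanner so no node-collector jobs are created.
operator:
  infraAssessmentScannerEnabled: false
```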

FranAguiar commented 3 weeks ago

I enabled the InfraAssessmentScanner again because all the metrics were gone. With the same settings as above, I get the errors below in the operator pod:

{"level":"error","ts":"2024-06-24T08:26:36Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"node-collector-98d7c6d45","namespace":"trivy-system"},"namespace":"trivy-system","name":"node-collector-98d7c6d45","reconcileID":"78ba1f93-229f-4846-8eb0-f978a7b13d14","error":"failed to evaluate policies on Node : failed to run policy checks on resources","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}
{"level":"error","ts":"2024-06-24T07:42:44Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-cf59d7fff","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-cf59d7fff","reconcileID":"143546e1-c0cc-433b-b949-12a490dc09c5","error":"unexpected end of JSON input","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}

Anything else I can provide?

chen-keinan commented 3 weeks ago

which trivy-operator version are you using?

FranAguiar commented 3 weeks ago

Latest helm chart

NAME                                CHART VERSION   APP VERSION
aquasecurity/trivy-operator         0.23.3          0.21.3