aquasecurity / trivy-operator

Kubernetes-native security toolkit
https://aquasecurity.github.io/trivy-operator/latest
Apache License 2.0

trivy operator constantly throwing reconcile errors #2137

Open lkaluza-fadi opened 5 months ago

lkaluza-fadi commented 5 months ago

What steps did you take and what happened:

Upgraded the Helm chart from 0.23.1 to 0.23.3.

What did you expect to happen:

That everything works smoothly

Anything else you would like to add:

This is the error that we get:

{"level":"error","ts":"2024-06-12T08:42:18Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-785c48587c","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-785c48587c","reconcileID":"624c0d2f-2cdb-4ea3-9d13-052f27ee7e87","error":"illegal base64 data at input byte 6; illegal base64 data at input byte 6","errorCauses":[{"error":"illegal base64 data at input byte 6"},{"error":"illegal base64 data at input byte 6"}],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}
{"level":"error","ts":"2024-06-12T08:42:19Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-7cb7c95664","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-7cb7c95664","reconcileID":"5a67a1d8-fc0b-4e90-9991-d09bc2ba55e5","error":"illegal base64 data at input byte 6","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}
{"level":"error","ts":"2024-06-12T08:42:50Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-6f777d44b8","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-6f777d44b8","reconcileID":"9dba26aa-115a-4787-8291-5ead70458e94","error":"illegal base64 data at input byte 6","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}
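
Note for readers: the message "illegal base64 data at input byte 6" has the wording Go's `encoding/base64` package uses when the input is not valid base64. The sketch below only reproduces that message. The sample string is hypothetical (the thread never shows the real scan pod output), and `decodeError` is an illustrative helper, not operator code.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeError returns the error text produced by base64-decoding s,
// or "" if s decodes cleanly.
func decodeError(s string) string {
	if _, err := base64.StdEncoding.DecodeString(s); err != nil {
		return err.Error()
	}
	return ""
}

func main() {
	// Hypothetical pod output: plain text instead of base64 data.
	// The space at index 6 is the first character outside the base64 alphabet.
	fmt.Println(decodeError("unable to scan image"))

	// Valid base64 ("e30=" encodes "{}") decodes without error.
	fmt.Println(decodeError("e30=") == "")
}
```

Any input with six base64-alphabet characters followed by an invalid one would fail at the same offset, so the offset alone does not identify the actual output.
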
chen-keinan commented 5 months ago

@lkaluza-fadi please clean up all scan jobs and restart the operator.

kubectl delete jobs -n trivy-system $(kubectl get jobs -n trivy-system -o custom-columns=:.metadata.name)

lkaluza-fadi commented 5 months ago

@chen-keinan After deleting the jobs, everything seems to be fine, but when the jobs were completed, the reconciliation errors returned.

chen-keinan commented 5 months ago

@lkaluza-fadi is the pod stuck in Completed status?

lkaluza-fadi commented 5 months ago

Yes, that's correct.

chen-keinan commented 5 months ago

@lkaluza-fadi can you please get its output and send it over? (You can send it to me on Slack if you do not want to expose it here.)

kubectl logs <scan-pod-name> -n trivy-system

lkaluza-fadi commented 5 months ago

Unfortunately, the pod logs are no longer available.

chen-keinan commented 5 months ago

@lkaluza-fadi are you able to reproduce it?

lkaluza-fadi commented 5 months ago

I tried to reproduce it, but the logs are gone again.

chen-keinan commented 5 months ago

Is the pod stuck in Completed status? If so, the logs should be there.

chen-keinan commented 5 months ago

@lkaluza-fadi do you get any reports?

lkaluza-fadi commented 5 months ago

@chen-keinan yes, I just sent it to you via email, at the address listed in your profile.

chen-keinan commented 5 months ago

@lkaluza-fadi can you please do another check:

  1. uninstall trivy-operator : helm uninstall trivy-operator -n trivy-system

  2. delete all CRDs:

    kubectl delete crd vulnerabilityreports.aquasecurity.github.io
    kubectl delete crd exposedsecretreports.aquasecurity.github.io
    kubectl delete crd configauditreports.aquasecurity.github.io
    kubectl delete crd clusterconfigauditreports.aquasecurity.github.io
    kubectl delete crd rbacassessmentreports.aquasecurity.github.io
    kubectl delete crd infraassessmentreports.aquasecurity.github.io
    kubectl delete crd clusterrbacassessmentreports.aquasecurity.github.io
    kubectl delete crd clustercompliancereports.aquasecurity.github.io
    kubectl delete crd clusterinfraassessmentreports.aquasecurity.github.io
    kubectl delete crd sbomreports.aquasecurity.github.io
    kubectl delete crd clustersbomreports.aquasecurity.github.io
    kubectl delete crd clustervulnerabilityreports.aquasecurity.github.io
  3. make sure no pods or jobs are running in the trivy-system namespace

  4. re-install trivy-operator with helm and set this flag to false

lkaluza-fadi commented 5 months ago

Done that!

lkaluza-fadi commented 5 months ago

What has changed so far is that the pods for the jobs are now removed once they finish, and as a result the operator no longer logs any reconcile errors.

chen-keinan commented 5 months ago

@lkaluza-fadi I'm not sure I understand the question. Are you getting reports after the change above?

lkaluza-fadi commented 5 months ago

@chen-keinan to wrap this up: the reconcile errors are back, but they now look a bit different:

{"level":"error","ts":"2024-06-24T10:44:56Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-6f849756bb","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-6f849756bb","reconcileID":"71886afd-c52b-45e5-a36c-b7737c65d5cf","error":"invalid character 'u' looking for beginning of value; invalid character 'u' looking for beginning of value","errorCauses":[{"error":"invalid character 'u' looking for beginning of value"},{"error":"invalid character 'u' looking for beginning of value"}],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}
{"level":"error","ts":"2024-06-24T10:47:00Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-65df45bb54","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-65df45bb54","reconcileID":"beaf874f-73ca-473d-875e-ea520c90018b","error":"invalid character 'u' looking for beginning of value","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}
{"level":"error","ts":"2024-06-24T10:47:01Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-8448d97cbb","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-8448d97cbb","reconcileID":"dfdd5a68-09c3-45c5-a880-f90bbb0f88cb","error":"invalid character 'u' looking for beginning of value","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}
{"level":"error","ts":"2024-06-24T10:59:04Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-599dbf4488","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-599dbf4488","reconcileID":"8ecb8af1-af3e-4fcb-b2ff-8294f41b7e63","error":"invalid character 'u' looking for beginning of value","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}

So, back to your question: we are getting reports after the changes. But I think we are back at the beginning, getting the reconcile error again, just in a different flavor!
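
Note for readers: "invalid character 'u' looking for beginning of value" has the wording Go's `encoding/json` package uses when the input does not start with a JSON value. The sketch below reproduces that message with a hypothetical plain-text string. Both this message and the earlier base64 one would be consistent with scan output that is plain text beginning with a word such as "unable", but the thread never shows the real pod output, so that is only a guess. `parseError` is an illustrative helper, not operator code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseError returns the error text from decoding s as JSON,
// or "" if s is valid JSON.
func parseError(s string) string {
	var v interface{}
	if err := json.Unmarshal([]byte(s), &v); err != nil {
		return err.Error()
	}
	return ""
}

func main() {
	// Hypothetical pod output: a plain-text message instead of a JSON report.
	fmt.Println(parseError("unable to scan image"))

	// A well-formed JSON document parses without error.
	fmt.Println(parseError(`{"Results": []}`) == "")
}
```
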

chen-keinan commented 5 months ago

@lkaluza-fadi I'll be happy to jump on a Zoom call to look at the issue; it's very difficult to find out what is wrong in your environment.

lkaluza-fadi commented 5 months ago

@chen-keinan I'm fine with that. When would suit you?

chen-keinan commented 5 months ago

@lkaluza-fadi find me on Slack; we can discuss scheduling details there.

chen-keinan commented 5 months ago

@lkaluza-fadi I mean find me via the Aqua Security Slack.

lkaluza-fadi commented 5 months ago

@chen-keinan I'm not using Slack. How do I do that?

daanschipper commented 4 months ago

This seems related to #1792.

mib93 commented 4 months ago

Hi, I'm facing the same problem (screenshot attached).

Is there any solution?

Xeroxxx commented 2 months ago

My cluster is starting to have the same issue. I have already reinstalled trivy-operator.

EDIT: Running on Kubernetes 1.31, where the SuccessPolicy behavior changed. See https://github.com/aquasecurity/trivy-operator/issues/2251

benni-as commented 2 months ago

Same error:

{
  "level": "error",
  "ts": "2024-09-05T09:42:22Z",
  "msg": "Reconciler error",
  "controller": "job",
  "controllerGroup": "batch",
  "controllerKind": "Job",
  "Job": {
    "name": "scan-vulnerabilityreport-86c64f59b9",
    "namespace": "trivy-operator"
  },
  "namespace": "trivy-operator",
  "name": "scan-vulnerabilityreport-86c64f59b9",
  "reconcileID": "18547f15-5d01-42ed-b1b4-f208335a0fae",
  "error": "unrecognized scan job condition: SuccessCriteriaMet",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:222"
}

I am using the latest Helm chart, 0.24.1, and I don't see any vulnerability or SBOM reports.
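
Note for readers: the error text names a Job condition type, SuccessCriteriaMet, and the reports of it in this thread come from clusters on Kubernetes 1.31. One plausible reading is that the job handler fails on any condition type it does not know. The sketch below illustrates that failure mode and a more tolerant alternative. The `condition` type and both functions are hypothetical and simplified; this is not the operator's actual code or the fix tracked in #2251.

```go
package main

import "fmt"

// condition mirrors the two fields of a Kubernetes Job condition used here.
type condition struct {
	Type   string
	Status string
}

// strictOutcome is a hypothetical handler that treats every unknown
// condition type as an error.
func strictOutcome(conds []condition) (string, error) {
	for _, c := range conds {
		if c.Status != "True" {
			continue
		}
		switch c.Type {
		case "Complete":
			return "succeeded", nil
		case "Failed":
			return "failed", nil
		default:
			return "", fmt.Errorf("unrecognized scan job condition: %s", c.Type)
		}
	}
	return "pending", nil
}

// tolerantOutcome reacts only to the terminal conditions it knows and
// skips every other condition type.
func tolerantOutcome(conds []condition) string {
	for _, c := range conds {
		if c.Status != "True" {
			continue
		}
		switch c.Type {
		case "Complete":
			return "succeeded"
		case "Failed":
			return "failed"
		}
	}
	return "pending"
}

func main() {
	// Assumed condition list for a finished Job on a newer cluster.
	conds := []condition{
		{Type: "SuccessCriteriaMet", Status: "True"},
		{Type: "Complete", Status: "True"},
	}
	_, err := strictOutcome(conds)
	fmt.Println(err)
	fmt.Println(tolerantOutcome(conds))
}
```

With the strict variant the job is never classified as finished, which would match the missing reports described above.
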

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been labeled with inactivity.

William-Rom commented 2 weeks ago

I am experiencing the same issue. Disabling the scanJobCompressLogs flag did not help. I'm on version 0.22.

Itchimonji commented 1 week ago

Hey, I have the same problem. I am running the Trivy-Operator Helm Chart on Kubernetes Version v1.31.1

Chart Version: "2.5.0"

Helm Values:

trivy-operator:
  log_level: INFO

  serviceMonitor:
    enabled: true

  grafana:
    namespace: prometheus
    dashboards:
      enabled: true
      label: grafana_dashboard
      value: "1"
    folder:
      annotation: k8s-sidecar-target-directory
      name: /tmp/dashboards/site-reliability

  persistence:
    enabled: true
    storageClass: csi-default

  namespaceScanner:
    clusterWide: true
    integrations:
      policyreport: false

  clusterScanner:
    enabled: true
    crontab: "*/1 * * * *"

  trivy:
    ignoreUnfixed: true
  operator:
    metricsVulnIdEnabled: true

One of the error messages:

{"level":"error","ts":"2024-11-18T06:02:19Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-77967db879","namespace":"trivy-operator"},"namespace":"trivy-operator","name":"scan-vulnerabilityreport-77967db879","reconcileID":"910004aa-b41b-4196-8e8e-b58f0353be2d","error":"unrecognized scan job condition: SuccessCriteriaMet","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:235"}
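
Note for readers: several different reconcile errors appear in this thread, so it can help to tally operator log lines by their `error` field before reporting. The sketch below is a minimal way to do that. `countErrors` and `logEntry` are hypothetical names, and the sample lines are shortened stand-ins for real log output.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds the two fields of interest from one operator log line.
type logEntry struct {
	Msg   string `json:"msg"`
	Error string `json:"error"`
}

// countErrors tallies "Reconciler error" lines by their error text.
func countErrors(logs string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(logs))
	// Log lines with stack traces can be long, so allow up to 1 MiB per line.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		var e logEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue // skip lines that are not JSON
		}
		if e.Msg == "Reconciler error" {
			counts[e.Error]++
		}
	}
	return counts
}

func main() {
	// Shortened sample input; the info line and the plain-text line are
	// invented for illustration.
	logs := `{"level":"error","msg":"Reconciler error","error":"unrecognized scan job condition: SuccessCriteriaMet"}
{"level":"error","msg":"Reconciler error","error":"unrecognized scan job condition: SuccessCriteriaMet"}
{"level":"info","msg":"Starting workers"}
not json`
	for msg, n := range countErrors(logs) {
		fmt.Printf("%d x %s\n", n, msg)
	}
}
```
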
afdesk commented 1 week ago

Thanks for the report! Unfortunately, it's a known issue; you can track it in #2251.