aquasecurity / trivy-operator

Kubernetes-native security toolkit
https://aquasecurity.github.io/trivy-operator/latest
Apache License 2.0

trivy-operator constantly throwing reconcile errors #2137

Open lkaluza-fadi opened 3 weeks ago

lkaluza-fadi commented 3 weeks ago

What steps did you take and what happened:

Upgraded the Helm chart from version 0.23.1 to 0.23.3

What did you expect to happen:

That everything works smoothly

Anything else you would like to add:

This is the error that we get:

{"level":"error","ts":"2024-06-12T08:42:18Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-785c48587c","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-785c48587c","reconcileID":"624c0d2f-2cdb-4ea3-9d13-052f27ee7e87","error":"illegal base64 data at input byte 6; illegal base64 data at input byte 6","errorCauses":[{"error":"illegal base64 data at input byte 6"},{"error":"illegal base64 data at input byte 6"}],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}
{"level":"error","ts":"2024-06-12T08:42:19Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-7cb7c95664","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-7cb7c95664","reconcileID":"5a67a1d8-fc0b-4e90-9991-d09bc2ba55e5","error":"illegal base64 data at input byte 6","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}
{"level":"error","ts":"2024-06-12T08:42:50Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-6f777d44b8","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-6f777d44b8","reconcileID":"9dba26aa-115a-4787-8291-5ead70458e94","error":"illegal base64 data at input byte 6","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}
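
For reference, the text "illegal base64 data at input byte N" has the same format as the error Go's standard `encoding/base64` package returns when input contains a character outside the base64 alphabet. The sketch below only reproduces that message. The input string is made up, and the idea that the operator hits this while decoding scan job output is an assumption, not something confirmed in this thread.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeErr returns the error text produced when s is decoded as
// standard base64, or "" when s decodes cleanly.
func decodeErr(s string) string {
	if _, err := base64.StdEncoding.DecodeString(s); err != nil {
		return err.Error()
	}
	return ""
}

func main() {
	// Hypothetical plain-text input (not taken from the cluster).
	// The ':' at index 6 is not part of the base64 alphabet.
	fmt.Println(decodeErr("failed: not base64"))
	// prints: illegal base64 data at input byte 6
}
```

If this is the same failure, "input byte 6" would mean the first six characters of whatever was decoded happened to be valid base64 characters and the seventh was not, which is typical of plain text being read as base64.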
chen-keinan commented 2 weeks ago

@lkaluza-fadi Please clean up all scan-jobs and restart operator.

kubectl delete jobs -n trivy-system $(kubectl get jobs -n trivy-system -o custom-columns=:.metadata.name)
lkaluza-fadi commented 2 weeks ago

@chen-keinan After deleting the jobs, everything seems to be fine, but when the jobs were completed, the reconciliation errors returned.

chen-keinan commented 2 weeks ago

@lkaluza-fadi is the pod stuck in Completed status?

lkaluza-fadi commented 2 weeks ago

Yes, that's correct.

chen-keinan commented 2 weeks ago

@lkaluza-fadi can you please get its output and send it? (You can send it to me on Slack if you do not want to expose it here.)

kubectl logs <scan-pod-name> -n trivy-system
lkaluza-fadi commented 2 weeks ago

Unfortunately, there are no pod logs anymore.

chen-keinan commented 2 weeks ago

@lkaluza-fadi are you able to reproduce it?

lkaluza-fadi commented 2 weeks ago

I tried to reproduce it, but the logs are gone again.

chen-keinan commented 2 weeks ago

Is the pod stuck in Completed status? If so, the logs should be there.

chen-keinan commented 1 week ago

@lkaluza-fadi do you get any reports?

lkaluza-fadi commented 1 week ago

@chen-keinan yes, I just sent it to you via email, at the address listed in your profile.

chen-keinan commented 1 week ago

@lkaluza-fadi can you please do another check:

  1. uninstall trivy-operator: helm uninstall trivy-operator -n trivy-system

  2. delete all CRDs:

    kubectl delete crd vulnerabilityreports.aquasecurity.github.io
    kubectl delete crd exposedsecretreports.aquasecurity.github.io
    kubectl delete crd configauditreports.aquasecurity.github.io
    kubectl delete crd clusterconfigauditreports.aquasecurity.github.io
    kubectl delete crd rbacassessmentreports.aquasecurity.github.io
    kubectl delete crd infraassessmentreports.aquasecurity.github.io
    kubectl delete crd clusterrbacassessmentreports.aquasecurity.github.io
    kubectl delete crd clustercompliancereports.aquasecurity.github.io
    kubectl delete crd clusterinfraassessmentreports.aquasecurity.github.io
    kubectl delete crd sbomreports.aquasecurity.github.io
    kubectl delete crd clustersbomreports.aquasecurity.github.io
    kubectl delete crd clustervulnerabilityreports.aquasecurity.github.io
  3. make sure no pods or jobs are running in the trivy-system namespace

  4. reinstall trivy-operator with helm and set this flag to false

lkaluza-fadi commented 1 week ago

done that!

lkaluza-fadi commented 1 week ago

What has changed so far is that the pods for the jobs are now gone once the jobs are done, and for that reason the operator is no longer logging any reconcile errors.

chen-keinan commented 1 week ago

@lkaluza-fadi not sure I understand the question. Are you getting reports after the change above?

lkaluza-fadi commented 1 week ago

@chen-keinan to wrap this up: the reconcile errors are back, but they are now a bit different.

{"level":"error","ts":"2024-06-24T10:44:56Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-6f849756bb","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-6f849756bb","reconcileID":"71886afd-c52b-45e5-a36c-b7737c65d5cf","error":"invalid character 'u' looking for beginning of value; invalid character 'u' looking for beginning of value","errorCauses":[{"error":"invalid character 'u' looking for beginning of value"},{"error":"invalid character 'u' looking for beginning of value"}],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}
{"level":"error","ts":"2024-06-24T10:47:00Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-65df45bb54","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-65df45bb54","reconcileID":"beaf874f-73ca-473d-875e-ea520c90018b","error":"invalid character 'u' looking for beginning of value","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}
{"level":"error","ts":"2024-06-24T10:47:01Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-8448d97cbb","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-8448d97cbb","reconcileID":"dfdd5a68-09c3-45c5-a880-f90bbb0f88cb","error":"invalid character 'u' looking for beginning of value","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}
{"level":"error","ts":"2024-06-24T10:59:04Z","msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-599dbf4488","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-599dbf4488","reconcileID":"8ecb8af1-af3e-4fcb-b2ff-8294f41b7e63","error":"invalid character 'u' looking for beginning of value","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.3/pkg/internal/controller/controller.go:222"}

So, back to your question: we are getting reports after the changes. But I think we are back to the beginning, getting this reconcile error again, just in a different flavor!
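
For reference, "invalid character 'u' looking for beginning of value" has the same format as the syntax error Go's standard `encoding/json` package returns when the first non-whitespace character of the input cannot start a JSON value. The sketch below only reproduces the message. The input string is invented, and it is an assumption that the operator is parsing scan job output as JSON at this point.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonErr returns the error text produced when s is parsed as JSON,
// or "" when s is valid JSON.
func jsonErr(s string) string {
	var v interface{}
	if err := json.Unmarshal([]byte(s), &v); err != nil {
		return err.Error()
	}
	return ""
}

func main() {
	// Hypothetical plain-text input (not taken from the cluster).
	// A JSON value cannot begin with the letter 'u'.
	fmt.Println(jsonErr("unexpected output"))
	// prints: invalid character 'u' looking for beginning of value
}
```

If this is the same failure, whatever was parsed began with a plain-text word starting with "u" rather than `{` or `[`, so the raw scan pod output would be the thing to look at.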

chen-keinan commented 1 week ago

@lkaluza-fadi I'll be happy to jump on a Zoom call to look at the issue. It's very difficult to find what is wrong in your environment otherwise.

lkaluza-fadi commented 1 week ago

@chen-keinan I'm fine with that. When would suit you?

chen-keinan commented 1 week ago

@lkaluza-fadi find me on Slack; we can discuss scheduling details there.

chen-keinan commented 1 week ago

@lkaluza-fadi I mean find me via the Aqua Security Slack.

lkaluza-fadi commented 1 week ago

@chen-keinan I'm not using Slack. How do I do that?