sidpalas closed this issue 4 months ago.
I enabled log dev mode (--set operator.logDevMode=true in my helm install command) and see the following error in the operator pod logs:
2024-06-04T01:26:17Z ERROR reconciler.scan job Scan job container {"job": "trivy-system/scan-vulnerabilityreport-7b5bf85fb9", "container": "load-generator", "status.reason": "Error", "status.message": ""}
github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).completedContainers
/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:353
github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).SetupWithManager.(*ScanJobController).reconcileJobs.func1
/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:80
sigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/reconcile/reconcile.go:113
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:222
2024-06-04T01:26:17Z DEBUG reconciler.scan job Job complete {"job": "trivy-system/scan-vulnerabilityreport-7b5bf85fb9", "kind": "ReplicaSet", "name": "load-generator-python-76468d599c", "namespace": "demo-app", "podSpecHash": "846497f7b8"}
2024-06-04T01:26:17Z DEBUG reconciler.scan job VulnerabilityReports already exist {"job": "trivy-system/scan-vulnerabilityreport-7b5bf85fb9", "kind": "ReplicaSet", "name": "load-generator-python-76468d599c", "namespace": "demo-app", "podSpecHash": "846497f7b8", "owner": {"apiVersion": "apps/v1", "kind": "ReplicaSet", "namespace": "demo-app", "name": "load-generator-python-76468d599c"}}
2024-06-04T01:26:17Z DEBUG reconciler.scan job Deleting complete scan job {"job": "trivy-system/scan-vulnerabilityreport-7b5bf85fb9", "kind": "ReplicaSet", "name": "load-generator-python-76468d599c", "namespace": "demo-app", "podSpecHash": "846497f7b8", "owner": {"apiVersion": "apps/v1", "kind": "ReplicaSet", "namespace": "demo-app", "name": "load-generator-python-76468d599c"}}
Interesting that it is claiming the VulnerabilityReports already exist 🤔
Looking at the generated k8s scan job I also don't see a TRIVY_USERNAME or TRIVY_PASSWORD environment variable like I would expect based on: https://github.com/aquasecurity/trivy-operator/blob/b4bab3568480c5cbb3393dec5542ee694cf4263f/pkg/plugins/trivy/image.go#L208-L239
containers:
- args:
- -c
- trivy image --slow 'docker.io/sidpalas/devops-directive-kubernetes-course-load-generator-python:foobarbaz'
--scanners vuln,secret --image-config-scanners secret --skip-db-update --cache-dir
/tmp/trivy/.cache --quiet --list-all-pkgs --format json > /tmp/scan/result_load-generator.json
&& bzip2 -c /tmp/scan/result_load-generator.json | base64
command:
- /bin/sh
env:
- name: TRIVY_SEVERITY
valueFrom:
configMapKeyRef:
key: trivy.severity
name: trivy-operator-trivy-config
optional: true
- name: TRIVY_IGNORE_UNFIXED
valueFrom:
configMapKeyRef:
key: trivy.ignoreUnfixed
name: trivy-operator-trivy-config
optional: true
- name: TRIVY_OFFLINE_SCAN
valueFrom:
configMapKeyRef:
key: trivy.offlineScan
name: trivy-operator-trivy-config
optional: true
- name: TRIVY_JAVA_DB_REPOSITORY
valueFrom:
configMapKeyRef:
key: trivy.javaDbRepository
name: trivy-operator-trivy-config
optional: true
- name: TRIVY_TIMEOUT
valueFrom:
configMapKeyRef:
key: trivy.timeout
name: trivy-operator-trivy-config
optional: true
- name: TRIVY_SKIP_FILES
valueFrom:
configMapKeyRef:
key: trivy.skipFiles
name: trivy-operator-trivy-config
optional: true
- name: TRIVY_SKIP_DIRS
valueFrom:
configMapKeyRef:
key: trivy.skipDirs
name: trivy-operator-trivy-config
optional: true
- name: HTTP_PROXY
valueFrom:
configMapKeyRef:
key: trivy.httpProxy
name: trivy-operator-trivy-config
optional: true
- name: HTTPS_PROXY
valueFrom:
configMapKeyRef:
key: trivy.httpsProxy
name: trivy-operator-trivy-config
optional: true
- name: NO_PROXY
valueFrom:
configMapKeyRef:
key: trivy.noProxy
name: trivy-operator-trivy-config
optional: true
image: ghcr.io/aquasecurity/trivy:0.51.2
imagePullPolicy: IfNotPresent
name: load-generator
resources:
limits:
cpu: 500m
memory: 500M
requests:
cpu: 100m
memory: 100M
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
privileged: false
readOnlyRootFilesystem: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /tmp
name: tmp
- mountPath: /tmp/scan
name: scanresult
@sidpalas I'm cutting v0.21.2 today to include more logs around the root cause. please test with it again once it's released, so we have more context.
Thanks @chen-keinan 🙏
I just upgraded and now see the following ERROR in the trivy-operator pod logs:
2024-06-04T10:41:32Z ERROR reconciler.scan job Scan job container {"job": "trivy-system/scan-vulnerabilityreport-7b5bf85fb9", "container": "load-generator", "status.reason": "Error", "status.message": "2024-06-04T10:41:29Z\tFATAL\tFatal error\timage scan error: scan error: unable to initialize a scanner: unable to initialize an image scanner: unable to find the specified image \"docker.io/sidpalas/devops-directive-kubernetes-course-load-generator-python:foobarbaz\" in [\"docker\" \"containerd\" \"podman\" \"remote\"]: 4 errors occurred:\n\t* docker error: unable to inspect the image (docker.io/sidpalas/devops-directive-kubernetes-course-load-generator-python:foobarbaz): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\n\t* containerd error: containerd socket not found: /run/containerd/containerd.sock\n\t* podman error: unable to initialize Podman client: no podman socket found: stat podman/podman.sock: no such file or directory\n\t* remote error: GET https://index.docker.io/v2/sidpalas/devops-directive-kubernetes-course-load-generator-python/manifests/foobarbaz: UNAUTHORIZED: authentication required; [map[Action:pull Class: Name:sidpalas/devops-directive-kubernetes-course-load-generator-python Type:repository]]\n\n\n"}
Could it be related to the fact that I am running this on Civo cloud with k3s? Does Trivy assume a particular container runtime? (The error appears to try docker, then containerd, then podman.)
EDIT: I just tried the same configuration on a GKE cluster and got the same error:
2024-06-04T11:00:53Z ERROR reconciler.scan job Scan job container {"job": "trivy-system/scan-vulnerabilityreport-7b5bf85fb9", "container": "load-generator", "status.reason": "Error", "status.message": "2024-06-04T11:00:45Z\tFATAL\tFatal error\timage scan error: scan error: unable to initialize a scanner: unable to initialize an image scanner: unable to find the specified image \"docker.io/sidpalas/devops-directive-kubernetes-course-load-generator-python:foobarbaz\" in [\"docker\" \"containerd\" \"podman\" \"remote\"]: 4 errors occurred:\n\t* docker error: unable to inspect the image (docker.io/sidpalas/devops-directive-kubernetes-course-load-generator-python:foobarbaz): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\n\t* containerd error: containerd socket not found: /run/containerd/containerd.sock\n\t* podman error: unable to initialize Podman client: no podman socket found: stat podman/podman.sock: no such file or directory\n\t* remote error: GET https://index.docker.io/v2/sidpalas/devops-directive-kubernetes-course-load-generator-python/manifests/foobarbaz: UNAUTHORIZED: authentication required; [map[Action:pull Class: Name:sidpalas/devops-directive-kubernetes-course-load-generator-python Type:repository]]\n\n\n"}
@sidpalas have you followed all our options for private registries?
I am using a kubernetes.io/dockerconfigjson imagePullSecret referenced in the pod spec.
I have not tried filesystem scanning or an imagePullSecret referenced by a ServiceAccount.
Looking at the generated k8s scan job I also don't see a TRIVY_USERNAME or TRIVY_PASSWORD environment variable like I would expect
you mean you do not see the env var mapped to a secret? can you please share how you created the secret? can you write an example of the command here?
you mean you do not see the env var mapped to a secret?
Yes. Based on the docs and the code in trivy-operator/pkg/plugins/trivy/image.go, I expected trivy-operator to generate a secret in the trivy-system namespace and then use it to set the TRIVY_USERNAME and TRIVY_PASSWORD env vars in the scan job it creates. Is that generally correct? (I could be misunderstanding the code.)
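For reference, this is roughly the shape I expected to see in the scan job's env section (a sketch only; the key names and the generated secret name here are my guesses, not the operator's actual output):
- name: TRIVY_USERNAME
  valueFrom:
    secretKeyRef:
      key: load-generator.username                       # key naming is an assumption
      name: scan-vulnerabilityreport-7b5bf85fb9-regcred  # hypothetical generated secret name
- name: TRIVY_PASSWORD
  valueFrom:
    secretKeyRef:
      key: load-generator.password                       # key naming is an assumption
      name: scan-vulnerabilityreport-7b5bf85fb9-regcred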
The command I used to create my image pull secret (in the namespace of the application) was:
kubectl create secret -n demo-app docker-registry dockerconfigjson \
--docker-email=${DOCKER_EMAIL} \
--docker-username=${DOCKER_USERNAME} \
--docker-password=${DOCKER_PASSWORD} \
--docker-server=docker.io
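For context, that command produces a secret whose .dockerconfigjson payload keys the credentials by the --docker-server value, roughly like this (values elided):
{
  "auths": {
    "docker.io": {
      "username": "...",
      "password": "...",
      "email": "...",
      "auth": "<base64 of username:password>"
    }
  }
}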
I am confident the imagePullSecret is correct because the pods corresponding to the deployment with the private image are able to pull the image successfully.
Does that provide the info you were looking for?
@sidpalas I see that you created the secret in the demo-app namespace, but the workload descriptor example you put above is deployed to the default namespace
Sorry, I stripped out the namespace (and other things such as env vars) from the deployment in my initial issue since it didn't seem relevant at the time. The deployment and the image pull secret are both in the demo-app namespace.
@sidpalas you mean that you have confirmed it does not work when both (secret and deployment) are in the same namespace?
note: you can use the "Fourth Option: Define Secrets through Trivy-Operator configuration" from the docs and specify a secret in a different namespace
Correct. The setup I have is: the Deployment and the imagePullSecret are both in the demo-app namespace.
The resulting behavior I'm seeing: the application pods pull the private image successfully, but the scan job fails.
I did try the fourth option (specifying operator.privateRegistryScanSecretNames in my helm values file, referencing my image pull secret), but the scan job still failed.
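For completeness, the values I used for that attempt looked something like this (shape based on my reading of the docs, so treat it as approximate):
operator:
  privateRegistryScanSecretNames:
    demo-app: dockerconfigjson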
@sidpalas strange, trivy-operator has e2e tests for private registries.
is there any additional info you can share about your env (on-prem, cloud, or else) which could help?
It is strange!
My setup is pretty plain. I initially was using a managed cluster on Civo cloud, but tested with GKE and saw the same result.
I took a look at the end to end test you linked. The pod spec (https://github.com/aquasecurity/trivy-operator/blob/b83c17837c80f043b86321cf056ee17ada4a08b7/tests/e2e/image-private-registries-config/workload/00-pod.yaml) doesn't actually contain an imagePullSecret. Should it?
@sidpalas I'm happy to debug it over a call with you. find me on the trivy-operator channel on the Aqua Security Slack and we could schedule something, if you want of course
Thank you for the offer! (and for all the time spent helping me so far!).
I'll distill my example into its simplest form (essentially duplicating the e2e test but with my own private repo) and see if I can get it working. If not, I'll hop into the aquasec slack! 🙏
sure, feel free to ping me. in any case I'll give private registry scanning another manual test
Was able to replicate the failure in a KinD cluster (v1.30.0):
Secret created with:
kubectl create secret docker-registry artcred \
--docker-email=${EMAIL} \
--docker-username=${USERNAME} \
--docker-password=${PASSWORD} \
--docker-server=docker.io
Trivy Operator installed with:
helm upgrade --install trivy-operator aqua/trivy-operator \
--namespace trivy-system \
--create-namespace \
--version 0.23.2
Test pods:
apiVersion: v1
kind: Pod
metadata:
name: mypod
spec:
containers:
- name: mycontainer
image: docker.io/sidpalas/devops-directive-kubernetes-course-load-generator-python:foobarbaz
imagePullPolicy: Always
command: ["/bin/sh"]
args:
- -c
- sleep 9999
imagePullSecrets:
- name: artcred
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: private-reg
imagePullSecrets:
- name: artcred
---
apiVersion: v1
kind: Pod
metadata:
name: mypod-sa
spec:
containers:
- name: mycontainer
image: docker.io/sidpalas/devops-directive-kubernetes-course-load-generator-python:foobarbaz
imagePullPolicy: Always
command: ["/bin/sh"]
args:
- -c
- sleep 9999
serviceAccountName: private-reg
---
As before, the pods using the private image successfully pull and run it, but the Trivy scan jobs fail.
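For anyone following along, I was confirming the failures with something like:
kubectl get jobs -n trivy-system
kubectl logs -n trivy-system job/<scan-job-name>
kubectl get vulnerabilityreports -A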
@sidpalas Thanks for the update. I will run a quick test manually tomorrow and update you
Thank you!
Also, just as another data point: initially I was using a token with read-only permissions on Docker Hub, but created a new one with read/write permissions. Both had the same failed result.
To further reduce the delta between the e2e test setup and mine, I also tested with a few more variations.
I'm still seeing the same result though. Really stumped as to what could be the root cause here... 🤔
Ah, I think the issue was with my --docker-server in the imagePullSecret!
I had used:
--docker-server=docker.io
But after changing it to:
--docker-server=https://index.docker.io/v1
the jobs succeeded and the vulnerability scan reports show up!
Interesting that docker.io works fine for k8s pulling the image. Presumably Trivy was being stricter about the requirement for it to be a fully qualified domain name and discarding my secret as invalid.
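For anyone hitting the same thing, the working version of the secret creation command is the same as before with only the server changed:
kubectl create secret docker-registry artcred \
  --docker-email=${EMAIL} \
  --docker-username=${USERNAME} \
  --docker-password=${PASSWORD} \
  --docker-server=https://index.docker.io/v1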
This is likely the culprit!
GetServerFromDockerAuthKey("docker.io") => docker.io
vs.
GetServerFromDockerAuthKey("https://index.docker.io/v1") => index.docker.io
It looks like this is something that may be special-cased in upstream Kubernetes so that docker.io properly resolves to index.docker.io 😳: https://github.com/kubernetes/kubernetes/blob/9e2075b3c87061d25759b0ad112266f03601afd8/pkg/credentialprovider/keyring.go#L128-L158
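To make the mismatch concrete, here's a minimal standalone sketch (my own illustration, loosely mirroring the keyring behavior linked above, not the actual trivy-operator code):
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseServer extracts the registry host from a docker auth key, roughly the
// way a credential keyring would: add a scheme if missing, then keep the host.
func parseServer(authKey string) string {
	s := authKey
	if !strings.Contains(s, "://") {
		s = "https://" + s // scheme-less keys like "docker.io"
	}
	u, err := url.Parse(s)
	if err != nil {
		return authKey
	}
	return u.Host
}

func main() {
	// A secret keyed by "docker.io" stays "docker.io", but Docker Hub image
	// pulls actually hit index.docker.io, so a host-keyed lookup misses it.
	fmt.Println(parseServer("docker.io"))                  // docker.io
	fmt.Println(parseServer("https://index.docker.io/v1")) // index.docker.io
}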
@chen-keinan -- Do you think it's worth special-casing this in trivy-operator so the behavior matches Kubernetes (or adding a warning to the docs)?
@sidpalas thanks for the update. yes, I think it's worth updating the docs or supporting this in the code to avoid such issues in the future. feel free to raise a PR if you have time
@sidpalas do you want to keep this issue open?
I'd say we can either close it or I can update the title/description to reflect the narrower scope of the issue (i.e. only impacting image pull secrets using the shorthand registry name for docker.io).
Do you have a preference? I'm fine with either!
Thanks for the update. Closing the issue.
What steps did you take and what happened:
Install the latest version of the helm chart:
Deploy a workload using a private image from Docker Hub:
The Deployment successfully uses the imagePullSecret, but the corresponding scan job for the private image fails.
What did you expect to happen:
Trivy to use the imagePullSecret as described here:
Anything else you would like to add:
I also tried setting vulnerabilityReports.scanJobsInSameNamespace: true, but the job still failed.
Environment:
Trivy-Operator version (trivy-operator version): ghcr.io/aquasecurity/trivy-operator:0.21.1
Kubernetes version (kubectl version): v1.28.7+k3s1