aquasecurity / trivy-operator

Kubernetes-native security toolkit
https://aquasecurity.github.io/trivy-operator/latest
Apache License 2.0

Vulnerability scan job unable to pull from private dockerhub container registry if secret uses `--docker-server=docker.io` #2120

Closed sidpalas closed 4 months ago

sidpalas commented 4 months ago

What steps did you take and what happened:

Install latest version of helm chart:

        helm upgrade --install trivy-operator aqua/trivy-operator \
          --namespace trivy-system \
          --create-namespace \
          --version 0.23.1

Deploy a workload using a private image from dockerhub:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-generator-python
spec:
  replicas: 1
  selector:
    matchLabels:
      app: load-generator-python
  template:
    metadata:
      labels:
        app: load-generator-python
    spec:
      containers:
        - name: load-generator
          image: docker.io/sidpalas/devops-directive-kubernetes-course-load-generator-python:foobarbaz
      imagePullSecrets:
        - name: dockerconfigjson

The Deployment successfully uses the imagePullSecret, but the corresponding scan job for the private image fails.

What did you expect to happen:

Trivy to use imagePullSecret as described here:

Anything else you would like to add:

Environment:

sidpalas commented 4 months ago

I enabled log dev mode (--set operator.logDevMode=true in my helm install command) and see the following error in the operator pod logs:

2024-06-04T01:26:17Z    ERROR   reconciler.scan job     Scan job container      {"job": "trivy-system/scan-vulnerabilityreport-7b5bf85fb9", "container": "load-generator", "status.reason": "Error", "status.message": ""}
github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).completedContainers
        /home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:353
github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).SetupWithManager.(*ScanJobController).reconcileJobs.func1
        /home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:80
sigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile
        /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/reconcile/reconcile.go:113
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.2/pkg/internal/controller/controller.go:222
2024-06-04T01:26:17Z    DEBUG   reconciler.scan job     Job complete    {"job": "trivy-system/scan-vulnerabilityreport-7b5bf85fb9", "kind": "ReplicaSet", "name": "load-generator-python-76468d599c", "namespace": "demo-app", "podSpecHash": "846497f7b8"}
2024-06-04T01:26:17Z    DEBUG   reconciler.scan job     VulnerabilityReports already exist      {"job": "trivy-system/scan-vulnerabilityreport-7b5bf85fb9", "kind": "ReplicaSet", "name": "load-generator-python-76468d599c", "namespace": "demo-app", "podSpecHash": "846497f7b8", "owner": {"apiVersion": "apps/v1", "kind": "ReplicaSet", "namespace": "demo-app", "name": "load-generator-python-76468d599c"}}
2024-06-04T01:26:17Z    DEBUG   reconciler.scan job     Deleting complete scan job      {"job": "trivy-system/scan-vulnerabilityreport-7b5bf85fb9", "kind": "ReplicaSet", "name": "load-generator-python-76468d599c", "namespace": "demo-app", "podSpecHash": "846497f7b8", "owner": {"apiVersion": "apps/v1", "kind": "ReplicaSet", "namespace": "demo-app", "name": "load-generator-python-76468d599c"}}

Interesting that it claims VulnerabilityReports already exist 🤔

sidpalas commented 4 months ago

Looking at the generated k8s scan job I also don't see a TRIVY_USERNAME or TRIVY_PASSWORD environment variable like I would expect based on: https://github.com/aquasecurity/trivy-operator/blob/b4bab3568480c5cbb3393dec5542ee694cf4263f/pkg/plugins/trivy/image.go#L208-L239

      containers:
      - args:
        - -c
        - trivy image --slow 'docker.io/sidpalas/devops-directive-kubernetes-course-load-generator-python:foobarbaz'
          --scanners vuln,secret --image-config-scanners secret   --skip-db-update  --cache-dir
          /tmp/trivy/.cache --quiet --list-all-pkgs --format json > /tmp/scan/result_load-generator.json
          &&  bzip2 -c /tmp/scan/result_load-generator.json | base64
        command:
        - /bin/sh
        env:
        - name: TRIVY_SEVERITY
          valueFrom:
            configMapKeyRef:
              key: trivy.severity
              name: trivy-operator-trivy-config
              optional: true
        - name: TRIVY_IGNORE_UNFIXED
          valueFrom:
            configMapKeyRef:
              key: trivy.ignoreUnfixed
              name: trivy-operator-trivy-config
              optional: true
        - name: TRIVY_OFFLINE_SCAN
          valueFrom:
            configMapKeyRef:
              key: trivy.offlineScan
              name: trivy-operator-trivy-config
              optional: true
        - name: TRIVY_JAVA_DB_REPOSITORY
          valueFrom:
            configMapKeyRef:
              key: trivy.javaDbRepository
              name: trivy-operator-trivy-config
              optional: true
        - name: TRIVY_TIMEOUT
          valueFrom:
            configMapKeyRef:
              key: trivy.timeout
              name: trivy-operator-trivy-config
              optional: true
        - name: TRIVY_SKIP_FILES
          valueFrom:
            configMapKeyRef:
              key: trivy.skipFiles
              name: trivy-operator-trivy-config
              optional: true
        - name: TRIVY_SKIP_DIRS
          valueFrom:
            configMapKeyRef:
              key: trivy.skipDirs
              name: trivy-operator-trivy-config
              optional: true
        - name: HTTP_PROXY
          valueFrom:
            configMapKeyRef:
              key: trivy.httpProxy
              name: trivy-operator-trivy-config
              optional: true
        - name: HTTPS_PROXY
          valueFrom:
            configMapKeyRef:
              key: trivy.httpsProxy
              name: trivy-operator-trivy-config
              optional: true
        - name: NO_PROXY
          valueFrom:
            configMapKeyRef:
              key: trivy.noProxy
              name: trivy-operator-trivy-config
              optional: true
        image: ghcr.io/aquasecurity/trivy:0.51.2
        imagePullPolicy: IfNotPresent
        name: load-generator
        resources:
          limits:
            cpu: 500m
            memory: 500M
          requests:
            cpu: 100m
            memory: 100M
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          privileged: false
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /tmp
          name: tmp
        - mountPath: /tmp/scan
          name: scanresult

chen-keinan commented 4 months ago

@sidpalas I'm cutting v0.21.2 today to include more logs around the root cause. Please test with it again once it is released, so we can have more context.

sidpalas commented 4 months ago

Thanks @chen-keinan 🙏

I just upgraded and now see the following ERROR in the trivy-operator pod logs:

2024-06-04T10:41:32Z    ERROR   reconciler.scan job     Scan job container      {"job": "trivy-system/scan-vulnerabilityreport-7b5bf85fb9", "container": "load-generator", "status.reason": "Error", "status.message": "2024-06-04T10:41:29Z\tFATAL\tFatal error\timage scan error: scan error: unable to initialize a scanner: unable to initialize an image scanner: unable to find the specified image \"docker.io/sidpalas/devops-directive-kubernetes-course-load-generator-python:foobarbaz\" in [\"docker\" \"containerd\" \"podman\" \"remote\"]: 4 errors occurred:\n\t* docker error: unable to inspect the image (docker.io/sidpalas/devops-directive-kubernetes-course-load-generator-python:foobarbaz): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\n\t* containerd error: containerd socket not found: /run/containerd/containerd.sock\n\t* podman error: unable to initialize Podman client: no podman socket found: stat podman/podman.sock: no such file or directory\n\t* remote error: GET https://index.docker.io/v2/sidpalas/devops-directive-kubernetes-course-load-generator-python/manifests/foobarbaz: UNAUTHORIZED: authentication required; [map[Action:pull Class: Name:sidpalas/devops-directive-kubernetes-course-load-generator-python Type:repository]]\n\n\n"}

Could it be related to the fact that I am running this on Civo cloud with k3s? Does Trivy assume a particular container runtime? (The error appears to try docker, then containerd, then podman.)

EDIT: I just tried the same configuration on a GKE cluster and got the same error:

2024-06-04T11:00:53Z    ERROR   reconciler.scan job     Scan job container      {"job": "trivy-system/scan-vulnerabilityreport-7b5bf85fb9", "container": "load-generator", "status.reason": "Error", "status.message": "2024-06-04T11:00:45Z\tFATAL\tFatal error\timage scan error: scan error: unable to initialize a scanner: unable to initialize an image scanner: unable to find the specified image \"docker.io/sidpalas/devops-directive-kubernetes-course-load-generator-python:foobarbaz\" in [\"docker\" \"containerd\" \"podman\" \"remote\"]: 4 errors occurred:\n\t* docker error: unable to inspect the image (docker.io/sidpalas/devops-directive-kubernetes-course-load-generator-python:foobarbaz): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\n\t* containerd error: containerd socket not found: /run/containerd/containerd.sock\n\t* podman error: unable to initialize Podman client: no podman socket found: stat podman/podman.sock: no such file or directory\n\t* remote error: GET https://index.docker.io/v2/sidpalas/devops-directive-kubernetes-course-load-generator-python/manifests/foobarbaz: UNAUTHORIZED: authentication required; [map[Action:pull Class: Name:sidpalas/devops-directive-kubernetes-course-load-generator-python Type:repository]]\n\n\n"}
chen-keinan commented 4 months ago

@sidpalas have you followed all our options for private registries?

sidpalas commented 4 months ago

> @sidpalas have you followed all our options for private registries?

I am using a kubernetes.io/dockerconfigjson imagePullSecret referenced in the pod spec.

I have not tried filesystem scanning or an imagePullSecret referenced by a ServiceAccount.

chen-keinan commented 4 months ago

You mean you do not see the env var mapped to the secret? Can you please share how you created the secret? Can you write an example of the command here?

sidpalas commented 4 months ago

> you mean you do not see the envvar mapped to secret ?

Yes. Based on the docs and the code in trivy-operator/pkg/plugins/trivy/image.go, I expected Trivy to generate a secret in the trivy-system namespace and then use it to set the TRIVY_USERNAME and TRIVY_PASSWORD env vars in the job it creates to do the scanning. Is that generally correct? (I could be misunderstanding the code.)


The command I used to create my image pull secret (in the namespace of the application) was:

kubectl create secret -n demo-app docker-registry dockerconfigjson \
  --docker-email=${DOCKER_EMAIL} \
  --docker-username=${DOCKER_USERNAME} \
  --docker-password=${DOCKER_PASSWORD} \
  --docker-server=docker.io

I am confident the imagePullSecret is correct because the pods corresponding to the deployment with the private image are able to pull the image successfully.

Does that provide the info you were looking for?

chen-keinan commented 4 months ago

@sidpalas I see that you created the secret in the demo-app namespace, but the workload descriptor example you posted above is deployed to the default namespace.

sidpalas commented 4 months ago

Sorry, I stripped out the namespace (and other things such as env vars) from the deployment in my initial issue since it didn't seem relevant at the time. The deployment and the image pull secret are both in the demo-app namespace.

chen-keinan commented 4 months ago

@sidpalas You mean you have confirmed that it does not work when both the secret and the deployment are in the same namespace?

Note: you can use the fourth option in the docs, "Define Secrets through Trivy-Operator configuration", and specify a secret in a different namespace.

sidpalas commented 4 months ago

Correct. The setup I have is:

The resulting behavior I'm seeing:


I did try the fourth option (specifying operator.privateRegistryScanSecretNames in my helm values file referencing my image pull secret) but the scan job still failed.

chen-keinan commented 4 months ago

@sidpalas Strange, trivy-operator has e2e tests for private registries. Is there any additional info you can share about your environment (on-prem, cloud, or something else) that could help?

sidpalas commented 4 months ago

It is strange!

My setup is pretty plain. I initially was using a managed cluster on Civo cloud, but tested with GKE and saw the same result.

I took a look at the end to end test you linked. The pod spec (https://github.com/aquasecurity/trivy-operator/blob/b83c17837c80f043b86321cf056ee17ada4a08b7/tests/e2e/image-private-registries-config/workload/00-pod.yaml) doesn't actually contain an imagePullSecret. Should it?

chen-keinan commented 4 months ago

@sidpalas I'm happy to debug it over a call with you. Find me on the trivy-operator channel on the Aqua Security Slack and we can schedule something, if you want, of course.

sidpalas commented 4 months ago

Thank you for the offer! (and for all the time spent helping me so far!).

I'll distill my example into its simplest form (essentially duplicating the e2e test but with my own private repo) and see if I can get it working. If not, I'll hop into the aquasec slack! 🙏

chen-keinan commented 4 months ago

> Thank you for the offer! (and for all the time spent helping me so far!).
>
> I'll distill my example into its simplest form (essentially duplicating the e2e test but with my own private repo) and see if I can get it working. If not, I'll hop into the aquasec slack! 🙏

Sure, feel free to ping me. In any case, I'll give the private registry another manual test.

sidpalas commented 4 months ago

Was able to replicate the failure in a KinD cluster (v1.30.0):

Secret created with:

    kubectl create secret docker-registry artcred \
        --docker-email=${EMAIL} \
        --docker-username=${USERNAME} \
        --docker-password=${PASSWORD} \
        --docker-server=docker.io

Trivy installed with:

    helm upgrade --install trivy-operator aqua/trivy-operator \
        --namespace trivy-system \
        --create-namespace \
        --version 0.23.2

Test pods:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
    - name: mycontainer
      image: docker.io/sidpalas/devops-directive-kubernetes-course-load-generator-python:foobarbaz
      imagePullPolicy: Always
      command: ["/bin/sh"]
      args:
        - -c
        - sleep 9999
  imagePullSecrets:
    - name: artcred
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: private-reg
imagePullSecrets:
  - name: artcred
---
apiVersion: v1
kind: Pod
metadata:
  name: mypod-sa
spec:
  containers:
    - name: mycontainer
      image: docker.io/sidpalas/devops-directive-kubernetes-course-load-generator-python:foobarbaz
      imagePullPolicy: Always
      command: ["/bin/sh"]
      args:
        - -c
        - sleep 9999
  serviceAccountName: private-reg
---

As before, the pods using the private image successfully pull and run it, but the Trivy scan jobs fail.

chen-keinan commented 4 months ago

@sidpalas Thanks for the update. I will run a quick manual test tomorrow and update you.

sidpalas commented 4 months ago

Thank you!

Also, just as another data point:

Initially I was using a token with read-only permissions on Docker Hub, but I then created a new one with read/write permissions. Both produced the same failure.

sidpalas commented 4 months ago

To further reduce the delta between the e2e test setup and mine I also tested with:

I'm still seeing the same result though. Really stumped as to what could be the root cause here... 🤔

sidpalas commented 4 months ago

Ah, I think the issue was with my --docker-server in the imagePullSecret!

I had used:

--docker-server=docker.io

But after changing it to:

--docker-server=https://index.docker.io/v1

The jobs succeeded and the vulnerability scan reports show up!

Interesting that it works fine with docker.io for k8s pulling the image. Presumably Trivy was being stricter about the requirement for it to be a fully qualified domain name and discarding my secret as invalid.

sidpalas commented 4 months ago

This is likely the culprit!

https://github.com/aquasecurity/trivy-operator/blob/b83c17837c80f043b86321cf056ee17ada4a08b7/pkg/docker/config.go#L126-L142

GetServerFromDockerAuthKey("docker.io") => docker.io

vs.

GetServerFromDockerAuthKey("https://index.docker.io/v1") => index.docker.io
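To make the mismatch concrete, here is a minimal Go sketch of the kind of normalization described above. It is not the operator's actual code: `serverFromAuthKey` is a hypothetical stand-in that assumes the helper only adds a scheme when one is missing and then returns the URL host. Under that assumption, a secret keyed by `docker.io` yields a host that differs from `index.docker.io`, which is the host the scan job's remote error shows Trivy contacting.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// serverFromAuthKey is a hypothetical stand-in for the helper discussed above.
// It adds a scheme when the key has none, parses the key as a URL, and
// returns only the host part. No Docker Hub alias handling is performed.
func serverFromAuthKey(key string) (string, error) {
	if !strings.HasPrefix(key, "https://") && !strings.HasPrefix(key, "http://") {
		key = "https://" + key
	}
	parsed, err := url.Parse(key)
	if err != nil {
		return "", err
	}
	return parsed.Host, nil
}

func main() {
	// The two --docker-server values tried in this issue.
	for _, key := range []string{"docker.io", "https://index.docker.io/v1"} {
		server, err := serverFromAuthKey(key)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s => %s\n", key, server)
	}
}
```

If the secret's host is then compared by exact string match against the image's registry host (an assumption here), only the second form would match.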


It looks like this is something that may be special cased in upstream kubernetes so that docker.io properly resolves to index.docker.io 😳: https://github.com/kubernetes/kubernetes/blob/9e2075b3c87061d25759b0ad112266f03601afd8/pkg/credentialprovider/keyring.go#L128-L158

@chen-keinan -- Do you think it's worth special casing in trivy-operator so the behavior matches Kubernetes (or adding a warning to the docs)?
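For illustration only, the special casing could look something like the Go sketch below. The names `canonicalRegistry` and `sameRegistry` are hypothetical and exist in neither trivy-operator nor Kubernetes. The sketch assumes that treating `docker.io` as an alias of `index.docker.io` before comparing hosts would be enough for this case.

```go
package main

import "fmt"

// dockerHubHost is the host that Docker Hub image references resolve to
// in the error output earlier in this issue.
const dockerHubHost = "index.docker.io"

// canonicalRegistry is a hypothetical helper that maps the Docker Hub
// shorthand to one canonical host and leaves every other host unchanged.
func canonicalRegistry(server string) string {
	switch server {
	case "docker.io", "index.docker.io":
		return dockerHubHost
	default:
		return server
	}
}

// sameRegistry reports whether a secret's server and an image's registry
// refer to the same registry once both are canonicalized.
func sameRegistry(secretServer, imageServer string) bool {
	return canonicalRegistry(secretServer) == canonicalRegistry(imageServer)
}

func main() {
	fmt.Println(sameRegistry("docker.io", "index.docker.io")) // shorthand matches
	fmt.Println(sameRegistry("ghcr.io", "index.docker.io"))   // other registries do not
}
```

Canonicalizing both sides keeps the comparison symmetric, so a secret keyed by either form would match an image reference written in either form.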

chen-keinan commented 4 months ago

@sidpalas Thanks for the update. Yes, I think it is worth updating the docs or supporting this in the code to avoid such issues in the future. Feel free to raise a PR if you have time.

chen-keinan commented 4 months ago

@sidpalas Do you want to keep this issue open?

sidpalas commented 4 months ago

> @sidpalas do you want to keep this issue open ?

I'd say we can either close it, or I can update the title/description to reflect the narrower scope of the issue (i.e. only impacting image pull secrets that use the shorthand registry name for docker.io).

Do you have a preference? I'm fine with either!

chen-keinan commented 4 months ago

Thanks for the update. Closing the issue.