aquasecurity / trivy-operator

Kubernetes-native security toolkit
https://aquasecurity.github.io/trivy-operator/latest
Apache License 2.0

trivy-operator: Pull Images via CRI #101

Open isugimpy opened 2 years ago

isugimpy commented 2 years ago

Reached out via Twitter and was redirected here.

From my understanding, the operator is pulling container images based on the image path it scrapes from live objects on the cluster. However, it's pulling them directly, bypassing configuration on the host. Personally, I'd like the option to utilize the CRI on the actual host so that we can be sure the image that gets pulled actually matches what's running for real.

To cite a specific example, the clusters I manage are running CRI-O, and utilize the registry mirror functionality to do rewrites of the image pulls live. This allows us to attempt to pull from a cluster-local image cache first, with fallback to our canonical registry if that fails. Additionally, we use the rewrites to ensure that a pull from something like Docker Hub is forcefully redirected to pull through our canonical registry so we can cache and scan images. By having Trivy respect the behavior of the node it runs on, we can better ensure that the images it scans are accurately pulled.
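For readers unfamiliar with this setup, a mirror rewrite of the kind described is configured in CRI-O's registries.conf. A minimal sketch, with placeholder hostnames (the real cache and registry names are not given in this thread): pulls for docker.io try the cluster-local cache first, then fall back to the canonical registry.

```toml
# Illustrative /etc/containers/registries.conf fragment.
# Hostnames are placeholders, not the cluster's actual configuration.
[[registry]]
prefix = "docker.io"
location = "docker.io"

# Mirrors are tried in order before the canonical location.
[[registry.mirror]]
location = "cache.cluster.local/dockerhub"

[[registry.mirror]]
location = "registry.example.com/dockerhub"
```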

itaysk commented 2 years ago

Thanks for opening an issue, @isugimpy. Trivy tries to scan images from the local image store if available, and if not it will pull them from the registry. The interaction with the local image store is similar to docker export. Today Trivy can do this local scanning with Docker and (recently) containerd. We looked into interacting with CRI, but it seems that CRI doesn't have an export-equivalent method [1] [2]. Instead, we could look into supporting CRI-O directly if that makes sense, but then you would have to mount the socket into the Trivy scanner pod for it to work. Another approach, suggested by @erikgb in #118, is to still pull images but at least be more accurate by taking the image digest from the running Pod status.

isugimpy commented 2 years ago

If it ensures accuracy, having a single pod that mounts in the socket from its host doesn't seem unreasonable at all. Would love to see that. Not being able to ensure that we can pull and scan the images from the same host that they'd be pulled from to be actually run is, in my opinion, a blocker to being able to use the operator in a live environment.

erikgb commented 2 years ago

If it ensures accuracy, having a single pod that mounts in the socket from its host doesn't seem unreasonable at all. Would love to see that. Not being able to ensure that we can pull and scan the images from the same host that they'd be pulled from to be actually run is, in my opinion, a blocker to being able to use the operator in a live environment.

I agree, but image name + image digest is the canonical reference to an image. So when using the digest, you don't need to mount the host socket. At least we should have an option to NOT mount anything from the host.

isugimpy commented 2 years ago

You're absolutely correct that the combo is canonical, but if there's no other way to resolve that canonical reference due to a rewrite, it'd leave things in undefined behavior territory. Your point makes a ton of sense in the 90+% case. I just worry about it for the smaller percentage of users that take advantage of this feature (containerd also offers the rewrite capability, so this isn't limited specifically to CRI-O).

itaysk commented 2 years ago

I just tried to run nginx from ECR to see how the Pod status looks, and it has the following:

```yaml
containerID: containerd://366a98117285434b8869ff692a020a136b01b7f6444aaa6e83a158ded431584b
image: public.ecr.aws/nginx/nginx:stable
imageID: public.ecr.aws/nginx/nginx@sha256:76503f7e2c5cf0910640e70264588f264016acdcdb91c99bc7b007efa971c708
```

Seems like if a rewrite was happening, the image/imageID registry reference would reflect the rewritten one. Is that not the case? (Could you confirm, @isugimpy?)

itaysk commented 2 years ago

Also I wanted to point out a couple of additional options to address the original question:

  1. There's an option to specify the mirror as configuration (somehow lost in the docs):

    Mirror for the registry, e.g. trivy.registry.mirror.index.docker.io: mirror.io would use mirror.io to get images originated from index.docker.io

  2. There's an experimental (also undocumented) feature to scan workloads by scanning the running container filesystem (vs. the container image) using the trivy.command = filesystem configuration. It's currently broken due to https://github.com/aquasecurity/trivy-operator/issues/49, but I mention it here to hear what people think about this approach.
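A hedged sketch of what setting these two options might look like in the operator's trivy ConfigMap. The key names are the ones quoted above; the ConfigMap name, namespace, and mirror hostname are assumptions for illustration:

```yaml
# Illustrative fragment of the trivy-operator "trivy" ConfigMap.
# Only the two data keys come from the comment above; the metadata
# names and mirror.io hostname are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: trivy-operator-trivy-config
  namespace: trivy-system
data:
  # Option 1: resolve images originating from index.docker.io via a mirror
  trivy.registry.mirror.index.docker.io: mirror.io
  # Option 2 (experimental): scan the running container filesystem
  trivy.command: filesystem
```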

isugimpy commented 2 years ago

Seems like if a rewrite was happening, the image/imageID registry reference would reflect the rewritten one. Is that not the case? (Could you confirm, @isugimpy?)

From one of my clusters:

```yaml
- containerID: cri-o://defd9a0b99597b814d7f3dc74c125cbdb918d585da43a60cbfbcfafa1658365a
  image: quay.io/superq/chrony-exporter:v0.1.0
  imageID: quay.io/superq/chrony-exporter@sha256:30b6f76b72fe11e769ee3e18d5468d8fbb9a9467640689258d7100eee81db18b
```

I can confirm with 100% certainty that despite this pointing at quay.io, it was pulled from our internal passthrough cache via mirror rewrites at runtime. CRI-O (and containerd as I recall) doesn't present the result of the rewrite to the Kubelet.

  1. There's an option to specify the mirror as configuration (somehow lost in the docs):

This definitely helps the cause. It's not ideal to have to duplicate the mirror config, but it definitely would work. The one bummer is that it doesn't have fallback to a secondary mirror available. That's hardly a dealbreaker though. If the cluster-local cache is unavailable for some reason, I've got bigger problems than if Trivy is scanning.

  2. There's an experimental (also undocumented) feature to scan workloads by scanning the running container filesystem (vs. the container image) using the trivy.command = filesystem configuration. It's currently broken due to FS scanning doesn't work with Trivy version >= 0.23.0 #49, but I mention it here to hear what people think about this approach.

Filesystem could work, it's an intriguing approach! Wouldn't that require a daemonset though, since not every image is present on every host?

itaysk commented 2 years ago

I can confirm with 100% certainty that despite this pointing at quay.io, it was pulled from our internal passthrough cache via mirror rewrites at runtime. CRI-O (and containerd as I recall) doesn't present the result of the rewrite to the Kubelet.

Thanks for checking. For our understanding: to configure CRI-O for mirroring, are you using this configuration? https://github.com/containers/image/blob/main/registries.conf#L68

itaysk commented 2 years ago

Filesystem could work, it's an intriguing approach! Wouldn't that require a daemonset though, since not every image is present on every host?

This feature is meant to be used with the vulnerabilityReports.scanJobsInSameNamespace configuration, which makes the scan job run in the namespace of the scanned workload instead of in the dedicated trivy-operator namespace. Running in the target namespace is presumed to allow starting a new workload (the scan job) with the scanned image, permissions-wise.
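A sketch of enabling that setting in the operator's ConfigMap. Only the key name comes from the comment above; the ConfigMap name and namespace are assumptions:

```yaml
# Illustrative operator ConfigMap fragment: run scan jobs in the
# scanned workload's namespace. Metadata names are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: trivy-operator
  namespace: trivy-system
data:
  vulnerabilityReports.scanJobsInSameNamespace: "true"
```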

The feature originally suggested in this issue - scanning the image from the local store by mounting the socket into the Trivy container - has a stricter requirement: scheduling the scan job on the same node as the scanned workload.

chen-keinan commented 2 years ago

@isugimpy related PR #150 that solves the filesystem scanning issue

chen-keinan commented 2 years ago

@isugimpy will trivy-operator with mode: fileSystem scanning solve this issue?

erikgb commented 2 years ago

@isugimpy will trivy-operator with mode: fileSystem scanning solve this issue?

No, at least not for us, as the OpenShift SCC does not allow an unprivileged user to run containers as the root user. Ref.:

https://github.com/aquasecurity/trivy-operator/blob/5b9498d9d9fe186fb4861e69b161c9af87d61b38/deploy/helm/values.yaml#L161-L162

JAORMX commented 2 years ago

@erikgb trivy-operator should not run with the restricted SCC anyway; it should use at least anyuid. If it's running with a low-privileged SCC, that should be fixed in the OLM packaging.

erikgb commented 2 years ago

@erikgb trivy-operator should not run with the restricted SCC anyway; it should use at least anyuid. If it's running with a low-privileged SCC, that should be fixed in the OLM packaging.

But it is not the operator that is the problem, but the jobs it schedules.

JAORMX commented 2 years ago

@erikgb I see... well, adding the needed annotation to schedule them with a higher-privilege SCC shouldn't be too problematic. But either way, permissions to use that SCC would need to be added to the packaging (it's an RBAC rule).
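On OpenShift, the RBAC rule mentioned is granted via the use verb on the securitycontextconstraints resource. A minimal sketch, where the role name (and the choice to bind it to scan-job service accounts) is an assumption for illustration:

```yaml
# Hedged sketch: ClusterRole allowing use of the anyuid SCC.
# The API group, resource, and verb are the standard OpenShift pattern;
# the role name is a placeholder. A (Cluster)RoleBinding to the scan
# jobs' service account would still be needed.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: trivy-scan-jobs-anyuid
rules:
  - apiGroups: ["security.openshift.io"]
    resources: ["securitycontextconstraints"]
    resourceNames: ["anyuid"]
    verbs: ["use"]
```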

avishefi commented 1 year ago

Related discussion: https://github.com/aquasecurity/trivy/discussions/4957