canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

`Scan images` job is failing with different error messages #1080

Closed DnPlas closed 1 month ago

DnPlas commented 1 month ago

Bug Description

The scheduled job for scanning images has failed for the last three runs, as seen here. At first sight the cause is not obvious. It is also strange that two of the three failing runs show a TOOMANYREQUESTS error, while the third shows Error response from daemon: Get "https://gcr.io/v2/knative-releases/knative.dev/eventing/cmd/controller/manifests/sha256:9a881d503404349afefadce4ab82358d2f7f1775a97edd6704c0e3f132674ced": EOF.

To Reproduce

This happens in CI; check the Actions > Scan images jobs for more details.

Environment

The CI that runs the Scan images job.

Relevant Log Output

# From two of the three failing runs

+ docker run -v /var/run/docker.sock:/var/run/docker.sock -v /home/ubuntu/actions-runner/_work/bundle-kubeflow/bundle-kubeflow:/home/ubuntu/actions-runner/_work/bundle-kubeflow/bundle-kubeflow -w /home/ubuntu/actions-runner/_work/bundle-kubeflow/bundle-kubeflow --name=scanner aquasec/trivy image --timeout 30m -f json -o trivy-reports/charmedkubeflow-persistenceagent-2-2-0-8af6d3c.json --ignore-unfixed charmedkubeflow/persistenceagent:2.2.0-8af6d3c
2024-09-20T02:00:18Z    INFO    [db] Need to update DB
2024-09-20T02:00:18Z    INFO    [db] Downloading DB...  repository="ghcr.io/aquasecurity/trivy-db:2"
2024-09-20T02:00:19Z    FATAL   Fatal error init error: DB error: failed to download vulnerability DB: database download error: oci download error: failed to fetch the layer: GET https://ghcr.io/v2/aquasecurity/trivy-db/blobs/sha256:07a258410a90caab5d530266a2fc5b669c9729c8a71e27f4cb3967b73f5c584b: TOOMANYREQUESTS: retry-after: 558.942µs, allowed: 44000/minute
Error: Process completed with exit code 1.

# From one of the three runs
+ docker pull gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:9a881d503404349afefadce4ab82358d2f7f1775a97edd6704c0e3f132674ced
Scan image gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:9a881d503404349afefadce4ab82358d2f7f1775a97edd6704c0e3f132674ced report in trivy-reports/gcr-io-knative-releases-knative-dev-eventing-cmd-controller@sha256-9a881d503404349afefadce4ab82358d2f7f1775a97edd6704c0e3f132674ced.json
Error response from daemon: Get "https://gcr.io/v2/knative-releases/knative.dev/eventing/cmd/controller/manifests/sha256:9a881d503404349afefadce4ab82358d2f7f1775a97edd6704c0e3f132674ced": EOF
Error: Process completed with exit code 1.
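The second failure mode above is a transient pull error (EOF from gcr.io), which usually succeeds on retry. As a sketch only, assuming the workflow shells out to `docker pull` directly, a small retry wrapper like the one below could paper over such blips; the `retry` helper name and backoff values are illustrative, not something that exists in this repo's CI.

```shell
# Hypothetical helper: retry a flaky command (e.g. `docker pull`) a few
# times with exponential backoff before giving up. Illustrative only.
retry() {
  attempts=$1; shift
  delay=1
  n=1
  while [ "$n" -le "$attempts" ]; do
    # Run the command; return immediately on success.
    "$@" && return 0
    echo "attempt $n/$attempts failed: $*" >&2
    sleep "$delay"
    delay=$((delay * 2))
    n=$((n + 1))
  done
  return 1
}

# Usage (illustrative):
# retry 3 docker pull gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:9a881d...
```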

Additional Context

No response

syncronize-issues-to-jira[bot] commented 1 month ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6285.

This message was autogenerated

kimwnasptd commented 1 month ago

My best guess is that this is because of the self-hosted runners, which I introduced in https://github.com/canonical/bundle-kubeflow/pull/1038 (see https://github.com/canonical/bundle-kubeflow/actions/runs/10951750159/workflow#L18).

Let me try reverting to the regular GH runners and see if the errors persist; we hadn't seen errors like these with them in the past.

DnPlas commented 1 month ago

After reverting the change to the runs-on field, I'm still running into the same issue. See here for an example.

EDIT:

I have found a relevant issue; it looks like the problem is with Trivy itself --> https://github.com/aquasecurity/trivy-action/issues/389 and https://github.com/orgs/community/discussions/139074
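For context, the workaround discussed in the trivy-action thread is to pull the vulnerability DB from a mirror instead of ghcr.io, which rate-limits anonymous pulls. A minimal sketch, assuming the public.ecr.aws mirrors mentioned in that thread are available (verify before relying on them):

```shell
# Point Trivy at mirror repositories for its DBs instead of ghcr.io.
# The mirror paths below are an assumption based on the linked thread.
export TRIVY_DB_REPOSITORY=public.ecr.aws/aquasecurity/trivy-db
export TRIVY_JAVA_DB_REPOSITORY=public.ecr.aws/aquasecurity/trivy-java-db

# Trivy reads these environment variables (equivalent to the
# --db-repository / --java-db-repository CLI flags), so the existing
# `docker run ... aquasec/trivy image ...` invocation would only need
# `-e TRIVY_DB_REPOSITORY -e TRIVY_JAVA_DB_REPOSITORY` added.
echo "DB repository: $TRIVY_DB_REPOSITORY"
echo "Java DB repository: $TRIVY_JAVA_DB_REPOSITORY"
```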

DnPlas commented 1 month ago

Re-opening, as the last scan also failed with this, this time because of the Java DB download. See here for the log.