argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Executor tries to use imagePullSecrets to pull a container image even if anonymous pull is enabled #9802

Open vitalyrychkov opened 2 years ago

vitalyrychkov commented 2 years ago

What happened/what you expected to happen?

Kubernetes pulls the same images without using imagePullSecrets if anonymous pull is allowed. The Argo executor should likewise pull a container image (to look up its cmd/args value) without using imagePullSecrets if anonymous pull is enabled.

Version

3.4.1

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

The issue is related to private images.

Logs from the workflow controller

kubectl logs -n argo deploy/argo-helm-argo-workflows-workflow-controller | grep acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.499Z" level=info msg="Processing workflow" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.589Z" level=info msg="Updated phase  -> Running" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.589Z" level=info msg="DAG node acm-adhoc-bps-db-version-1665584107 initialized Running" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.590Z" level=info msg="All of node acm-adhoc-bps-db-version-1665584107.db-version dependencies [] completed" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.591Z" level=info msg="DAG node acm-adhoc-bps-db-version-1665584107-3268391737 initialized Running" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.592Z" level=info msg="All of node acm-adhoc-bps-db-version-1665584107.db-version.db-version-task dependencies [] completed" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.594Z" level=info msg="Pod node acm-adhoc-bps-db-version-1665584107-1331475456 initialized Pending" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.605Z" level=error msg="Mark error node" error="failed to look-up entrypoint/cmd for image \"artifacts-scm.dstcorp.net/algo-docker/acm/bps:acm-5.5.3-00\", you must either explicitly specify the command, or list the image's command in the index: https://argoproj.github.io/argo-workflows/workflow-executors/#emissary-emissary: secrets \"acm-registry-creds\" not found" namespace=acmtmp nodeName=acm-adhoc-bps-db-version-1665584107.db-version.db-version-task workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.605Z" level=info msg="node acm-adhoc-bps-db-version-1665584107-1331475456 phase Pending -> Error" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.605Z" level=info msg="node acm-adhoc-bps-db-version-1665584107-1331475456 message: failed to look-up entrypoint/cmd for image \"artifacts-scm.dstcorp.net/algo-docker/acm/bps:acm-5.5.3-00\", you must either explicitly specify the command, or list the image's command in the index: https://argoproj.github.io/argo-workflows/workflow-executors/#emissary-emissary: secrets \"acm-registry-creds\" not found" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.605Z" level=info msg="node acm-adhoc-bps-db-version-1665584107-1331475456 finished: 2022-10-12 16:17:29.605737508 +0000 UTC" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.605Z" level=error msg="Mark error node" error="task 'acm-adhoc-bps-db-version-1665584107.db-version.db-version-task' errored: failed to look-up entrypoint/cmd for image \"artifacts-scm.dstcorp.net/algo-docker/acm/bps:acm-5.5.3-00\", you must either explicitly specify the command, or list the image's command in the index: https://argoproj.github.io/argo-workflows/workflow-executors/#emissary-emissary: secrets \"acm-registry-creds\" not found" namespace=acmtmp nodeName=acm-adhoc-bps-db-version-1665584107.db-version.db-version-task workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.605Z" level=info msg="node acm-adhoc-bps-db-version-1665584107-1331475456 message: task 'acm-adhoc-bps-db-version-1665584107.db-version.db-version-task' errored: failed to look-up entrypoint/cmd for image \"artifacts-scm.dstcorp.net/algo-docker/acm/bps:acm-5.5.3-00\", you must either explicitly specify the command, or list the image's command in the index: https://argoproj.github.io/argo-workflows/workflow-executors/#emissary-emissary: secrets \"acm-registry-creds\" not found" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.608Z" level=info msg="Outbound nodes of acm-adhoc-bps-db-version-1665584107-3268391737 set to [acm-adhoc-bps-db-version-1665584107-1331475456]" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.608Z" level=info msg="node acm-adhoc-bps-db-version-1665584107-3268391737 phase Running -> Error" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.608Z" level=info msg="node acm-adhoc-bps-db-version-1665584107-3268391737 finished: 2022-10-12 16:17:29.608557778 +0000 UTC" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.608Z" level=info msg="Checking daemoned children of acm-adhoc-bps-db-version-1665584107-3268391737" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.611Z" level=info msg="Outbound nodes of acm-adhoc-bps-db-version-1665584107 set to [acm-adhoc-bps-db-version-1665584107-1331475456]" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.611Z" level=info msg="node acm-adhoc-bps-db-version-1665584107 phase Running -> Error" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.611Z" level=info msg="node acm-adhoc-bps-db-version-1665584107 finished: 2022-10-12 16:17:29.61131521 +0000 UTC" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.611Z" level=info msg="Checking daemoned children of acm-adhoc-bps-db-version-1665584107" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.611Z" level=info msg="TaskSet Reconciliation" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.611Z" level=info msg=reconcileAgentPod namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.611Z" level=info msg="Updated phase Running -> Error" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.611Z" level=info msg="Marking workflow completed" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.611Z" level=info msg="Marking workflow as pending archiving" namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.611Z" level=info msg="Checking daemoned children of " namespace=acmtmp workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.617Z" level=info msg="cleaning up pod" action=deletePod key=acmtmp/acm-adhoc-bps-db-version-1665584107-1340600742-agent/deletePod
time="2022-10-12T16:17:29.624Z" level=info msg="Workflow update successful" namespace=acmtmp phase=Error resourceVersion=104803467 workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.628Z" level=info msg="archiving workflow" namespace=acmtmp uid=6f8c2dca-17f4-4081-b564-b3f82720a28e workflow=acm-adhoc-bps-db-version-1665584107
time="2022-10-12T16:17:29.659Z" level=info msg="Queueing Error workflow acmtmp/acm-adhoc-bps-db-version-1665584107 for delete in 5m0s due to TTL"

Logs from your workflow's wait container

No wait logs are available; it seems Argo did not get that far.

sarabala1979 commented 1 year ago

@vitalyrychkov can you provide more details like failed PodSpec and your env setup? Is there a way to reproduce locally?

vitalyrychkov commented 1 year ago

@sarabala1979
Hi, thank you for your patience; it took some time because I tried to create a meaningful test case for you. I created a simple workflow and a server deployment based on the same image in my private Artifactory. The server, whose deployment defines an imagePullSecret, started fine in a pod. I did not create the referenced secret. Then I submitted the workflow based on the same image:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: cmdtest-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a test for image command and entrypoint
spec:
  entrypoint: cmdtest
  templates:
  - name: cmdtest
    container:
      image: 'artifacts.mycorp.net/docker/docserver:latest'

In the first run I got the same error, although I did not specify any imagePullSecret in the Workflow:

failed to look-up entrypoint/cmd for image "artifacts.mycorp.net/docker/docserver:latest", you must either explicitly specify the command, or list the image's command in the index: https://argoproj.github.io/argo-workflows/workflow-executors/#emissary-emissary: secrets "app-registry-creds" not found
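The error message itself names one way to sidestep the lookup: specify the command explicitly, so that the image's entrypoint does not have to be read from the registry. A minimal sketch of the same test workflow with a command set follows; the `/docserver` path is a hypothetical placeholder, since the real entrypoint of this image is not given in this thread.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: cmdtest-
spec:
  entrypoint: cmdtest
  templates:
  - name: cmdtest
    container:
      image: 'artifacts.mycorp.net/docker/docserver:latest'
      # Hypothetical command: replace with the image's actual entrypoint.
      # With a command given, the entrypoint/cmd lookup should not be needed.
      command: ["/docserver"]
```

This only avoids the entrypoint lookup; it does not explain why the lookup asks for a secret that the workflow never references.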

Then I tried with a non-existent version of my image: image: 'artifacts.mycorp.net/docker/hugo:1.2.3'
In this case Argo receives the corresponding message from the registry:

failed to look-up entrypoint/cmd for image "artifacts.mycorp.net/docker/hugo:1.2.3", you must either explicitly specify the command, or list the image's command in the index: https://argoproj.github.io/argo-workflows/workflow-executors/#emissary-emissary: GET https://artifacts.mycorp.net/v2/docker/hugo/manifests/1.2.3: MANIFEST_UNKNOWN: The named manifest is not known to the registry.; map[manifest:hugo/1.2.3/manifest.json]

I then reverted to latest and submitted with image: 'artifacts.mycorp.net/docker/docserver:latest'. All of a sudden Argo could download the image and run the container! I guess it could be some kind of caching of the first deployment's settings in Argo or the kubelet?

I will keep an eye on the issue and try to pin it down as soon as it recurs, or maybe someone else will report the same. For now I would appreciate it if someone could check whether the manifest pull code uses anything like a credentials cache.

Thank you

anhqqt commented 1 year ago

I hit the same issue: Image pull error: User "system:serviceaccount:argo-workflows:argo-workflows-controller" cannot get resource "secrets" in API group "" in the namespace "argo-workflows". Previously, the workflow ran normally on EKS 1.22.

But after I created another cluster on EKS 1.23 and installed the Argo Workflows Helm chart with the same values.yaml file, this issue appeared. Even if the Workflow resource is in the same namespace as the controller, the controller cannot read the imagePullSecret.

My workaround is to turn off controller.rbac.create and controller.serviceAccount.create in the Helm values file, manually create the ServiceAccount, ClusterRole (with get permission on Secrets), and ClusterRoleBinding, and put the SA name into controller.serviceAccount.name. The workflow controller pod then uses the SA I created and is able to read the imagePullSecret.
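As a rough sketch, the Helm values described above might look like the fragment below. The keys are the ones named in this comment; the ServiceAccount name `my-workflow-controller` is a hypothetical placeholder for the manually created account.

```yaml
# values.yaml fragment for the Argo Workflows Helm chart (sketch)
controller:
  rbac:
    # Do not let the chart create the controller's RBAC objects.
    create: false
  serviceAccount:
    # Do not let the chart create the ServiceAccount either.
    create: false
    # Hypothetical name of the manually created ServiceAccount.
    name: my-workflow-controller
```

With chart-managed RBAC disabled, the manually created ClusterRole presumably has to carry whatever other permissions the controller needs as well, not only get on Secrets.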

vitalyrychkov commented 1 year ago

Right, my workaround was the same; however, I did not have to disable the *.create values. I just added a ClusterRole that can read the secret with a specific name in all namespaces, and a ClusterRoleBinding to the workflow-controller's SA.
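A minimal sketch of such a ClusterRole and ClusterRoleBinding is shown below, under stated assumptions: the object name `workflow-controller-read-pull-secret` is made up for illustration, the secret name is taken from the controller logs earlier in this issue, and the ServiceAccount name is assumed to match the controller Deployment name seen in the `kubectl logs` command (it may differ in a given installation).

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: workflow-controller-read-pull-secret   # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    # Restrict access to the one pull secret, in whichever namespace it lives.
    resourceNames: ["acm-registry-creds"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: workflow-controller-read-pull-secret   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: workflow-controller-read-pull-secret
subjects:
  - kind: ServiceAccount
    # Assumption: the SA shares the controller Deployment's name.
    name: argo-helm-argo-workflows-workflow-controller
    namespace: argo
```

Using `resourceNames` keeps the grant narrow: the controller can get only the named secret rather than every secret in the cluster.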

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

noamokman commented 7 months ago

Hey, I just added this to the values YAML for the chart:

argo-workflows:
  controller:
    rbac:
      secretWhitelist:
        - image-pull-secret

Replace the name of the secret with any secret you may have. This works for me when pulling from ECR.