argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Can't use Git commands that write to FS when default container user is not root #11376

Open aldeed opened 1 year ago

aldeed commented 1 year ago

What happened/what you expected to happen?

Setup: use a git input artifact together with an image whose default user is not root (the reproduction workflow below uses nodered/node-red).

As the container command, try to run any Git command that needs to write to the filesystem, for example git fetch. An error is logged and the container exits with code 128.

The initial error suggests running git config --global --add safe.directory /tmp/repo, but if you do that, you just get a different permission error.
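
To make the failure concrete, here is a sketch of what that attempt looks like in the container args (same setup as the repro below; the comments describe the behavior reported above):

      container:
        image: nodered/node-red:latest
        workingDir: /tmp/repo
        command:
          - /bin/bash
          - "-c"
        args:
          - |
            # The exception git asks for; note this writes to $HOME/.gitconfig,
            # so $HOME must be writable by the image's default user.
            git config --global --add safe.directory /tmp/repo
            # The dubious-ownership error is gone, but fetch still needs to
            # write under .git/, which is owned by root here, so it fails
            # with a different permission error.
            git fetch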

Here's the log output from running the reproduction workflow provided below:

/tmp/repo
total 1264
drwxr-xr-x   23 root     root          4096 Jul 17 14:01 .
drwxrwxrwt    1 root     root          4096 Jul 17 14:01 ..
-rw-r--r--    1 root     root           126 Jul 17 14:01 .clang-format
-rw-r--r--    1 root     root           378 Jul 17 14:01 .codecov.yml
drwxr-xr-x    2 root     root          4096 Jul 17 14:01 .devcontainer
-rw-r--r--    1 root     root           507 Jul 17 14:01 .dockerignore
drwxr-xr-x    4 root     root          4096 Jul 17 14:01 .git
-rw-r--r--    1 root     root           282 Jul 17 14:01 .gitattributes
drwxr-xr-x    4 root     root          4096 Jul 17 14:01 .github
-rw-r--r--    1 root     root          1130 Jul 17 14:01 .gitignore
-rw-r--r--    1 root     root          1273 Jul 17 14:01 .golangci.yml
-rw-r--r--    1 root     root            88 Jul 17 14:01 .markdownlint.yaml
-rw-r--r--    1 root     root           121 Jul 17 14:01 .mlc_config.json
-rw-r--r--    1 root     root          2096 Jul 17 14:01 .spelling
-rw-r--r--    1 root     root        822907 Jul 17 14:01 CHANGELOG.md
-rw-r--r--    1 root     root            60 Jul 17 14:01 CODEOWNERS
-rw-r--r--    1 root     root            50 Jul 17 14:01 CONTRIBUTING.md
-rw-r--r--    1 root     root          3304 Jul 17 14:01 Dockerfile
-rw-r--r--    1 root     root          2828 Jul 17 14:01 Dockerfile.windows
-rw-r--r--    1 root     root         11352 Jul 17 14:01 LICENSE
-rw-r--r--    1 root     root         29627 Jul 17 14:01 Makefile
-rw-r--r--    1 root     root           147 Jul 17 14:01 OWNERS
-rw-r--r--    1 root     root          7161 Jul 17 14:01 README.md
-rw-r--r--    1 root     root          1710 Jul 17 14:01 SECURITY.md
-rw-r--r--    1 root     root          8792 Jul 17 14:01 USERS.md
drwxr-xr-x    4 root     root          4096 Jul 17 14:01 api
drwxr-xr-x    5 root     root          4096 Jul 17 14:01 cmd
drwxr-xr-x    2 root     root          4096 Jul 17 14:01 community
drwxr-xr-x    2 root     root          4096 Jul 17 14:01 config
-rw-r--r--    1 root     root           178 Jul 17 14:01 cosign.pub
drwxr-xr-x    3 root     root          4096 Jul 17 14:01 dev
drwxr-xr-x    8 root     root          4096 Jul 17 14:01 docs
drwxr-xr-x    2 root     root          4096 Jul 17 14:01 errors
drwxr-xr-x    8 root     root         12288 Jul 17 14:01 examples
-rw-r--r--    1 root     root         13259 Jul 17 14:01 go.mod
-rw-r--r--    1 root     root        196675 Jul 17 14:01 go.sum
drwxr-xr-x    7 root     root          4096 Jul 17 14:01 hack
drwxr-xr-x    6 root     root          4096 Jul 17 14:01 manifests
-rw-r--r--    1 root     root          8712 Jul 17 14:01 mkdocs.yml
drwxr-xr-x    3 root     root          4096 Jul 17 14:01 persist
drwxr-xr-x    6 root     root          4096 Jul 17 14:01 pkg
drwxr-xr-x    4 root     root          4096 Jul 17 14:01 sdks
drwxr-xr-x   18 root     root          4096 Jul 17 14:01 server
-rw-r--r--    1 root     root          3174 Jul 17 14:01 tasks.yaml
drwxr-xr-x    5 root     root          4096 Jul 17 14:01 test
drwxr-xr-x    5 root     root          4096 Jul 17 14:01 ui
drwxr-xr-x   34 root     root          4096 Jul 17 14:01 util
-rw-r--r--    1 root     root          1867 Jul 17 14:01 version.go
drwxr-xr-x   21 root     root          4096 Jul 17 14:01 workflow
node-red
fatal: detected dubious ownership in repository at '/tmp/repo'
To add an exception for this directory, call:

    git config --global --add safe.directory /tmp/repo

Workaround

Setting the security context seems to work as a workaround:

  securityContext:
    runAsNonRoot: true
    runAsUser: 1000

When I do that, the repo directory is owned by that user.
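
For context, here is where that workaround sits — a sketch assuming it is set at the workflow spec level, which is presumably why the clone ends up owned by UID 1000:

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    spec:
      serviceAccountName: argo-service-account
      entrypoint: entrypoint
      # A pod-level securityContext applies to every container in the pod,
      # including the init container that clones the git artifact, so the
      # repo directory is created as UID 1000.
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      templates:
        # ... same entrypoint template as in the repro workflow below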

But something seems incorrect: the docs say "By default, all workflow pods run as root", yet when the image's Dockerfile sets USER node, whoami prints node and these errors happen.

Possible solutions

If possible, ensure that the repo is cloned by (and therefore owned by) the image's default user. Alternatively, if the intention is that images always run as root regardless of their default user, that does not seem to be happening.

And securityContext does not seem to have a runAsRoot option, so there would be no way to run as root when root is not the image's default user.

Version

v3.4.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: repro
  namespace: repro
spec:
  serviceAccountName: argo-service-account
  entrypoint: entrypoint
  templates:
    - name: entrypoint
      inputs:
        artifacts:
          - name: repo
            path: /tmp/repo
            git:
              repo: https://github.com/argoproj/argo-workflows.git
              revision: master
      container:
        image: nodered/node-red:latest
        workingDir: "{{ inputs.artifacts.repo.path }}"
        command:
          - /bin/bash
          - "-c"
        args:
          - |
            pwd
            ls -al
            whoami
            git fetch

Logs from the workflow controller

Defaulted container "controller" out of: controller, csql-argo-workflows-argo-workflows-auth-proxy-workload
time="2023-07-17T14:01:26.681Z" level=info msg="Processing workflow" namespace=previews workflow=repro6
time="2023-07-17T14:01:26.686Z" level=info msg="Updated phase  -> Running" namespace=previews workflow=repro6
time="2023-07-17T14:01:26.686Z" level=info msg="Pod node repro6 initialized Pending" namespace=previews workflow=repro6
time="2023-07-17T14:01:26.769Z" level=info msg="Created pod: repro6 (repro6)" namespace=previews workflow=repro6
time="2023-07-17T14:01:26.769Z" level=info msg="TaskSet Reconciliation" namespace=previews workflow=repro6
time="2023-07-17T14:01:26.769Z" level=info msg=reconcileAgentPod namespace=previews workflow=repro6
time="2023-07-17T14:01:26.777Z" level=info msg="Workflow update successful" namespace=previews phase=Running resourceVersion=419297989 workflow=repro
time="2023-07-17T14:01:36.770Z" level=info msg="Processing workflow" namespace=previews workflow=repro6
time="2023-07-17T14:01:36.770Z" level=info msg="Task-result reconciliation" namespace=previews numObjs=0 workflow=repro6
time="2023-07-17T14:01:36.770Z" level=info msg="node changed" namespace=previews new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=repro6 old.message= old.phase=Pending old.progress=0/1 workflow=repro6
time="2023-07-17T14:01:36.771Z" level=info msg="TaskSet Reconciliation" namespace=previews workflow=repro6
time="2023-07-17T14:01:36.771Z" level=info msg=reconcileAgentPod namespace=previews workflow=repro6
time="2023-07-17T14:01:36.780Z" level=info msg="Workflow update successful" namespace=previews phase=Running resourceVersion=419298140 workflow=repro
time="2023-07-17T14:01:46.781Z" level=info msg="Processing workflow" namespace=previews workflow=repro6
time="2023-07-17T14:01:46.782Z" level=info msg="Task-result reconciliation" namespace=previews numObjs=0 workflow=repro6
time="2023-07-17T14:01:46.782Z" level=info msg="node unchanged" namespace=previews nodeID=repro6 workflow=repro6
time="2023-07-17T14:01:46.782Z" level=info msg="TaskSet Reconciliation" namespace=previews workflow=repro6
time="2023-07-17T14:01:46.782Z" level=info msg=reconcileAgentPod namespace=previews workflow=repro6
time="2023-07-17T14:01:59.300Z" level=info msg="Processing workflow" namespace=previews workflow=repro6
time="2023-07-17T14:01:59.300Z" level=info msg="Task-result reconciliation" namespace=previews numObjs=1 workflow=repro6
time="2023-07-17T14:01:59.300Z" level=info msg="task-result changed" namespace=previews nodeID=repro6 workflow=repro6
time="2023-07-17T14:01:59.300Z" level=info msg="node changed" namespace=previews new.message= new.phase=Running new.progress=0/1 nodeID=repro6 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=repro6
time="2023-07-17T14:01:59.300Z" level=info msg="TaskSet Reconciliation" namespace=previews workflow=repro6
time="2023-07-17T14:01:59.300Z" level=info msg=reconcileAgentPod namespace=previews workflow=repro6
time="2023-07-17T14:01:59.310Z" level=info msg="Workflow update successful" namespace=previews phase=Running resourceVersion=419298494 workflow=repro
time="2023-07-17T14:02:09.311Z" level=info msg="Processing workflow" namespace=previews workflow=repro6
time="2023-07-17T14:02:09.311Z" level=info msg="Task-result reconciliation" namespace=previews numObjs=1 workflow=repro6
time="2023-07-17T14:02:09.311Z" level=info msg="Pod failed: Error (exit code 128)" displayName=repro6 namespace=previews pod=repro6 templateName=entrypoint workflow=repro6
time="2023-07-17T14:02:09.311Z" level=info msg="node changed" namespace=previews new.message="Error (exit code 128)" new.phase=Failed new.progress=0/1 nodeID=repro6 old.message= old.phase=Running old.progress=0/1 workflow=repro6
time="2023-07-17T14:02:09.312Z" level=info msg="TaskSet Reconciliation" namespace=previews workflow=repro6
time="2023-07-17T14:02:09.312Z" level=info msg=reconcileAgentPod namespace=previews workflow=repro6
time="2023-07-17T14:02:09.312Z" level=info msg="Updated phase Running -> Failed" namespace=previews workflow=repro6
time="2023-07-17T14:02:09.312Z" level=info msg="Updated message  -> Error (exit code 128)" namespace=previews workflow=repro6
time="2023-07-17T14:02:09.312Z" level=info msg="Marking workflow completed" namespace=previews workflow=repro6
time="2023-07-17T14:02:09.312Z" level=info msg="Doesn't match with archive label selector. Skipping Archive" namespace=previews workflow=repro6
time="2023-07-17T14:02:09.312Z" level=info msg="Checking daemoned children of " namespace=previews workflow=repro6
time="2023-07-17T14:02:09.317Z" level=info msg="cleaning up pod" action=deletePod key=previews/repro6-1340600742-agent/deletePod
time="2023-07-17T14:02:09.323Z" level=info msg="Workflow update successful" namespace=previews phase=Failed resourceVersion=419298671 workflow=repro6
time="2023-07-17T14:02:09.928Z" level=info msg="cleaning up pod" action=labelPodCompleted key=previews/repro6/labelPodCompleted

Logs from your workflow's wait container

time="2023-07-17T14:01:49 UTC" level=info msg="Starting Workflow Executor" version=v3.4.8
time="2023-07-17T14:01:49 UTC" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-07-17T14:01:49 UTC" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC"
time="2023-07-17T14:01:49 UTC" level=info msg="Starting deadline monitor"
time="2023-07-17T14:01:58 UTC" level=info msg="Main container completed" error="<nil>"
time="2023-07-17T14:01:58 UTC" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-07-17T14:01:58 UTC" level=info msg="No output parameters"
time="2023-07-17T14:01:58 UTC" level=info msg="No output artifacts"
time="2023-07-17T14:01:58 UTC" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: stage/2023/07/17/repro6/repro6/main.log"
time="2023-07-17T14:01:58 UTC" level=info msg="Creating minio client using static credentials" endpoint=s3.amazonaws.com
time="2023-07-17T14:01:58 UTC" level=info msg="Saving file to s3" bucket=REDACTED endpoint=s3.amazonaws.com key=stage/2023/07/
17/repro6/repro6/main.log path=/tmp/argo/outputs/logs/main.log
time="2023-07-17T14:01:59 UTC" level=info msg="Save artifact" artifactName=main-logs duration=684.427591ms error="<nil>" key=stage/2023/07/
17/repro6/repro6/main.log
time="2023-07-17T14:01:59 UTC" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-07-17T14:01:59 UTC" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-07-17T14:01:59 UTC" level=info msg="Create workflowtaskresults 201"
time="2023-07-17T14:01:59 UTC" level=info msg="Alloc=8024 TotalAlloc=19973 Sys=32893 NumGC=6 Goroutines=10"
time="2023-07-17T14:01:59 UTC" level=info msg="Deadline monitor stopped"
aldeed commented 1 year ago

Note that the workaround works for HTTPS Git repos, but with SSH there is still a "Host key verification failed" error.
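
For the SSH case, the git artifact type does expose fields for supplying an SSH key and for skipping host-key verification, which may help here. A sketch, where the secret name git-ssh-key and key ssh-private-key are hypothetical:

        artifacts:
          - name: repo
            path: /tmp/repo
            git:
              repo: git@github.com:argoproj/argo-workflows.git
              revision: master
              sshPrivateKeySecret:
                name: git-ssh-key       # hypothetical secret holding a deploy key
                key: ssh-private-key    # hypothetical key within that secret
              # Skips known_hosts checking; avoids "Host key verification
              # failed" at the cost of trusting any host key.
              insecureIgnoreHostKey: true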

JPZ13 commented 1 year ago

@tico24 - could you chime in on this one? I'm a little lost on the expected behavior when a Docker image declares a non-root user and how that should interact with a workflow.

tico24 commented 1 year ago

@JPZ13 not really much to add? I'm guessing the inbuilt git artifact thing silently requires a root user and it shouldn't.

aldeed commented 1 year ago

@JPZ13 @tico24 @terrytangyuan It's hard for me to know what the real issue is here because I don't know what is correct and what is wrong, but if it's helpful, this issue could also be described like this:

Your docs say that all containers run as root, but actually they don't if the image has a non-root default user.

So there are two potential ways to fix it:

If all containers should run as root:

If all containers should not run as root:

- Remove that statement from the docs
- Fix the Git feature so that it clones the repo using the same default user, OR provide an option to force it to run as root

Edit:

By "non-root default user" I mean a Dockerfile like this:

FROM node:18-bullseye-slim

USER node
terrytangyuan commented 1 year ago

> Remove that statement from the docs

We should probably remove the misleading statement. It seems inaccurate especially now that we only have the emissary executor.

> Fix the Git feature so that it clones the repo using the same default user, OR provide an option to force it to run as root

@JPZ13 @caelan-io @rohankmr414 @weafscast Any update on https://github.com/argoproj/argo-workflows/pull/11149? We should move back to upstream go-git and check if the newest version already has an option for this.

caelan-io commented 1 year ago

Agreed. Great to see go-git is maintained again.

We paused our efforts on go-git as we thought @weafscast had a draft PR. Let us know if we should pick it back up.

terrytangyuan commented 1 year ago

> We paused our efforts on go-git as we thought @weafscast had a draft PR. Let us know if we should pick it back up.

Yes, feel free to pick it up since it's been quiet for a while.

aldeed commented 1 year ago

That all sounds good. It's more of a separate feature request, but I do think that having a way to force running as root would be useful even after the Git issue is fixed. I tried runAsUser: 0, but it seems to be ignored. If you agree, I can submit an enhancement proposal.
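
For reference, the attempt that seemed to be ignored presumably looked something like this (a sketch; note that Kubernetes only admits runAsUser: 0 when runAsNonRoot is false or unset):

      securityContext:
        runAsNonRoot: false
        # Explicitly request root; per the comment above, this appears to be
        # ignored rather than forcing the containers to run as UID 0.
        runAsUser: 0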

terrytangyuan commented 1 year ago

Yes, feel free to submit an enhancement proposal. We'll see if other community members are also interested and then prioritize accordingly.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.