argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

v3.5.11: EKS Pod Identity S3 Artifact is not working #13659

Closed tmsquill closed 2 months ago

tmsquill commented 2 months ago

Pre-requisites

What happened? What did you expect to happen?

I am attempting to use the EKS Pod Identity feature to grant workflow pods access to an S3 bucket for use with artifacts. I have followed the documentation listed here and here. Furthermore, I am aware of issues https://github.com/argoproj/argo-workflows/issues/12949 and https://github.com/argoproj/argo-workflows/issues/12650.

The relevant excerpt from my workflow manifest is:

    - name: generate-colormap
      inputs:
        artifacts:
          - name: raster
            path: "{{inputs.parameters.localWorkingDirectory}}/input.tiff"
            s3:
              key: "{{inputs.parameters.sourceRaster}}"
              endpoint: s3.amazonaws.com
              bucket: example-bucket
              region: us-east-1
              useSDKCreds: true
      ...

This workflow uses the argo-workflow service account, which is annotated correctly as per the EKS Pod Identity requirements:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<REDACTED>:role/<REDACTED>
  ...
  name: argo-workflow
  namespace: workflows

I have run kubectl describe on the relevant pod to inspect the environment variables assigned to the init container and confirm that the Pod Identity association is working correctly. The five expected environment variables are present:

Init Containers:
  init:
    Image:      quay.io/argoproj/argoexec:v3.5.11
    Port:       <none>
    Host Port:  <none>
    Command:
      argoexec
      init
      --loglevel
      info
      --log-format
      text
    Environment:
      ...
      AWS_STS_REGIONAL_ENDPOINTS:              regional
      AWS_DEFAULT_REGION:                      us-east-2
      AWS_REGION:                              us-east-2
      AWS_CONTAINER_CREDENTIALS_FULL_URI:      http://169.254.170.23/v1/credentials
      AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE:  /var/run/secrets/pods.eks.amazonaws.com/serviceaccount/eks-pod-identity-token
    ...
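
As a rough illustration of the check described above, the following sketch classifies an environment by the credential-related variable names it contains. The Pod Identity names are taken from the output above; the IRSA names (`AWS_ROLE_ARN`, `AWS_WEB_IDENTITY_TOKEN_FILE`) are an assumption not shown in this issue, and `credential_mechanisms` is a hypothetical helper, not part of Argo or any AWS tool.

```python
# Illustrative only: guess which credential mechanism a container's
# environment points at, based purely on the variable names it contains.

# Variables shown in the `kubectl describe` output in this issue.
POD_IDENTITY_VARS = {
    "AWS_CONTAINER_CREDENTIALS_FULL_URI",
    "AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE",
}

# Assumption: variables used for IRSA (web identity); not shown in this issue.
IRSA_VARS = {
    "AWS_ROLE_ARN",
    "AWS_WEB_IDENTITY_TOKEN_FILE",
}


def credential_mechanisms(env):
    """Return the mechanisms whose variables are all present in env."""
    found = []
    if POD_IDENTITY_VARS.issubset(env):
        found.append("eks-pod-identity")
    if IRSA_VARS.issubset(env):
        found.append("irsa")
    return found


# Environment copied from the init container output in this issue.
init_env = {
    "AWS_STS_REGIONAL_ENDPOINTS": "regional",
    "AWS_DEFAULT_REGION": "us-east-2",
    "AWS_REGION": "us-east-2",
    "AWS_CONTAINER_CREDENTIALS_FULL_URI": "http://169.254.170.23/v1/credentials",
    "AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE": (
        "/var/run/secrets/pods.eks.amazonaws.com/serviceaccount/"
        "eks-pod-identity-token"
    ),
}

print(credential_mechanisms(init_env))  # ['eks-pod-identity']
```

The presence of these variables only shows that the association was injected into the pod; it says nothing about whether the client inside the container is able to use them.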

Furthermore, I have tested this Pod Identity association with the following (note the use of the relevant service account) and it works:

kubectl run --rm -it --namespace workflows aws-cli-pod --image amazon/aws-cli:latest --overrides='{ "spec": { "serviceAccount": "argo-workflow" } }' --command -- /bin/sh -c "aws s3 ls s3://example-bucket"

Logs from the init container, which is responsible for fetching the artifacts from the S3 bucket, are listed below:

time="2024-09-24T21:46:23.193Z" level=info msg="Starting Workflow Executor" version=v3.5.11
time="2024-09-24T21:46:23.198Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-09-24T21:46:23.198Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=workflows podName=pmtiles-converter-raster8ck49-generate-colormap-180367858 templateName=generate-colormap version="&Version{Version:v3.5.11,BuildDate:2024-09-20T14:09:00Z,GitCommit:25bbb71cced32b671f9ad35f0ffd1f0ddb8226ee,GitTag:v3.5.11,GitTreeState:clean,GoVersion:go1.21.13,Compiler:gc,Platform:linux/amd64,}"
time="2024-09-24T21:46:23.287Z" level=info msg="Loading script source to /argo/staging/script"
time="2024-09-24T21:46:23.287Z" level=info msg="Start loading input artifacts..."
time="2024-09-24T21:46:23.287Z" level=info msg="Downloading artifact: raster"
time="2024-09-24T21:46:23.287Z" level=info msg="Specified artifact path /tmp/workflow/input.tiff overlaps with volume mount at /tmp/workflow. Extracting to volume mount"
time="2024-09-24T21:46:23.287Z" level=info msg="S3 Load path: /mainctrfs/tmp/workflow/input.tiff.tmp, key: <REDACTED>.tif"
time="2024-09-24T21:46:23.297Z" level=info msg="Creating minio client using AWS SDK credentials"
2024/09/24 21:46:23 Ignoring, HTTP credential provider invalid endpoint host, "169.254.170.23", only loopback hosts are allowed. <nil>
time="2024-09-24T21:46:23.297Z" level=warning msg="Non-transient error: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"
time="2024-09-24T21:46:23.297Z" level=info msg="Load artifact" artifactName=raster duration=10.195002ms error="failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors" key=<REDACTED>.tif
time="2024-09-24T21:46:23.297Z" level=error msg="executor error: artifact raster failed to load: failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"
time="2024-09-24T21:46:23.297Z" level=info msg="Alloc=10694 TotalAlloc=16611 Sys=23397 NumGC=4 Goroutines=4"
time="2024-09-24T21:46:23.297Z" level=fatal msg="artifact raster failed to load: failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"

The line 2024/09/24 21:46:23 Ignoring, HTTP credential provider invalid endpoint host, "169.254.170.23", only loopback hosts are allowed. uses a different format from the other log lines, which makes me think something outside of the Argo-related code in the init container is refusing to call the credentials endpoint.
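
The wording of that message suggests a client-side check that accepts only loopback hosts for the container credentials URI. The sketch below shows what such a check could look like and why the Pod Identity endpoint would fail it. It is not the actual code of the AWS SDK or of Argo, and `host_is_loopback` is a hypothetical helper.

```python
import ipaddress
from urllib.parse import urlparse


def host_is_loopback(uri):
    """Return True if the URI's host is 'localhost' or a loopback IP."""
    host = urlparse(uri).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Assumption: other hostnames are rejected without DNS resolution.
        return False


# Value of AWS_CONTAINER_CREDENTIALS_FULL_URI from the init container.
pod_identity_uri = "http://169.254.170.23/v1/credentials"
addr = ipaddress.ip_address(urlparse(pod_identity_uri).hostname)

print(host_is_loopback(pod_identity_uri))                   # False
print(addr.is_link_local)                                   # True
print(host_is_loopback("http://127.0.0.1/v1/credentials"))  # True
```

The address 169.254.170.23 is link-local rather than loopback, so a client with a loopback-only rule would skip this provider and, with no other credentials configured, end up with an empty provider chain as in the logs.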

I am using EKS Pod Identity with other software in my EKS cluster without any problems.

EKS Version: v1.30.3-eks-a18cd3a
Argo Workflows Helm Chart: 0.42.3

It's unclear to me whether this feature works but is documented incorrectly, whether it does not work as intended, or whether I am just doing something wrong.

Version(s)

v3.5.11

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

See the description of issue.

Logs from the workflow controller

Not relevant to this issue.

Logs from your workflow's wait container

Not relevant to this issue.

agilgur5 commented 2 months ago

Furthermore I am aware of issues #12949

Version(s)

v3.5.11

#12949 states that the AWS SDK Go change was not backported to 3.5. I still don't believe it has been, so this would be a duplicate.

  • [x] I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.

This should work on 3.6.0-rc1 and :latest. You did not describe attempts with :latest.

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

See the description of issue.

The relevant excerpt from my workflow manifest is:

An excerpt is also not a reproducible workflow.

tmsquill commented 2 months ago

Oh, I think I understand now. I just reviewed the comments in #12949 in detail and tried :latest, and it works. I am not comfortable running :latest in my environment, so I will wait for 3.6.X to release and use an alternative approach in the meantime. Thanks for clearing this up.

The documentation for v3.5.11 is misleading. The section here for IRSA / Pod Identity shows how to use a feature that does not work / is not implemented.

agilgur5 commented 2 months ago

The documentation for v3.5.11 is misleading. The section here for IRSA / Pod Identity shows how to use a feature that does not work / is not implemented.

IRSA apparently does work, but Pod Identity does not, as far as I understand. I don't think anyone knew Pod Identity required specific SDK support until #12650, which was written by an AWS employee.

You're also reading the docs for latest as opposed to the docs for release-3.5, although I did accidentally backport the docs sentence on Pod Identity in https://github.com/argoproj/argo-workflows/commit/9b337f8ef95e1ae6db036febeb65ea7167ad8b9b. Both of those lines were added very recently, and I only remembered #12650 myself this weekend.

agilgur5 commented 2 months ago

Correction on the above: the reference to Pod Identity in that commit is for self-hosted set-ups, and that should indeed still work for EKS and non-EKS if you install it yourself.

But that's different from, and predates, the built-in EKS Pod Identity, which AWS apparently released less than a year ago, in Nov 2023. Meanwhile, the Pod Identity Webhook dates back to ~June 2019.

And both are different from IRSA, see also this AWS blog post comparing the two.

So uh, I'd chalk up that confusion to AWS itself 🙃