argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

S3 Artifact error on ARM #13588

Open wesleyhaws opened 2 months ago

wesleyhaws commented 2 months ago

Pre-requisites

What happened? What did you expect to happen?

When running a working workflow template on an ARM-based node, vs. an x86 node, it will fail to retrieve logs and fail the workflow with the error:

Error (exit code 1): failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors

The workflow will launch and run successfully, but it fails when the logs are being uploaded to S3.

NOTE: This works 100% fine on x86 (amd64) architectures. I tried it with no configuration changes between the ARM and x86 node architectures.

To me this appears to be an issue with the configuration in argoexec, or whatever component runs the logic to assume an AWS role to read and write log files to the target S3 bucket.

Logs from the pod itself:

time="2024-09-11T20:03:12.557Z" level=info msg="capturing logs" argo=true
WORKS!
time="2024-09-11T20:03:13.558Z" level=info msg="sub-process exited" argo=true error="<nil>"

Version(s)

Helm chart: "0.42.1", quay.io/argoproj/argocli:v3.5.10, quay.io/argoproj/argocli@sha256:ccc55a32c9739f6d3e7649fe8b896ea90980fae95c4d318a43610dc80d20ddf9

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't submit a workflow that uses private images.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: argo-workflows-testing-role
  namespace: argo-workflows
automountServiceAccountToken: true

---

apiVersion: v1
kind: Secret
metadata:
  name: argo-workflows-testing-role.service-account-token
  namespace: argo-workflows
  annotations:
    kubernetes.io/service-account.name: argo-workflows-testing-role
type: kubernetes.io/service-account-token

---

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argo-workflows-testing-role
rules:
- apiGroups: # https://argo-workflows.readthedocs.io/en/latest/workflow-rbac/
    - argoproj.io
  resources:
    - workflowtaskresults
  verbs:
    - create
    - patch
- apiGroups:
    - ""
  resources:
    - pods
  verbs:
    - get
    - patch

---

# Will attach the first defined service account to the above role.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: argo-workflows-testing-rb
subjects:
- kind: ServiceAccount
  name: argo-workflows-testing-role
  namespace: argo-workflows
roleRef:
  kind: ClusterRole
  name: argo-workflows-testing-role
  apiGroup: rbac.authorization.k8s.io

---

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: trigger-test
spec:
  ttlStrategy:
    secondsAfterCompletion: 86400
    secondsAfterFailure: 3600    
  entrypoint: trigger
  templates:
    - name: trigger
      serviceAccountName: argo-workflows-testing-role
      metadata:
        annotations:
          kubectl.kubernetes.io/default-container: main
        labels:
          role: argo-workflows-testing-role
      script:
        image: "bash:latest"
        command: [bash]
        source: |
         echo "WORKS!"

Logs from the workflow controller

kubectl logs deployment/argo-workflows-workflow-controller --tail=100 | grep trigger-test

time="2024-09-11T20:02:47.002Z" level=info msg="Processing workflow" Phase= ResourceVersion=1982565562 namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:02:47.016Z" level=info msg="Task-result reconciliation" namespace=argo-workflows numObjs=0 workflow=trigger-test-bnzpr
time="2024-09-11T20:02:47.016Z" level=info msg="Updated phase  -> Running" namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:02:47.017Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:02:47.017Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:02:47.017Z" level=info msg="Pod node trigger-test-bnzpr initialized Pending" namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:02:47.037Z" level=info msg="Created pod: trigger-test-bnzpr (trigger-test-bnzpr)" namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:02:47.037Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:02:47.037Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:02:47.048Z" level=info msg="Workflow update successful" namespace=argo-workflows phase=Running resourceVersion=1982565566 workflow=trigger-test-bnzpr
time="2024-09-11T20:02:57.038Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1982565566 namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:02:57.039Z" level=info msg="Task-result reconciliation" namespace=argo-workflows numObjs=0 workflow=trigger-test-bnzpr
time="2024-09-11T20:02:57.054Z" level=info msg="node changed" namespace=argo-workflows new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=trigger-test-bnzpr old.message= old.phase=Pending old.progress=0/1 workflow=trigger-test-bnzpr
time="2024-09-11T20:02:57.054Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:02:57.054Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:02:57.063Z" level=info msg="Workflow update successful" namespace=argo-workflows phase=Running resourceVersion=1982565775 workflow=trigger-test-bnzpr
time="2024-09-11T20:03:20.210Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1982565775 namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:03:20.210Z" level=info msg="Task-result reconciliation" namespace=argo-workflows numObjs=1 workflow=trigger-test-bnzpr
time="2024-09-11T20:03:20.210Z" level=info msg="task-result changed" namespace=argo-workflows nodeID=trigger-test-bnzpr workflow=trigger-test-bnzpr
time="2024-09-11T20:03:20.210Z" level=info msg="node changed" namespace=argo-workflows new.message= new.phase=Running new.progress=0/1 nodeID=trigger-test-bnzpr old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=trigger-test-bnzpr
time="2024-09-11T20:03:20.210Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:03:20.210Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:03:20.216Z" level=info msg="cleaning up pod" action=terminateContainers key=argo-workflows/trigger-test-bnzpr/terminateContainers
time="2024-09-11T20:03:20.220Z" level=info msg="Workflow update successful" namespace=argo-workflows phase=Running resourceVersion=1982566292 workflow=trigger-test-bnzpr
time="2024-09-11T20:03:20.232Z" level=info msg="https://172.20.0.1:443/api/v1/namespaces/argo-workflows/pods/trigger-test-bnzpr/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-09-11T20:03:23.738Z" level=info msg="signaled container" container=wait error="Internal error occurred: error executing command in container: failed to exec in container: failed to create exec \"fea15772efec59bbc64b0cd77d695139bc02e71933228e8f7f016252c6190948\": cannot exec in a deleted state: unknown" namespace=argo-workflows pod=trigger-test-bnzpr stderr="<nil>" stdout="<nil>"
time="2024-09-11T20:03:34.266Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1982566292 namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:03:34.267Z" level=info msg="Task-result reconciliation" namespace=argo-workflows numObjs=1 workflow=trigger-test-bnzpr
time="2024-09-11T20:03:34.268Z" level=info msg="task-result changed" namespace=argo-workflows nodeID=trigger-test-bnzpr workflow=trigger-test-bnzpr
time="2024-09-11T20:03:34.269Z" level=info msg="Pod failed: Error (exit code 1): failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors" displayName=trigger-test-bnzpr namespace=argo-workflows pod=trigger-test-bnzpr templateName=trigger workflow=trigger-test-bnzpr
time="2024-09-11T20:03:34.269Z" level=info msg="node changed" namespace=argo-workflows new.message="Error (exit code 1): failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors" new.phase=Error new.progress=0/1 nodeID=trigger-test-bnzpr old.message= old.phase=Running old.progress=0/1 workflow=trigger-test-bnzpr
time="2024-09-11T20:03:34.269Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:03:34.269Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:03:34.269Z" level=info msg="Updated phase Running -> Error" namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:03:34.270Z" level=info msg="Updated message  -> Error (exit code 1): failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors" namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:03:34.270Z" level=info msg="Marking workflow completed" namespace=argo-workflows workflow=trigger-test-bnzpr
time="2024-09-11T20:03:34.280Z" level=info msg="Workflow update successful" namespace=argo-workflows phase=Error resourceVersion=1982566645 workflow=trigger-test-bnzpr
time="2024-09-11T20:03:34.281Z" level=info msg="Queueing Error workflow argo-workflows/trigger-test-bnzpr for delete in 24h0m0s due to TTL"
time="2024-09-11T20:03:34.305Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo-workflows/trigger-test-bnzpr/labelPodCompleted
time="2024-09-11T20:03:53.738Z" level=info msg="cleaning up pod" action=killContainers key=argo-workflows/trigger-test-bnzpr/killContainers

Logs from in your workflow's wait container

kubectl logs -n argo-workflows trigger-test-bnzpr -c wait

time="2024-09-11T21:55:51.385Z" level=info msg="Starting Workflow Executor" version=v3.5.10
time="2024-09-11T21:55:51.388Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-09-11T21:55:51.388Z" level=info msg="Executor initialized" deadline="2024-09-11 23:54:58 +0000 UTC" includeScriptOutput=false namespace=argo-workflows podName=trigger-test-bnzpr templateName=trigger version="&Version{Version:v3.5.10,BuildDate:2024-08-01T05:12:26Z,GitCommit:25829927431d9a0f46d17b72ae74aedb8d700884,GitTag:v3.5.10,GitTreeState:clean,GoVersion:go1.21.12,Compiler:gc,Platform:linux/arm64,}"
time="2024-09-11T21:55:51.404Z" level=info msg="Starting deadline monitor"
time="2024-09-11T21:56:31.445Z" level=info msg="Main container completed" error="<nil>"
time="2024-09-11T21:56:31.445Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-09-11T21:56:31.445Z" level=info msg="No output parameters"
time="2024-09-11T21:56:31.445Z" level=info msg="No output artifacts"
time="2024-09-11T21:56:31.445Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: trigger-test-bnzpr/trigger-test-bnzpr/main.log"
time="2024-09-11T21:56:31.454Z" level=info msg="Creating minio client using assumed-role credentials" roleArn="arn:aws:iam::<redacted>:role/my-artifact-role"
time="2024-09-11T21:56:34.613Z" level=warning msg="Non-transient error: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"
time="2024-09-11T21:56:34.613Z" level=info msg="Save artifact" artifactName=main-logs duration=3.168001265s error="failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors" key=trigger-test-bnzpr/trigger-test-bnzpr/main.log
time="2024-09-11T21:56:34.613Z" level=error msg="executor error: failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"
time="2024-09-11T21:56:34.630Z" level=info msg="Alloc=10955 TotalAlloc=17092 Sys=24421 NumGC=4 Goroutines=8"
time="2024-09-11T21:56:34.645Z" level=fatal msg="failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"
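For reference, the "Using executor retry strategy" line in these logs (Duration=1s, Factor=1.6, Jitter=0.5, Steps=5) means the executor retries transient S3 errors with exponential backoff before giving up. Ignoring jitter, the waits look roughly like this stdlib-only sketch (an illustration, not Argo's actual code):

```python
# Sketch (not Argo's code) of the backoff described by the log line
# "Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5.
# Jitter is omitted so the schedule is deterministic.
def backoff_schedule(duration=1.0, factor=1.6, steps=5):
    """Seconds to wait before each retry attempt."""
    waits, wait = [], duration
    for _ in range(steps):
        waits.append(round(wait, 4))
        wait *= factor
    return waits

print(backoff_schedule())  # [1.0, 1.6, 2.56, 4.096, 6.5536]
```

That roughly accounts for the ~3s spent between "Creating minio client" and the "Non-transient error" above: the first couple of retries happen, then the credential error is treated as non-transient.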

k logs deployment/argo-workflows-server --tail=300 | grep trigger-test

time="2024-09-11T20:02:47.227Z" level=info duration=6.731845ms method=GET path=/api/v1/workflows/argo-workflows/trigger-test-bnzpr size=9044 status=0
time="2024-09-11T20:02:47.388Z" level=info duration=9.271235ms method=GET path=/api/v1/workflows/argo-workflows/trigger-test-bnzpr size=9044 status=0
time="2024-09-11T20:02:47.448Z" level=info duration=6.964976ms method=GET path=/api/v1/workflows/argo-workflows/trigger-test-bnzpr size=9044 status=0
agilgur5 commented 2 months ago

This is a follow-up from Slack. The error is coming from the S3 client so I imagine it's an upstream bug

As I mentioned there, could you provide more environment info such as the ARM instance type you used and anything else (EKS version?)?

agilgur5 commented 2 months ago

NOTE: On the below replace the <replace_image> with any image that has arm support, it doesn't really matter as it's happening for everything

Could you still give a public image here that you've confirmed has the issue? For instance, busybox. The easier and more minimal the reproduction, the better.

argoexec whatever is running the logic to assume a aws role to read and write log files to a target s3 bucket.

The init container downloads artifacts and the wait container uploads them. Those both run the argoexec image, and that is where the error would come from, correct.

You don't have any artifacts defined in your Workflow but it sounds like you have archiveLogs: true configured? That wasn't in the config snippet you provided, could you provide more of your config for repro purposes?

(Not sure if the wait container is created in another namespace, please advise.)

It's a sidecar to your Workflow, so it depends on where you ran the Workflow -- there's no metadata.namespace specified in your WorkflowTemplate and no corresponding Workflow, so I can't determine that myself

Since you have a TTL defined, you'll need to check the Pod's logs before the TTL. It will probably end with the same error message, but there might be some other useful debugging info before that

wesleyhaws commented 2 months ago

EKS Version: 1.29
Node Family: ["c6g", "c7g", "c8g"]

It's a hybrid environment, so we launch argo-workflows & argocd (along with all workflows) on Graviton (ARM) instance types. The instances run common roles that are in the "Trusted Relationships" of the above-defined artifact role. I know the artifact role is set up properly because it is able to store and retrieve logs from the backend S3 bucket. The EC2 IAM role applied to the nodes is the same between the x86 and ARM nodes.

The helm definition is setup with:

argo-workflows-0-42-1:
  fullnameOverride: argo-workflows
  singleNamespace: false
  workflow:
    serviceAccount:
      create: true
  server:
    serviceAccount:
      create: true
    ingress:
      ...
    extraArgs:
      ...
    sso:
      ...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: <node_group_selector_name>
                operator: In
                values:
                  - <target_node>
  useDefaultArtifactRepo: true
  useStaticCredentials: false
  artifactRepository:
    archiveLogs: true
    s3:
      bucket: my-target-s3-artifacts-bucket
      endpoint: s3.amazonaws.com
      region: my-region
      roleARN: arn:aws:iam::<redacted>:role/my-artifact-role
  controller:
    workflowDefaults:
      spec:
        activeDeadlineSeconds: 7200 # workflow is killed after 2h
        ttlStrategy:
          secondsAfterCompletion: 172800 # workflow is deleted after 2d
    metricsConfig:
      enabled: true
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: ....
    affinity:
      ...

I have included the pod logs above, but they were for the main pod. I'll try to rerun this, get the logs from the sidecar wait container, and edit the above to include them.

The S3 bucket doesn't have anything special other than a policy to prevent uploading unencrypted artifacts. Policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RequireSecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::my-target-s3-artifacts-bucket",
                "arn:aws:s3:::my-target-s3-artifacts-bucket/*"
            ],
            "Condition": {
                "NumericLessThan": {
                    "s3:TlsVersion": "1.2"
                }
            }
        }
    ]
}
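To convince myself this policy only blocks pre-TLS-1.2 traffic (and so shouldn't affect the SDK, which negotiates TLS 1.2+), the NumericLessThan condition can be checked with a small stdlib-only sketch; the evaluator is a hypothetical helper that only understands this single operator:

```python
import json

# The bucket policy from above; denied_for_tls() is a toy evaluator, not IAM.
policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "RequireSecureTransport",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": [
      "arn:aws:s3:::my-target-s3-artifacts-bucket",
      "arn:aws:s3:::my-target-s3-artifacts-bucket/*"
    ],
    "Condition": {"NumericLessThan": {"s3:TlsVersion": "1.2"}}
  }]
}
""")

def denied_for_tls(tls_version):
    """True if the Deny statement matches a request made with this TLS version."""
    stmt = policy["Statement"][0]
    threshold = float(stmt["Condition"]["NumericLessThan"]["s3:TlsVersion"])
    return stmt["Effect"] == "Deny" and tls_version < threshold

print(denied_for_tls(1.1), denied_for_tls(1.2))  # True False
```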

It also has the following all enabled:

Bucket owner enforced
Block public access to buckets and objects granted through new access control lists (ACLs)
Block public access to buckets and objects granted through any access control lists (ACLs)
Block public access to buckets and objects granted through new public bucket or access point policies
Block public and cross-account access to buckets and objects through any public bucket or access point policies
default SSE-S3 encryption
wesleyhaws commented 2 months ago

@agilgur5 I have included the wait container logs.

NOTE: Here are the wait container logs when running on x86:

time="2024-09-11T17:05:10.173Z" level=info msg="Starting Workflow Executor" version=v3.5.10
time="2024-09-11T17:05:10.177Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-09-11T17:05:10.177Z" level=info msg="Executor initialized" deadline="2024-09-11 19:05:07 +0000 UTC" includeScriptOutput=false namespace=argo-workflows podName=trigger-test-s92vt templateName=trigger version="&Version{Version:v3.5.10,BuildDate:2024-08-01T05:11:25Z,GitCommit:25829927431d9a0f46d17b72ae74aedb8d700884,GitTag:v3.5.10,GitTreeState:clean,GoVersion:go1.21.12,Compiler:gc,Platform:linux/amd64,}"
time="2024-09-11T17:05:10.192Z" level=info msg="Starting deadline monitor"
time="2024-09-11T17:05:12.192Z" level=info msg="Main container completed" error="<nil>"
time="2024-09-11T17:05:12.192Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-09-11T17:05:12.192Z" level=info msg="No output parameters"
time="2024-09-11T17:05:12.192Z" level=info msg="No output artifacts"
time="2024-09-11T17:05:12.193Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: trigger-test-s92vt/trigger-test-s92vt/main.log"
time="2024-09-11T17:05:12.200Z" level=info msg="Creating minio client using assumed-role credentials" roleArn="arn:aws:iam::<redacted>:role/my-artifact-role"
time="2024-09-11T17:05:12.245Z" level=info msg="Saving file to s3" bucket=my-target-s3-artifacts-bucket endpoint=s3.amazonaws.com key=trigger-test-s92vt/trigger-test-s92vt/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-09-11T17:05:12.317Z" level=info msg="Save artifact" artifactName=main-logs duration=124.538014ms error="<nil>" key=trigger-test-s92vt/trigger-test-s92vt/main.log
time="2024-09-11T17:05:12.317Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-09-11T17:05:12.317Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-09-11T17:05:12.334Z" level=info msg="Alloc=8117 TotalAlloc=17868 Sys=29797 NumGC=5 Goroutines=12"
time="2024-09-11T17:05:12.340Z" level=info msg="Deadline monitor stopped"
agilgur5 commented 2 months ago

Thanks for the extra details!

The helm definition is setup with

So you don't have workflow archiving set up, just archived logs? Usually they're used together, so just double-checking

Also reminder to use syntax highlighting in markdown code blocks, i.e. "```yaml". I edit them in for a lot of people 😅

NOTE: Here are the wait container logs when running on x86:

Yep those are normal. I see you updated the description with the ARM variant which does have the error in the logs. Thanks!

wesleyhaws commented 2 months ago

Tried adding this to the service account this morning:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: argo-workflows-testing-role
  namespace: argo-workflows
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<redacted>:role/my-other-iam-role
automountServiceAccountToken: true

This actually makes the job "pass", but there is an internal server error when attempting to access the logs.

Logs from wait container:

time="2024-09-12T14:38:30.724Z" level=info msg="Starting Workflow Executor" version=v3.5.10
time="2024-09-12T14:38:30.728Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-09-12T14:38:30.728Z" level=info msg="Executor initialized" deadline="2024-09-12 16:38:28 +0000 UTC" includeScriptOutput=false namespace=argo-workflows podName=trigger-test-lhp4c templateName=trigger version="&Version{Version:v3.5.10,BuildDate:2024-08-01T05:12:26Z,GitCommit:25829927431d9a0f46d17b72ae74aedb8d700884,GitTag:v3.5.10,GitTreeState:clean,GoVersion:go1.21.12,Compiler:gc,Platform:linux/arm64,}"
time="2024-09-12T14:38:30.745Z" level=info msg="Starting deadline monitor"
time="2024-09-12T14:38:32.745Z" level=info msg="Main container completed" error="<nil>"
time="2024-09-12T14:38:32.745Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-09-12T14:38:32.745Z" level=info msg="No output parameters"
time="2024-09-12T14:38:32.745Z" level=info msg="No output artifacts"
time="2024-09-12T14:38:32.745Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: trigger-test-lhp4c/trigger-test-lhp4c/main.log"
time="2024-09-12T14:38:32.754Z" level=info msg="Creating minio client using assumed-role credentials" roleArn="arn:aws:iam::<redacted>:role/my-artifact-role"
time="2024-09-12T14:38:32.844Z" level=info msg="Saving file to s3" bucket=my-target-s3-artifacts-bucket endpoint=s3.amazonaws.com key=trigger-test-lhp4c/trigger-test-lhp4c/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-09-12T14:38:32.955Z" level=info msg="Save artifact" artifactName=main-logs duration=210.087779ms error="<nil>" key=trigger-test-lhp4c/trigger-test-lhp4c/main.log
time="2024-09-12T14:38:32.955Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-09-12T14:38:32.955Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-09-12T14:38:32.969Z" level=info msg="Alloc=8054 TotalAlloc=17586 Sys=29029 NumGC=5 Goroutines=12"
time="2024-09-12T14:38:32.975Z" level=info msg="Deadline monitor stopped"

From the above logs it appears to have stored the logs in S3 properly. I checked, and yes, they are there. So then I looked at the argo-workflows-server pod logs:

time="2024-09-12T14:40:00.729Z" level=info msg="Get artifact file" artifactName=main-logs namespace=argo-workflows nodeId=trigger-test-lhp4c workflowName=trigger-test-lhp4c
time="2024-09-12T14:40:00.742Z" level=info msg="Creating minio client using assumed-role credentials" roleArn="arn:aws:iam::<redacted>:role/my-artifact-role"
time="2024-09-12T14:40:03.881Z" level=info msg="Check if directory" artifactName=main-logs duration=3.142196956s error="NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors" key=trigger-test-lhp4c/trigger-test-lhp4c/main.log
time="2024-09-12T14:40:03.881Z" level=error msg="Artifact Server returned internal error" error="NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"
time="2024-09-12T14:40:03.881Z" level=info duration=3.157595799s method=GET path=/artifact-files/argo-workflows/workflows/trigger-test-lhp4c/trigger-test-lhp4c/outputs/main-logs size=22 status=500
time="2024-09-12T14:40:05.696Z" level=info msg="selected SSO RBAC service account for user" email=<redacted> loginServiceAccount=<redacted> serviceAccount=<redacted> ssoDelegated=false ssoDelegationAllowed=false subject=00u1xd1tgaNMdgPNo697
time="2024-09-12T14:40:05.696Z" level=info msg="Get artifact file" artifactName=main-logs namespace=argo-workflows nodeId=trigger-test-lhp4c workflowName=trigger-test-lhp4c
time="2024-09-12T14:40:05.704Z" level=info msg="Creating minio client using assumed-role credentials" roleArn="arn:aws:iam::<redacted>:role/my-artifact-role"
time="2024-09-12T14:40:06.165Z" level=info duration="82.058µs" method=GET path=index.html size=487 status=0
time="2024-09-12T14:40:08.853Z" level=info msg="Check if directory" artifactName=main-logs duration=3.152147996s error="NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors" key=trigger-test-lhp4c/trigger-test-lhp4c/main.log
time="2024-09-12T14:40:08.853Z" level=error msg="Artifact Server returned internal error" error="NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"
time="2024-09-12T14:40:08.853Z" level=info duration=3.161819171s method=GET path=/artifact-files/argo-workflows/workflows/trigger-test-lhp4c/trigger-test-lhp4c/outputs/main-logs size=22 status=500
time="2024-09-12T14:40:26.164Z" level=info duration="74.739µs" method=GET path=index.html size=487 status=0

Now, as you can see from the above, the server fails to retrieve these logs even though it seems to know which role to assume? Weird. Is there a certain configuration on the server that I should try? Something about this functionality is changing between x86 and ARM. Very strange.

wesleyhaws commented 2 months ago

I just found this in the docs. It seems like I need to add this same eks annotation to the server pod: https://argo-workflows.readthedocs.io/en/latest/configure-artifact-repository/#aws-s3-irsa

I'm a little unsure of the effect this might have on our workflows, and also a little unsure what to put here, but I'll play around with it for a bit to figure something out.

agilgur5 commented 2 months ago

I was looking upstream and did find https://github.com/aws/aws-sdk-go/issues/2914 which has a number of workarounds and happens to have a back link from https://github.com/argoproj/argo-workflows/issues/10122 too.

My guess is that some configuration is being interpreted differently on ARM vs amd64, maybe an encoding difference or something 🤔
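One common cause of this exact error on EKS pods (and a recurring workaround around that upstream issue) is the IMDSv2 hop limit: with a hop limit of 1, pods on the pod network cannot reach the instance metadata service at all, so the SDK falls through the whole chain to NoCredentialProviders. Since ARM and x86 node groups may come from different launch templates, it's worth ruling out on the affected nodes (the instance ID below is a placeholder):

```shell
# Check the current hop limit on an affected ARM node (placeholder instance ID),
# then raise it so containers one network hop away can still reach IMDS.
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].MetadataOptions'
aws ec2 modify-instance-metadata-options --instance-id i-0123456789abcdef0 \
  --http-put-response-hop-limit 2 --http-endpoint enabled
```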

It seems like I need to add this same eks annotation to the server pod

Yes, your Server needs to authenticate to S3 as well if you want to view artifacts in the UI, although it only needs read access. That doc is being worked on too as it's overbroad right now while simultaneously not having enough details: https://github.com/argoproj/argo-workflows/pull/12467#discussion_r1752287700 / https://github.com/argoproj/argo-workflows/pull/13425#pullrequestreview-2214101470

wesleyhaws commented 2 months ago

🤔 I have tried:

server:
  podAnnotations:
    "eks.amazonaws.com/role-arn": arn:aws:iam::<redacted>:role/my-artifact-role

With no luck. I have edited the trust policy of that role to include the RBAC service account. I also tried it with the service account argo-workflows-testing-role. Still the same error :(

EDIT: Here are the configurations I have tried in the trust policy for the my-artifact-role role:

{
    "Effect": "Allow",
    "Principal": {
        "Federated": "arn:aws:iam::<redacted>:oidc-provider/oidc.eks.<redacted>.amazonaws.com/id/<redacted>"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
        "StringEquals": {
            "oidc.eks.<redacted>.amazonaws.com/id/<redacted>:sub": "system:serviceaccount:argo-workflows:argo-workflows-server"
        }
    }
},

I have tried the service account associated with the argo workflows server pod (argo-workflows-server) and the service account associated with what rbac says it's giving me in the logs.
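An easy thing to get wrong with IRSA is a mismatch between the trust policy's `sub` condition and the exact `system:serviceaccount:<namespace>:<name>` string of the pod's service account. As a quick stdlib-only sanity check (all ARNs and OIDC IDs below are placeholders, and the helper is hypothetical, not an AWS API):

```python
import json

# Placeholder version of the trust statement above; trust_allows() just
# compares the StringEquals "sub" condition against a candidate SA.
trust_statement = json.loads("""
{
  "Effect": "Allow",
  "Principal": {
    "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.example.amazonaws.com/id/EXAMPLE"
  },
  "Action": "sts:AssumeRoleWithWebIdentity",
  "Condition": {
    "StringEquals": {
      "oidc.eks.example.amazonaws.com/id/EXAMPLE:sub": "system:serviceaccount:argo-workflows:argo-workflows-server"
    }
  }
}
""")

def trust_allows(namespace, service_account):
    """Does the sub condition match system:serviceaccount:<ns>:<sa>?"""
    expected = "system:serviceaccount:%s:%s" % (namespace, service_account)
    conds = trust_statement["Condition"]["StringEquals"]
    return any(v == expected for k, v in conds.items() if k.endswith(":sub"))

print(trust_allows("argo-workflows", "argo-workflows-server"))        # True
print(trust_allows("argo-workflows", "argo-workflows-testing-role"))  # False
```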

EDIT2: After re-reading the above I think I'm having a brain fart. This annotation needs to go on the argo workflows server's service account, NOT the pod. I'll try to re-adjust.

EDIT3: Applied this annotation to both the rbac service account and the server service account with no change, same error.

server:
  serviceAccount:
    annotations:
      "eks.amazonaws.com/role-arn": arn:aws:iam::<redacted>:role/my-artifact-role
wesleyhaws commented 2 months ago

I think I need some more background on how authentication is being done. There's a lot of confusing setup happening here between the init, server, controller, wait, and main containers.

How is the authentication flow happening?

I have tried many different configurations above and still can't seem to get things working. If I had a little bit more background I think that might help me debug some more.

agilgur5 commented 1 month ago

There's a lot of confusing setup happening here between the init, server, controller, wait, and main containers.

You may have over-complicated it 😅

The init and wait containers are responsible for input and output artifacts, respectively. The main container just shares the same filesystem as the other two; I described this part above. The executor works this way so that it (usually) doesn't require changes to the user's container in order to interact with Argo (including things like artifacts, deadlines/timeouts, stops, etc.)

The Server just reads artifacts if you use them in the UI or via API, as described above.

The Controller does not interact with artifacts on its own, but reads artifact configurations and makes sure that the init, wait, and artifact GC containers all have a copy of the configuration they need when their respective Pods are created, mainly through env vars and ensuring volumes exist. This code is located in workflowpod.go

How is the authentication flow happening?

It depends entirely on your provider and authentication method. In this case, it is delegated to the S3 Go SDK. The configuration is passed through to it.
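For AWS specifically, the SDK's default credential chain is commonly documented as checking, in order: static env credentials, a web-identity token (what IRSA injects via `AWS_ROLE_ARN`/`AWS_WEB_IDENTITY_TOKEN_FILE`), shared config files, then EC2 instance metadata. A rough, hypothetical stdlib-only diagnostic of which source a container would hit (this mirrors the documented order, it is NOT the SDK itself):

```python
import os

def likely_credential_source(env):
    """Approximate which AWS credential source the default chain would pick."""
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "static environment credentials"
    if env.get("AWS_ROLE_ARN") and env.get("AWS_WEB_IDENTITY_TOKEN_FILE"):
        return "web identity token (IRSA)"
    if os.path.exists(os.path.expanduser("~/.aws/credentials")):
        return "shared credentials file"
    return "EC2 instance metadata (IMDS)"

# An IRSA-annotated pod gets these env vars injected by the EKS webhook:
print(likely_credential_source({
    "AWS_ROLE_ARN": "arn:aws:iam::111122223333:role/example",
    "AWS_WEB_IDENTITY_TOKEN_FILE": "/var/run/secrets/eks.amazonaws.com/serviceaccount/token",
}))  # web identity token (IRSA)
```

Running something like this (or just `env | grep AWS_`) inside the wait container and the Server pod shows which branch each one actually lands on; anything that falls through to IMDS is vulnerable to the metadata-service reachability problems discussed above.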

tooptoop4 commented 1 week ago

Have you set https://github.com/argoproj/argo-workflows/blob/f73e7f9f6f70beafbbd0fbb49870336698f612af/docs/workflow-controller-configmap.yaml#L145-L149 ? Or https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/dad9553f87db843b5f0ee7fe19aa547d439b3351/docs/README.md?plain=1#L252 ?

https://github.com/argoproj/argo-workflows/issues/7513 also shows some AWS env variables and links