Open wesleyhaws opened 2 months ago
This is a follow-up from Slack. The error is coming from the S3 client so I imagine it's an upstream bug
As I mentioned there, could you provide more environment info such as the ARM instance type you used and anything else (EKS version?)?
NOTE: On the below, replace the <replace_image> with any image that has ARM support; it doesn't really matter as it's happening for everything
Could you still give a public image here that you've confirmed has the issue? For instance, busybox
The easier and more minimal to reproduce, the better.
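Something like this would be enough (a sketch only -- busybox, the trigger template name, and the namespace are placeholders guessed from your later logs, with archiveLogs: true coming from the artifact repository config rather than explicit artifacts):

```yaml
# Hypothetical minimal reproduction; names/namespace are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: trigger-test-
  namespace: argo-workflows
spec:
  entrypoint: trigger
  templates:
    - name: trigger
      container:
        image: busybox
        command: [sh, -c]
        args: ["echo hello"]
```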
argoexec or whatever is running the logic to assume an AWS role to read and write log files to a target S3 bucket.
The init container downloads artifacts and the wait container uploads them. Those both run the argoexec image and are where the error would come from, correct.
You don't have any artifacts defined in your Workflow, but it sounds like you have archiveLogs: true configured? That wasn't in the config snippet you provided; could you provide more of your config for repro purposes?
(Not sure if the wait container is created in another namespace, please advise.)
It's a sidecar to your Workflow, so it depends on where you ran the Workflow -- there's no metadata.namespace specified in your WorkflowTemplate and no corresponding Workflow, so I can't determine that myself.
Since you have a TTL defined, you'll need to check the Pod's logs before the TTL expires. It will probably end with the same error message, but there might be some other useful debugging info before that.
EKS version: 1.29
Node families: ["c6g", "c7g", "c8g"]
It's a hybrid environment, so we launch argo-workflows & argocd (along with all workflows) on Graviton (ARM) instance types. The instances run common roles that are in the "Trusted Relationships" of the artifact role defined above. I know the artifact role is set up properly because it is able to store and retrieve logs from the backend S3 bucket. The EC2 IAM role applied to the nodes is the same between x86 and ARM nodes.
The Helm definition is set up with:
```yaml
argo-workflows-0-42-1:
  fullnameOverride: argo-workflows
  singleNamespace: false
  workflow:
    serviceAccount:
      create: true
  server:
    serviceAccount:
      create: true
    ingress:
      ...
    extraArgs:
      ...
    sso:
      ...
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: <node_group_selector_name>
                  operator: In
                  values:
                    - <target_node>
  useDefaultArtifactRepo: true
  useStaticCredentials: false
  artifactRepository:
    archiveLogs: true
    s3:
      bucket: my-target-s3-artifacts-bucket
      endpoint: s3.amazonaws.com
      region: my-region
      roleARN: arn:aws:iam::<redacted>:role/my-artifact-role
  controller:
    workflowDefaults:
      spec:
        activeDeadlineSeconds: 7200 # workflow is killed after 2h
        ttlStrategy:
          secondsAfterCompletion: 172800 # workflow is deleted after 2d
    metricsConfig:
      enabled: true
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: ....
    affinity:
      ...
```
I have included the pod logs above - but those were for the main container. I'll try to rerun this, get the logs from the sidecar wait container, and edit the above to include them.
The S3 bucket doesn't have anything special other than a policy to prevent uploading unencrypted artifacts. Policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireSecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-target-s3-artifacts-bucket",
        "arn:aws:s3:::my-target-s3-artifacts-bucket/*"
      ],
      "Condition": {
        "NumericLessThan": {
          "s3:TlsVersion": "1.2"
        }
      }
    }
  ]
}
```
It also has the following all enabled:
- Bucket owner enforced
- Block public access to buckets and objects granted through new access control lists (ACLs)
- Block public access to buckets and objects granted through any access control lists (ACLs)
- Block public access to buckets and objects granted through new public bucket or access point policies
- Block public and cross-account access to buckets and objects through any public bucket or access point policies
- Default SSE-S3 encryption
@agilgur5 I have included the wait container logs.
NOTE: Here are the wait container logs when running on x86:
time="2024-09-11T17:05:10.173Z" level=info msg="Starting Workflow Executor" version=v3.5.10
time="2024-09-11T17:05:10.177Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-09-11T17:05:10.177Z" level=info msg="Executor initialized" deadline="2024-09-11 19:05:07 +0000 UTC" includeScriptOutput=false namespace=argo-workflows podName=trigger-test-s92vt templateName=trigger version="&Version{Version:v3.5.10,BuildDate:2024-08-01T05:11:25Z,GitCommit:25829927431d9a0f46d17b72ae74aedb8d700884,GitTag:v3.5.10,GitTreeState:clean,GoVersion:go1.21.12,Compiler:gc,Platform:linux/amd64,}"
time="2024-09-11T17:05:10.192Z" level=info msg="Starting deadline monitor"
time="2024-09-11T17:05:12.192Z" level=info msg="Main container completed" error="<nil>"
time="2024-09-11T17:05:12.192Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-09-11T17:05:12.192Z" level=info msg="No output parameters"
time="2024-09-11T17:05:12.192Z" level=info msg="No output artifacts"
time="2024-09-11T17:05:12.193Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: trigger-test-s92vt/trigger-test-s92vt/main.log"
time="2024-09-11T17:05:12.200Z" level=info msg="Creating minio client using assumed-role credentials" roleArn="arn:aws:iam::<redacted>:role/my-artifact-role"
time="2024-09-11T17:05:12.245Z" level=info msg="Saving file to s3" bucket=my-target-s3-artifacts-bucket endpoint=s3.amazonaws.com key=trigger-test-s92vt/trigger-test-s92vt/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-09-11T17:05:12.317Z" level=info msg="Save artifact" artifactName=main-logs duration=124.538014ms error="<nil>" key=trigger-test-s92vt/trigger-test-s92vt/main.log
time="2024-09-11T17:05:12.317Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-09-11T17:05:12.317Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-09-11T17:05:12.334Z" level=info msg="Alloc=8117 TotalAlloc=17868 Sys=29797 NumGC=5 Goroutines=12"
time="2024-09-11T17:05:12.340Z" level=info msg="Deadline monitor stopped"
Thanks for the extra details!
The Helm definition is set up with
So you don't have workflow archiving set up, just archived logs? Usually they're used together, so just double-checking.
Also, a reminder to use syntax highlighting in markdown code blocks, i.e. "```yaml". I edit them in for a lot of people 😅
NOTE: Here are the wait container logs when running on x86:
Yep those are normal. I see you updated the description with the ARM variant which does have the error in the logs. Thanks!
Tried adding this to the service account this morning:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: argo-workflows-testing-role
  namespace: argo-workflows
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<redacted>:role/my-other-iam-role
automountServiceAccountToken: true
```
This actually makes the job "pass" but has an internal server error when attempting to access the logs.
Logs from the wait container:
time="2024-09-12T14:38:30.724Z" level=info msg="Starting Workflow Executor" version=v3.5.10
time="2024-09-12T14:38:30.728Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-09-12T14:38:30.728Z" level=info msg="Executor initialized" deadline="2024-09-12 16:38:28 +0000 UTC" includeScriptOutput=false namespace=argo-workflows podName=trigger-test-lhp4c templateName=trigger version="&Version{Version:v3.5.10,BuildDate:2024-08-01T05:12:26Z,GitCommit:25829927431d9a0f46d17b72ae74aedb8d700884,GitTag:v3.5.10,GitTreeState:clean,GoVersion:go1.21.12,Compiler:gc,Platform:linux/arm64,}"
time="2024-09-12T14:38:30.745Z" level=info msg="Starting deadline monitor"
time="2024-09-12T14:38:32.745Z" level=info msg="Main container completed" error="<nil>"
time="2024-09-12T14:38:32.745Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-09-12T14:38:32.745Z" level=info msg="No output parameters"
time="2024-09-12T14:38:32.745Z" level=info msg="No output artifacts"
time="2024-09-12T14:38:32.745Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: trigger-test-lhp4c/trigger-test-lhp4c/main.log"
time="2024-09-12T14:38:32.754Z" level=info msg="Creating minio client using assumed-role credentials" roleArn="arn:aws:iam::<redacted>:role/my-artifact-role"
time="2024-09-12T14:38:32.844Z" level=info msg="Saving file to s3" bucket=my-target-s3-artifacts-bucket endpoint=s3.amazonaws.com key=trigger-test-lhp4c/trigger-test-lhp4c/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-09-12T14:38:32.955Z" level=info msg="Save artifact" artifactName=main-logs duration=210.087779ms error="<nil>" key=trigger-test-lhp4c/trigger-test-lhp4c/main.log
time="2024-09-12T14:38:32.955Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-09-12T14:38:32.955Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-09-12T14:38:32.969Z" level=info msg="Alloc=8054 TotalAlloc=17586 Sys=29029 NumGC=5 Goroutines=12"
time="2024-09-12T14:38:32.975Z" level=info msg="Deadline monitor stopped"
From the above logs, it appears to have stored the logs in S3 properly. I checked, and yes, it is there. So then I looked at the argo-workflows-server pod logs:
time="2024-09-12T14:40:00.729Z" level=info msg="Get artifact file" artifactName=main-logs namespace=argo-workflows nodeId=trigger-test-lhp4c workflowName=trigger-test-lhp4c
time="2024-09-12T14:40:00.742Z" level=info msg="Creating minio client using assumed-role credentials" roleArn="arn:aws:iam::<redacted>:role/my-artifact-role"
time="2024-09-12T14:40:03.881Z" level=info msg="Check if directory" artifactName=main-logs duration=3.142196956s error="NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors" key=trigger-test-lhp4c/trigger-test-lhp4c/main.log
time="2024-09-12T14:40:03.881Z" level=error msg="Artifact Server returned internal error" error="NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"
time="2024-09-12T14:40:03.881Z" level=info duration=3.157595799s method=GET path=/artifact-files/argo-workflows/workflows/trigger-test-lhp4c/trigger-test-lhp4c/outputs/main-logs size=22 status=500
time="2024-09-12T14:40:05.696Z" level=info msg="selected SSO RBAC service account for user" email=<redacted> loginServiceAccount=<redacted> serviceAccount=<redacted> ssoDelegated=false ssoDelegationAllowed=false subject=00u1xd1tgaNMdgPNo697
time="2024-09-12T14:40:05.696Z" level=info msg="Get artifact file" artifactName=main-logs namespace=argo-workflows nodeId=trigger-test-lhp4c workflowName=trigger-test-lhp4c
time="2024-09-12T14:40:05.704Z" level=info msg="Creating minio client using assumed-role credentials" roleArn="arn:aws:iam::<redacted>:role/my-artifact-role"
time="2024-09-12T14:40:06.165Z" level=info duration="82.058µs" method=GET path=index.html size=487 status=0
time="2024-09-12T14:40:08.853Z" level=info msg="Check if directory" artifactName=main-logs duration=3.152147996s error="NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors" key=trigger-test-lhp4c/trigger-test-lhp4c/main.log
time="2024-09-12T14:40:08.853Z" level=error msg="Artifact Server returned internal error" error="NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"
time="2024-09-12T14:40:08.853Z" level=info duration=3.161819171s method=GET path=/artifact-files/argo-workflows/workflows/trigger-test-lhp4c/trigger-test-lhp4c/outputs/main-logs size=22 status=500
time="2024-09-12T14:40:26.164Z" level=info duration="74.739µs" method=GET path=index.html size=487 status=0
Now, as you can see from the above, the server fails to retrieve these logs even though it seems to know what role to assume? Weird. Is there a certain configuration on the server that I should try? Something about the behavior here seems to change between x86 and ARM. Very strange.
I just found this in the docs. It seems like I need to add this same EKS annotation to the server pod: https://argo-workflows.readthedocs.io/en/latest/configure-artifact-repository/#aws-s3-irsa
I'm a little unsure of the effect this might have on our workflows, and also a little unsure what to put here, but I'll play around with it for a bit to figure something out.
I was looking upstream and did find https://github.com/aws/aws-sdk-go/issues/2914 which has a number of workarounds and happens to have a back link from https://github.com/argoproj/argo-workflows/issues/10122 too.
My guess is that some configuration is being interpreted differently on ARM vs amd64, maybe an encoding difference or something 🤔
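For what it's worth, one workaround that commonly comes up for "NoCredentialProviders" with IRSA (hedged -- not necessarily the one from that issue) is making sure the projected web-identity token is readable by the container's non-root user and that the region is set explicitly. Roughly, on whichever Pod hits the error:

```yaml
# Sketch of a Pod spec fragment, not chart values; 65534 is a placeholder GID.
spec:
  securityContext:
    fsGroup: 65534            # lets a non-root user read the projected token file
  containers:
    - name: argo-server       # or the Workflow Pod's wait container
      env:
        - name: AWS_REGION    # some SDK code paths want the region set explicitly
          value: my-region
```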
It seems like I need to add this same eks annotation to the server pod
Yes, your Server needs to authenticate to S3 as well if you want to view artifacts in the UI, although it only needs read access. That doc is being worked on too as it's overbroad right now while simultaneously not having enough details: https://github.com/argoproj/argo-workflows/pull/12467#discussion_r1752287700 / https://github.com/argoproj/argo-workflows/pull/13425#pullrequestreview-2214101470
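For reference, a sketch of the read-only S3 permissions the Server would need (bucket name copied from your config above; the exact action set may vary with your setup):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-target-s3-artifacts-bucket",
        "arn:aws:s3:::my-target-s3-artifacts-bucket/*"
      ]
    }
  ]
}
```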
🤔 I have tried:
```yaml
server:
  podAnnotations:
    "eks.amazonaws.com/role-arn": arn:aws:iam::<redacted>:role/my-artifact-role
```
With no luck. I have edited the trust policy of that role to include the RBAC service account. I also tried it with the service account argo-workflows-testing-role. Still the same error :(
EDIT: Here are the configurations I have tried in the trust policy for the my-artifact-role role:
```json
{
  "Effect": "Allow",
  "Principal": {
    "Federated": "arn:aws:iam::<redacted>:oidc-provider/oidc.eks.<redacted>.amazonaws.com/id/<redacted>"
  },
  "Action": "sts:AssumeRoleWithWebIdentity",
  "Condition": {
    "StringEquals": {
      "oidc.eks.<redacted>.amazonaws.com/id/<redacted>:sub": "system:serviceaccount:argo-workflows:argo-workflows-server"
    }
  }
},
```
I have tried the service account associated with the Argo Workflows Server pod (argo-workflows-server) and the service account that RBAC says it's giving me in the logs.
EDIT2: After re-reading the above, I think I'm having a brain fart. This annotation needs to go on the Argo Workflows Server's service account, NOT the pod. I'll re-adjust.
EDIT3: Applied this annotation to both the RBAC service account and the server service account with no change; same error.
```yaml
server:
  serviceAccount:
    annotations:
      "eks.amazonaws.com/role-arn": arn:aws:iam::<redacted>:role/my-artifact-role
```
I think I need some more background on how authentication is being done. There is a lot of confusing setup happening here between the init, server, controller, wait, and main containers.
How is the authentication flow happening?
I have tried many different configurations above and still can't seem to get things working. If I had a little bit more background I think that might help me debug some more.
There is a lot of confusing setup happening here between the init, server, controller, wait, and main containers.
You may have over-complicated it 😅
The init and wait containers are responsible for input and output artifacts, respectively. The main container just shares the same filesystem as the other two. This part I described previously above.
The executor works this way so that it (usually) doesn't require changes to the user container in order to interact with Argo (including things like artifacts and deadlines/timeouts, stops, etc)
The Server just reads artifacts if you use them in the UI or via API, as described above.
The Controller does not interact with artifacts on its own, but reads artifact configurations and makes sure that the init, wait, and artifact GC containers all have a copy of the configuration they need when their respective Pods are created, mainly through env vars and ensuring volumes exist. This code is located in workflowpod.go.
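If it helps your debugging, you can inspect what the Controller injected into a given Workflow Pod's wait container, e.g. (pod name is a placeholder):

```bash
# Prints the env vars on the wait container of one of your test Pods
kubectl get pod trigger-test-lhp4c -n argo-workflows \
  -o jsonpath='{.spec.containers[?(@.name=="wait")].env}'
```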
How is the authentication flow happening?
It depends entirely on your provider and authentication method. In this case, it is delegated to the S3 Go SDK. The configuration is passed through to it.
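For illustration only -- this is not Argo's actual code, just the general shape of assuming a role on top of the AWS SDK's default credential chain, which is where "NoCredentialProviders" comes from when the chain (env vars, shared config, IRSA web-identity token, EC2 instance metadata) yields nothing:

```go
// Sketch with aws-sdk-go v1; the role ARN, region, and bucket are placeholders.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials/stscreds"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Base session: resolves credentials via the SDK's default provider chain.
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("my-region")))

	// Assume the artifact role using the base credentials.
	creds := stscreds.NewCredentials(sess, "arn:aws:iam::123456789012:role/my-artifact-role")

	// Use the assumed-role credentials for S3 calls.
	svc := s3.New(sess, &aws.Config{Credentials: creds})
	out, err := svc.ListObjectsV2(&s3.ListObjectsV2Input{
		Bucket: aws.String("my-target-s3-artifacts-bucket"),
	})
	if err != nil {
		log.Fatal(err) // a credential-chain failure surfaces here
	}
	fmt.Println("objects:", len(out.Contents))
}
```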
Have you set https://github.com/argoproj/argo-workflows/blob/f73e7f9f6f70beafbbd0fbb49870336698f612af/docs/workflow-controller-configmap.yaml#L145-L149? Or https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/dad9553f87db843b5f0ee7fe19aa547d439b3351/docs/README.md?plain=1#L252?
https://github.com/argoproj/argo-workflows/issues/7513 also shows some AWS env variables and links
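For comparison while debugging, these are the standard web-identity env vars the EKS pod identity webhook normally injects when a ServiceAccount carries the role-arn annotation (values are placeholders, and I'm not certain these are exactly the ones discussed in that issue):

```yaml
env:
  - name: AWS_ROLE_ARN
    value: arn:aws:iam::<redacted>:role/my-artifact-role
  - name: AWS_WEB_IDENTITY_TOKEN_FILE
    value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
  - name: AWS_REGION
    value: my-region
```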
Pre-requisites
- I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened? What did you expect to happen?
When running a working workflow template on an ARM-based node, vs. an x86 node, it will fail to retrieve logs and fail the workflow with the error:
The workflow launches and runs successfully, but it fails when the logs are being uploaded to S3.
NOTE: This works 100% fine with x86 architectures. I tried it with no configuration changes between ARM and x86 (amd64) node architectures.
To me, this appears to be an issue with the configuration in argoexec or whatever is running the logic to assume an AWS role to read and write log files to a target S3 bucket.
Logs from the pod itself:
Version(s)
Helm chart: "0.42.1", quay.io/argoproj/argocli:v3.5.10, quay.io/argoproj/argocli@sha256:ccc55a32c9739f6d3e7649fe8b896ea90980fae95c4d318a43610dc80d20ddf9
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
kubectl logs deployment/argo-workflows-workflow-controller --tail=100 | grep trigger-test
Logs from in your workflow's wait container
kubectl logs -n argo-workflows trigger-test-bnzpr -c wait
k logs deployment/argo-workflows-server --tail=300 | grep trigger-test