argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Occasional failure to create s3 client when loading artifacts #10122

Open louisnow opened 1 year ago

louisnow commented 1 year ago

What happened/what you expected to happen?

We expect artifacts to load consistently in the pod. However, we see at least one failure a day, at seemingly random times.

Error (exit code 1): artifact metadata failed to load: failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors

Version

v3.3.9

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Any workflow pod that loads data from an artifact store (S3 in our case).

The underlying AWS SDK could also be causing the issue; see https://github.com/aws/aws-sdk-go/issues/2914 and https://github.com/aws-observability/aws-otel-collector/issues/1286

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow} 

Will add these logs

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded 

Will add these logs

sarabala1979 commented 1 year ago

@louisnow it looks like your AWS credential provider is rate-limiting the requests that supply AWS credentials to your pods. Can you check your provider's logs?
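
If rate limiting is the cause, one common mitigation is to retry the failing call with exponential backoff. The following is a minimal sketch of that idea, not how Argo or the AWS SDK handles it internally. The function name `retry_with_backoff` and its arguments are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Argo or the AWS CLI):
# retry a command, doubling the wait between attempts.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
retry_with_backoff() {
  local max_attempts=$1
  local delay=$2
  shift 2
  local attempt=1
  while true; do
    # Run the command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is exhausted.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_with_backoff 5 1 aws s3 ls` would try the listing up to five times, waiting 1, 2, 4, and 8 seconds between attempts. If the failures disappear under retries, that would be consistent with throttling on the credential side, though it would not prove it.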

louisnow commented 1 year ago

Thanks for the quick response @sarabala1979, I'll check and get back!

gauravkcldcvr commented 1 year ago

@sarabala1979 authentication is done through the node instance role. The maximum session duration for that role is set to 1 hour. Could this be the cause? Our workflows usually run longer than an hour, so when a workflow creates a pod after that point, refreshing the AWS token for the pod might time out and fail.
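
One way to check this would be to look at when the instance-role credentials expire and compare that with the failure times. The sketch below assumes the pod can reach the EC2 instance metadata service (IMDSv2) at its standard address, which may not hold if the hop limit or a network policy blocks it. The function names `fetch_role_credentials` and `extract_expiration` are hypothetical helpers, not part of any AWS tool.

```shell
#!/usr/bin/env bash
# Standard EC2 instance metadata endpoint.
IMDS="http://169.254.169.254/latest"

# Hypothetical helper: fetch the instance-role credential document via IMDSv2.
# Assumes the caller runs on an EC2 node and can reach the metadata service.
fetch_role_credentials() {
  local token role
  # Request a short-lived IMDSv2 session token.
  token=$(curl -s --max-time 2 -X PUT "$IMDS/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
  # The first call lists the role name attached to the instance.
  role=$(curl -s --max-time 2 -H "X-aws-ec2-metadata-token: $token" \
    "$IMDS/meta-data/iam/security-credentials/") || return 1
  # The second call returns the JSON credential document for that role.
  curl -s --max-time 2 -H "X-aws-ec2-metadata-token: $token" \
    "$IMDS/meta-data/iam/security-credentials/$role"
}

# Hypothetical helper: print the "Expiration" value from a credential
# document read on stdin. Prints nothing if the field is absent.
extract_expiration() {
  sed -n 's/.*"Expiration"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

Running `fetch_role_credentials | extract_expiration` on an affected node or pod prints the expiry timestamp of the current credentials. If the failures cluster around that time, the refresh window would be a plausible suspect.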

gauravkcldcvr commented 1 year ago

@sarabala1979 can you look at the above comment and help me understand whether this is a valid reason?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.