Open ljyanesm opened 2 months ago
Related Slack conversation: https://cloud-native.slack.com/archives/C01QW9QSSSK/p1719237267885449
https://github.com/argoproj/pkg/blob/235a5432ec982969e2e1987e66458b5a44c2ee6f/s3/s3.go#L245 - fi is being used without checking err. This should be fixed in pkg so it reports err rather than crashing.
@ljyanesm, is there a possibility that some of the artifacts or the directories containing them have unusual permissions or contents or are being modified whilst upload is being attempted?
Thanks for adding this check. I am keen on, if possible, running a version of ArgoWF with only these changes in place. Do you have any advice for doing so?
I've been having a careful look through the workflows and have found:
To test this you'd need to build a custom argoexec image. Having checked out the argo-workflows code:
go get github.com/argoproj/pkg@s3-err-check
make argoexec-image
You'll then need to push this image to somewhere that your cluster can pull from and set up your workflow controller to use it with --executor-image.
I suggest we chat in slack if you're having problems with this.
@Joibel we usually upload a test image somewhere (e.g. personal DockerHub) if folks can run a test
We have made some changes to the workflows where tasks are fully independent. The error was most likely related to some delete operations on the path that was being uploaded as an artifact.
This was corrected by moving these files to a different location only available to the pod running the task.
@Joibel, @agilgur5, do you think the changes to the pkg repo, which should help identify the issue more readily next time, are enough to close the ticket?
(PR to update pkg in this repo still necessary)
Pre-requisites
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened/what did you expect to happen?
One of the workflow tasks failed with the following stacktrace:
The expected behaviour was for the Pod to complete successfully and for all the artifacts to be deposited correctly.
We have not tested using :latest, as this issue happens in about 1 in 1000 pods within our production environment and it is not reproducible.
Version
v3.5.8
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container