argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Segmentation fault in `wait` container during Artifact upload #13248

Open ljyanesm opened 2 months ago

ljyanesm commented 2 months ago

Pre-requisites

What happened/what did you expect to happen?

One of the workflow tasks failed with the following stacktrace:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1e3455f]
goroutine 389 [running]:
github.com/argoproj/pkg/s3.generatePutTasks.func1.1({0xc000df0400, 0x72}, {0x0, 0x0}, {0xc0002b3fa0?, 0xffffffffffffff9c?})
    /go/pkg/mod/github.com/argoproj/pkg@v0.13.7-0.20230901113346-235a5432ec98/s3/s3.go:245 +0xdf
path/filepath.walk({0xc00108dd50, 0x63}, {0x37f0248, 0xc000dcf380}, 0xc0002b3fa0)
    /usr/local/go/src/path/filepath/path.go:497 +0x1d4
path/filepath.walk({0xc0009e95c0, 0x58}, {0x37f0248, 0xc000dcf2b0}, 0xc0002b3fa0)
    /usr/local/go/src/path/filepath/path.go:501 +0x257
path/filepath.walk({0xc0009e94a0, 0x52}, {0x37f0248, 0xc000dcf1e0}, 0xc0002b3fa0)
    /usr/local/go/src/path/filepath/path.go:501 +0x257
path/filepath.walk({0xc000eeb5e0, 0x42}, {0x37f0248, 0xc000c19ba0}, 0xc0002b3fa0)
    /usr/local/go/src/path/filepath/path.go:501 +0x257
path/filepath.Walk({0xc000eeb5e0, 0x42}, 0xc0004dd7a0)
    /usr/local/go/src/path/filepath/path.go:572 +0x66
github.com/argoproj/pkg/s3.generatePutTasks.func1()
    /go/pkg/mod/github.com/argoproj/pkg@v0.13.7-0.20230901113346-235a5432ec98/s3/s3.go:243 +0x68
created by github.com/argoproj/pkg/s3.generatePutTasks in goroutine 1
    /go/pkg/mod/github.com/argoproj/pkg@v0.13.7-0.20230901113346-235a5432ec98/s3/s3.go:242 +0xe5

The expected behaviour was for the Pod to complete successfully and for all artifacts to be uploaded correctly.

We have not tested using :latest, as this issue occurs in roughly 1 in 1000 pods in our production environment and is not reliably reproducible.

Version

v3.5.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't submit a workflow that uses private images.

The issue has only been observed infrequently; I do not have a reproducible example.

Logs from the workflow controller

N/A

Logs from your workflow's wait container

The only relevant logs are shared in the summary.

Joibel commented 2 months ago

Related Slack conversation: https://cloud-native.slack.com/archives/C01QW9QSSSK/p1719237267885449

Joibel commented 2 months ago

https://github.com/argoproj/pkg/blob/235a5432ec982969e2e1987e66458b5a44c2ee6f/s3/s3.go#L245 - `fi` is being used without checking `err`. This should be fixed in pkg so it reports the error rather than crashing.
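A minimal sketch of the check described above, assuming a `filepath.Walk`-based upload loop like the one in `pkg/s3` (the `putTasks` function and the "queue upload" step here are placeholders, not the real pkg code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// putTasks sketches the corrected walk callback: check err before
// touching fi, because fi is nil whenever err is non-nil (e.g. the
// file vanished between the directory read and the stat). The buggy
// version calls fi.IsDir() first, which is the nil dereference in
// the stack trace above.
func putTasks(root string) error {
	return filepath.Walk(root, func(path string, fi os.FileInfo, err error) error {
		if err != nil {
			// Report the error instead of dereferencing a nil fi.
			return fmt.Errorf("stat %s: %w", path, err)
		}
		if fi.IsDir() {
			return nil
		}
		fmt.Println("queue upload:", path) // stand-in for generating a put task
		return nil
	})
}

func main() {
	dir, err := os.MkdirTemp("", "put-tasks")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "artifact.txt"), []byte("x"), 0o644); err != nil {
		panic(err)
	}
	if err := putTasks(dir); err != nil {
		fmt.Println("error:", err)
	}
}
```

With the err check first, a file that disappears mid-walk surfaces as a returned error rather than a SIGSEGV in the wait container.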

@ljyanesm, is there a possibility that some of the artifacts or the directories containing them have unusual permissions or contents or are being modified whilst upload is being attempted?

ljyanesm commented 2 months ago

Thanks for adding this check. If possible, I am keen to run a version of ArgoWF with only these changes in place. Do you have any advice for doing so?

I've been having a careful look through the workflows and have found:

Joibel commented 2 months ago

To test this you'd need to build a custom argoexec image. Having checked out the argo-workflows code:

go get github.com/argoproj/pkg@s3-err-check
make argoexec-image

You'll then need to push this image to a registry your cluster can pull from and configure your workflow controller to use it with --executor-image.
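A rough sketch of that push-and-configure step, assuming a hypothetical registry/tag (`registry.example.com/myorg/argoexec:s3-err-check`) and an install in the `argo` namespace; adjust names to your environment:

```shell
# Tag and push the locally built argoexec image (registry/tag are placeholders).
docker tag argoproj/argoexec:latest registry.example.com/myorg/argoexec:s3-err-check
docker push registry.example.com/myorg/argoexec:s3-err-check

# Append --executor-image to the workflow-controller's args so new
# workflow pods use the custom wait/init containers.
kubectl -n argo patch deployment workflow-controller --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/args/-",
   "value": "--executor-image=registry.example.com/myorg/argoexec:s3-err-check"}]'
```

Only pods created after the controller restarts will pick up the new executor image; already-running workflows keep the old one.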

I suggest we chat in Slack if you're having problems with this.

agilgur5 commented 2 months ago

@Joibel we usually upload a test image somewhere (e.g. a personal Docker Hub account) so folks can run a test

ljyanesm commented 1 month ago

We have made some changes to the workflows so that tasks are fully independent. The error was most likely related to delete operations on the path that was being uploaded as an artifact.

This was corrected by moving these files to a different location, accessible only to the pod running the task.
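The concurrent-delete explanation above can be demonstrated deterministically: if a sibling file is removed while `filepath.Walk` is in progress, the callback is invoked with a non-nil err and a nil FileInfo, which is exactly the condition that crashed `s3.go:245`. This is a self-contained sketch (the file names are arbitrary, not from the real workflow):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dir, err := os.MkdirTemp("", "race-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	for _, name := range []string{"a.txt", "b.txt"} {
		if err := os.WriteFile(filepath.Join(dir, name), []byte("x"), 0o644); err != nil {
			panic(err)
		}
	}

	// Walk reads the directory listing up front, then stats each entry
	// as it visits it. Deleting b.txt while visiting a.txt makes the
	// later lstat of b.txt fail, so the callback sees err != nil and
	// info == nil.
	filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			// Calling info.IsDir() here would be the nil dereference
			// from the stack trace; report instead.
			fmt.Printf("visit error for %s: %v (info == nil: %t)\n",
				filepath.Base(path), err, info == nil)
			return nil
		}
		if filepath.Base(path) == "a.txt" {
			// Simulate another process deleting a file mid-walk.
			os.Remove(filepath.Join(dir, "b.txt"))
		}
		return nil
	})
}
```

Moving artifact paths to a pod-private location, as described above, removes the writer that triggers this condition.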

@Joibel, @agilgur5, do you think the changes to the pkg repo, which should help identify the issue more readily next time, are enough to close the ticket?

agilgur5 commented 1 month ago

(A PR to update pkg in this repo is still necessary.)