
Argo fails to upload larger output artifacts with sidecar service mesh like LinkerD #12919

Open HasithaAthukorala opened 7 months ago

HasithaAthukorala commented 7 months ago

Pre-requisites

What happened/what did you expect to happen?

Bug:

Ideal Solution:

Version

v3.5.5 (latest)

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

I used the Golang SDK to publish pipelines

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
HasithaAthukorala commented 7 months ago

I tried this with the latest version available (v3.5.5), but the issue is still there, hence I opened this issue. However, I was able to work around it using sidecars - https://argo-workflows.readthedocs.io/en/latest/walk-through/sidecars/

Below is the hack I used to work around this (hope it helps someone with the same issue):

I implemented a sidecar that does the following things:

  1. Modify the pod annotation for Linkerd - I have configured the workflow to add a pod annotation that tells Argo how to kill the linkerd-proxy container: workflows.argoproj.io/kill-cmd-linkerd-proxy: ["/usr/lib/linkerd/linkerd-await", "--shutdown", "--", "sleep", "--help"]. I can't remove this annotation from the workflow, because it applies not only to the template pod where this new sidecar runs but also to the finalizer pod; if I removed it, the workflow would keep running because the linkerd-proxy container would never die. Hence I added logic to the sidecar to modify the pod annotation only in the template pod.
  2. Modify the pod annotation for this new sidecar - Sidecars are killed just after the main container completes, but those kill commands can be overridden with another pod annotation. Here is the annotation I add: workflows.argoproj.io/kill-cmd-{name of the sidecar}: ["echo"]. (A sketch of setting both annotations via the Go SDK follows this list.)
  3. Watch the status of the wait container.
  4. Call the shutdown hook of the linkerd-proxy once the wait container has completed: curl -X POST http://localhost:4191/shutdown
  5. Exit, so that the sidecar's own pod can complete.
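For illustration, this is roughly how both annotations could be set through spec.podMetadata when building the Workflow with the Go SDK (the sidecar name mesh-killer and the buildWorkflow helper are placeholders, not the real code):

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

// buildWorkflow sketches how the two kill-cmd annotations from steps 1 and 2
// could be set via spec.podMetadata. "mesh-killer" is a placeholder name for
// the custom sidecar described above.
func buildWorkflow() *wfv1.Workflow {
	return &wfv1.Workflow{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "linkerd-example-"},
		Spec: wfv1.WorkflowSpec{
			// podMetadata annotations are applied to every pod the workflow
			// creates, which is why step 1 patches the Linkerd annotation per
			// pod at runtime instead of removing it here.
			PodMetadata: &wfv1.Metadata{
				Annotations: map[string]string{
					// Tell Argo how to kill the injected linkerd-proxy container.
					"workflows.argoproj.io/kill-cmd-linkerd-proxy": `["/usr/lib/linkerd/linkerd-await", "--shutdown", "--", "sleep", "--help"]`,
					// No-op kill command so our own sidecar outlives main (step 2).
					"workflows.argoproj.io/kill-cmd-mesh-killer": `["echo"]`,
				},
			},
		},
	}
}

func main() {
	wf := buildWorkflow()
	fmt.Println(wf.Spec.PodMetadata.Annotations)
}
```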

NOTE: You can take a look at the code here - https://github.com/HasithaAthukorala/argo-sidecar
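
The linked repo has the actual implementation; here is a minimal sketch of steps 3-5, assuming POD_NAME and POD_NAMESPACE are exposed to the sidecar via the Downward API (those env var names, like the rest of this sketch, are illustrative):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("in-cluster config: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("clientset: %v", err)
	}

	// Assumed to be injected via the Downward API.
	name, ns := os.Getenv("POD_NAME"), os.Getenv("POD_NAMESPACE")

	// Step 3: poll our own pod until the wait container has terminated,
	// i.e. the output artifacts have been uploaded.
	for {
		pod, err := client.CoreV1().Pods(ns).Get(context.Background(), name, metav1.GetOptions{})
		if err != nil {
			log.Fatalf("get pod: %v", err)
		}
		terminated := false
		for _, cs := range pod.Status.ContainerStatuses {
			if cs.Name == "wait" && cs.State.Terminated != nil {
				terminated = true
			}
		}
		if terminated {
			break
		}
		time.Sleep(2 * time.Second)
	}

	// Step 4: call linkerd-proxy's shutdown hook over the pod-local network.
	resp, err := http.Post("http://localhost:4191/shutdown", "text/plain", nil)
	if err != nil {
		log.Fatalf("shutdown linkerd-proxy: %v", err)
	}
	resp.Body.Close()

	// Step 5: returning from main exits this sidecar, letting the pod complete.
}
```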

agilgur5 commented 7 months ago

Noting that this is a follow-up to this Slack thread

  • The above-mentioned command is executed just after the main container is completed, so the wait container only has the grace period to finish uploading its output artifacts to S3

I tried root causing this in the thread (see links to the source code there), but couldn't quite find where this happens.

HasithaAthukorala commented 7 months ago

@agilgur5 Is this related to that? I feel like this is the place where all the custom pod annotations are executed (killing all the sidecars). https://github.com/argoproj/argo-workflows/blob/0b625a61a3473f9045d35a6bc0ae8a99db73e2ca/workflow/controller/controller.go#L362-L363

agilgur5 commented 7 months ago

I mean, at a really high level, yes, but you have to go down several layers of the call stack to get to SignalContainer, which actually runs the kill command.

The question is why that is occurring, as normally containers are only killed when the Workflow has been stopped or cancelled, or the step has exceeded its deadline. In the codebase, that is when Pods are queued up with the action terminateContainers or killContainers.

I suspect it is deeper than that; there seems to be an assumption somewhere that when the main container finishes, all containers should be killed and that wait executes after main is killed.

The solution to that seems non-trivial: the first kill should only take out main (and maybe others if you have sidecars or a containerSet?), and then there needs to be a second kill after wait finishes to get the remaining containers.

agilgur5 commented 7 months ago

I suspect it is deeper than that; there seems to be an assumption somewhere that when the main container finishes, all containers should be killed and that wait executes after main is killed.

Ok, I found that assumption in the codebase, which is basically stated in the comments: https://github.com/argoproj/argo-workflows/blob/7c04653543cdfd48e6f2ec96fc4bb87779ed2773/workflow/controller/operator.go#L1504. This originates from the injected sidecar support PR #5383, which did check for the wait container specifically. Then a check for the main container was added in #6033. Then the wait check was moved in #8478.

That code still exists, so this is possibly a regression from #8478? I also wonder if the changes from #11484 make that no longer necessary.

cc @alexec and @terrytangyuan -- this has gone through a lot of iterations, so I am hesitant to change it myself, as that might fix one regression while causing another. This needs a really careful touch, and both of you have been involved with it already.

HasithaAthukorala commented 7 months ago

Yes, I believe this is happening due to the Sidecar Injection feature - https://argo-workflows.readthedocs.io/en/latest/sidecar-injection/

To solve this issue, I used the above feature as a hack to inject another sidecar to kill the linkerd-proxy container. So I could see that all the sidecars are killed just after the main container completes. That is also mentioned in the documentation - https://argo-workflows.readthedocs.io/en/latest/sidecar-injection/#:~:text=It%20can%20be%20impossible%20for%20the%20wait%20container%20to%20terminate%20the%20sidecar.

agilgur5 commented 7 months ago

Your comment is not quite correct, so I will correct it below. It's not related to my root cause analysis above either; it's just a misunderstanding of the features.

Yes, I believe this is happening due to the Sidecar Injection feature - https://argo-workflows.readthedocs.io/en/latest/sidecar-injection/

Er, that's the feature that makes using Sidecar Injection with Argo even work in the first place. Without it, Pods hang.

There's just a bug in it.

To solve this issue, I used the above feature as a hack to inject another sidecar

As you wrote previously, you used regular Sidecars: https://argo-workflows.readthedocs.io/en/latest/walk-through/sidecars/. "Injected" means that they're added by a different controller (LinkerD in your case), not by Argo. You didn't "inject" anything other than LinkerD; you just added a regular sidecar.

The "injection" is what makes service mesh sidecars cause bugs with lots of different tools, like Argo. The tools don't know about the injected sidecar as they didn't add them themselves, something else "injected" them. Often that means the tool now has to support them somehow in a wacky way, which is what Argo tries to do. Injected sidecars are a hack in and of themselves, and so often require more hacks to get working. That's why Istio's Ambient Mesh is a game changer, as it doesn't require injected sidecars, resolving a mountain of issues like these.

You did use the kill command feature, which is normally for injected sidecars, as a hack to allow your new sidecar to keep running. That is a clever workaround.

all the sidecars are killed just after the main container completes. That is also mentioned in the documentation - https://argo-workflows.readthedocs.io/en/latest/sidecar-injection/#:~:text=It%20can%20be%20impossible%20for%20the%20wait%20container%20to%20terminate%20the%20sidecar.

That's not what that quote says. It says that the wait container itself cannot necessarily kill other containers. Your LinkerD sidecar isn't being killed by the wait container; it's being killed by the Argo Controller.

The interesting thing here is that you showed a sidecar can kill it in this case, since it's available over the local network; that's what you did in your workaround. The wait container is itself a sidecar and could potentially do something similar.
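
If the wait container were to do that, its teardown could be something like this hypothetical helper (the port and path are LinkerD's shutdown endpoint from the workaround above; shutdownInjectedProxy is not an existing function in the codebase):

```go
package main

import (
	"fmt"
	"net/http"
)

// shutdownInjectedProxy is a hypothetical teardown step the wait container
// could run after artifact upload: ask the injected proxy to exit over the
// pod-local network, just like the workaround's sidecar does.
func shutdownInjectedProxy() error {
	resp, err := http.Post("http://localhost:4191/shutdown", "text/plain", nil)
	if err != nil {
		return fmt.Errorf("shutdown linkerd-proxy: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}
```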