Open · HasithaAthukorala opened this issue 7 months ago
I tried this with the latest version available (v3.5.5), but the issue is still there, hence I opened this issue. However, I could work around it using sidecars - https://argo-workflows.readthedocs.io/en/latest/walk-through/sidecars/

Below is the hack I did to solve this (hope this will help someone with the same issue). I implemented a sidecar to do the following things (see the sketch below the NOTE):

1. Modify the kill-command annotation of the `linkerd-proxy` container (`workflows.argoproj.io/kill-cmd-linkerd-proxy: ["/usr/lib/linkerd/linkerd-await", "--shutdown", "--", "sleep", "--help"]`). I can't remove this annotation from the workflow itself, since it applies not only to the template pod where this new sidecar runs but also to the finalizer pod; if I removed it, the workflow would keep running because that pod's `linkerd-proxy` container never dies. Hence I added logic to the sidecar to modify the pod annotation only in the template pod.
2. Use a no-op kill command for the sidecar itself, `workflows.argoproj.io/kill-cmd-{name of the sidecar}: ["echo"]`, so the sidecar keeps running after the `main` container completes.
3. Shut down the `linkerd-proxy` container with `curl -X POST http://localhost:4191/shutdown`.

NOTE: You can take a look at the code here - https://github.com/HasithaAthukorala/argo-sidecar
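Roughly, the workflow-level pieces look like the sketch below. This is a simplified illustration, not the actual manifest: the sidecar name (`linkerd-shutdown`), its image, and the `sleep`-then-`curl` timing are placeholders; only the two `kill-cmd` annotations and the shutdown call on LinkerD's admin port (4191) come from the description above. The real logic lives in the repo linked in the NOTE.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: linkerd-kill-workaround-
spec:
  entrypoint: main
  podMetadata:
    annotations:
      # Kill command Argo runs against the injected linkerd-proxy container.
      # Kept at the workflow level so the finalizer pod can still terminate;
      # the custom sidecar modifies it at runtime in the template pod only.
      workflows.argoproj.io/kill-cmd-linkerd-proxy: '["/usr/lib/linkerd/linkerd-await", "--shutdown", "--", "sleep", "--help"]'
      # No-op kill command for the custom sidecar, so it survives the kill
      # that happens right after the main container completes.
      workflows.argoproj.io/kill-cmd-linkerd-shutdown: '["echo"]'
  templates:
    - name: main
      sidecars:
        - name: linkerd-shutdown        # placeholder name for the custom sidecar
          image: curlimages/curl:8.7.1  # placeholder image
          command: [sh, -c]
          args:
            # Placeholder for the real logic: wait until the outputs have been
            # uploaded, then ask linkerd-proxy to exit over its local admin port.
            - |
              sleep 60
              curl -X POST http://localhost:4191/shutdown
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo 'doing the real work'; sleep 5"]
```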
Noting that this is a follow-up to this Slack thread
> Above mentioned command is executed just after the `main` container is completed, so the `wait` container only has the grace period to complete its upload of output artifacts to S3
I tried root causing this in the thread (see links to the source code there), but couldn't quite find where this happens.
@agilgur5 Is this related to that? I feel like this is the place where all the custom pod annotations are executed (killing all the sidecars). https://github.com/argoproj/argo-workflows/blob/0b625a61a3473f9045d35a6bc0ae8a99db73e2ca/workflow/controller/controller.go#L362-L363
I mean, at a really high level, yes, but you have to go down several layers of the call stack to get to `SignalContainer`, which actually runs the kill command.

The question is why that is occurring, as normally containers are only killed when the Workflow has been stopped or cancelled, or the step has exceeded its deadline. In the codebase, that is when Pods are queued up with the action `terminateContainers` or `killContainers`.
I suspect it is deeper than that; there seems to be an assumption somewhere that when the `main` container finishes, all containers should be killed, and that `wait` executes after `main` is killed.

The solution to that seems non-trivial: the first kill should only take out `main` (and maybe others if you have `sidecars` or a `containerSet`?), and then there needs to be a second kill after `wait` finishes to get the remaining containers.
> I suspect it is deeper than that; there seems to be an assumption somewhere that when the `main` container finishes, all containers should be killed, and that `wait` executes after `main` is killed.

Ok, I found that assumption in the codebase; the comments there basically say exactly that: https://github.com/argoproj/argo-workflows/blob/7c04653543cdfd48e6f2ec96fc4bb87779ed2773/workflow/controller/operator.go#L1504
This originates from the injected sidecar support PR #5383, which did check for the `wait` container specifically. Then a check for the `main` container was added in #6033, and the `wait` check was moved in #8478.

That code still exists, so this is possibly a regression from #8478? I also wonder if the changes from #11484 make that check no longer necessary.
cc @alexec and @terrytangyuan -- this has gone through a lot of iterations, so I am hesitant to change it myself in case fixing one regression causes another. This needs a really careful touch, and both of you have been involved with it already.
Yes, I believe this is happening due to the Sidecar Injection feature - https://argo-workflows.readthedocs.io/en/latest/sidecar-injection/

To solve this issue, I used the above feature as a hack to inject another sidecar to kill the `linkerd-proxy` container. So I could see that all the sidecars are killed just after the `main` container is completed. That is also mentioned in the documentation - https://argo-workflows.readthedocs.io/en/latest/sidecar-injection/#:~:text=It%20can%20be%20impossible%20for%20the%20wait%20container%20to%20terminate%20the%20sidecar.
Your comment is not quite correct, so I will correct it below. It's not related to my root-cause analysis above either; it's just a misunderstanding of the features.
> Yes, I believe this is happening due to the Sidecar Injection feature - https://argo-workflows.readthedocs.io/en/latest/sidecar-injection/

Er, that's the feature that makes using Sidecar Injection with Argo even work in the first place. Without it, Pods hang. There's just a bug in it.
> To solve this issue, I used the above feature as a hack to inject another sidecar

As you wrote previously, you used regular Sidecars: https://argo-workflows.readthedocs.io/en/latest/walk-through/sidecars/. "Injected" means that they're added by a different controller, LinkerD in your case, and not Argo. You didn't "inject" anything other than LinkerD; you just added a regular sidecar.

The "injection" is what makes service mesh sidecars cause bugs with lots of different tools, like Argo. The tools don't know about the injected sidecars, as they didn't add them themselves; something else "injected" them. Often that means the tool now has to support them somehow in a wacky way, which is what Argo tries to do. Injected sidecars are a hack in and of themselves, and so they often require more hacks to get working. That's why Istio's Ambient Mesh is a game changer: it doesn't require injected sidecars, resolving a mountain of issues like these.

You did use the kill command feature, which is normally for injected sidecars, as a hack to allow your new sidecar to keep running. That is a clever workaround.
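For anyone else reading, the mechanism here is the per-container kill command annotation from the sidecar injection docs. Pointing it at a no-op is what lets a user-defined sidecar survive the kill that currently happens right after `main`. A minimal sketch, assuming the sidecar is named `linkerd-shutdown` (that name is just a placeholder):

```yaml
spec:
  podMetadata:
    annotations:
      # Argo executes this command inside the named container to terminate it;
      # "echo" returns immediately, so the custom sidecar is left running.
      workflows.argoproj.io/kill-cmd-linkerd-shutdown: '["echo"]'
```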
> all the sidecars are killed just after the main container is completed. That is also mentioned in the documentation as well - https://argo-workflows.readthedocs.io/en/latest/sidecar-injection/#:~:text=It%20can%20be%20impossible%20for%20the%20wait%20container%20to%20terminate%20the%20sidecar.

That's not what that quote says. It says that the `wait` container itself cannot necessarily kill other containers. Your LinkerD sidecar isn't being killed by the `wait` container; it's being killed by the Argo Controller.

The interesting thing here is that you showed that a sidecar can kill it in this case, since it's available over the local network, which is what you did in your workaround. The `wait` container is itself a sidecar and could potentially do something similar.
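To make "available over the local network" concrete: the shutdown endpoint is a plain HTTP POST on the pod's loopback interface, so any container in the pod can issue it. A minimal sketch as a sidecar entry under a template's `sidecars` list (the container name and image are placeholders; the port and path are the ones used in the workaround above):

```yaml
# Any container in the same pod can ask the injected linkerd-proxy to exit
# by POSTing to its admin port on localhost.
- name: proxy-shutdown-example   # placeholder container name
  image: curlimages/curl:8.7.1   # placeholder image
  command: [sh, -c]
  args: ["curl -X POST http://localhost:4191/shutdown"]
```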
Pre-requisites

- Tested with the latest available version (v3.5.5) and the issue still exists
What happened/what did you expect to happen?
Bug:

- Kill command annotation is set for the `linkerd-proxy` container - `workflows.argoproj.io/kill-cmd-linkerd-proxy: ["/usr/lib/linkerd/linkerd-await", "--shutdown", "--", "sleep", "--help"]`
- The above-mentioned command is executed just after the `main` container is completed, so the `wait` container only has the grace period to complete its upload of output artifacts to S3 through the `linkerd-proxy` container

Ideal Solution:

- The above kill command of the `linkerd-proxy` container should be executed after the `wait` container completes

Version
v3.5.5 (latest)
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
I used the Golang SDK to publish pipelines
Logs from the workflow controller
Logs from in your workflow's wait container