Open bencompton opened 1 year ago
I discovered a workaround: have the main container write signal files containing SIGTERM for the sidecars, mimicking what the wait container does. I updated the main container with the following code:
sleep 10
# "MTU=" base64-decodes to "15", i.e. SIGTERM's signal number
echo MTU= | base64 -d > /var/run/argo/ctr/sidecar-1/signal
echo MTU= | base64 -d > /var/run/argo/ctr/sidecar-2/signal
echo MTU= | base64 -d > /var/run/argo/ctr/sidecar-3/signal
With this workaround, 10 instances of the example workflow finish within a few minutes instead of 20m, with most pods having a duration of < 20s and a few having a duration slightly over 1m.
It would appear that without this hack, something is preventing the wait container from writing the signal files in a timely manner after the main container completes.
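The workaround above can be generalized to any number of sidecars. Here is a minimal sketch, assuming emissary's signal-file convention (writing a signal number to `/var/run/argo/ctr/<container>/signal` asks emissary to deliver that signal); the `signal_sidecars` helper name and the directory-walking approach are mine, not part of Argo:

```shell
# Write SIGTERM (15) into every emissary-managed container's signal file,
# skipping the main container. "15" is the same two bytes that
# `echo MTU= | base64 -d` produces.
signal_sidecars() {
  ctr_dir="${1:-/var/run/argo/ctr}"
  for d in "$ctr_dir"/*/; do
    [ -d "$d" ] || continue
    name=$(basename "$d")
    [ "$name" = "main" ] && continue  # don't signal the main container itself
    printf '15' > "${d}signal"
  done
}
```

Running this (after the main workload finishes) from the main container avoids hard-coding each sidecar's name in the template.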
Thanks a lot for posting this issue w/ logs, a repro workflow, and even proposing a workaround for the time being. ❤️
We're doing some digging of our own into workflow hanging that might be related to this (see: https://github.com/argoproj/argo-workflows/issues/10491). We'll post an update on here if we run into a cause or fix.
Also, FYI for SIGTERM related issues: https://github.com/argoproj/argo-workflows/issues/10518 PR: https://github.com/argoproj/argo-workflows/pull/10523
Thanks for the info @caelan-io! Hmm, I wonder if #10523 fixes this issue.
If you weren't having this issue until 3.4.5, then that will likely fix it. If you're able to test that PR or master with your repro workflow, please let us know the results.
Hey @bencompton - have you had a chance to test out if #10523 fixes this issue? If it does, we'll go ahead and close it and see when we can get another patch release out
My team just updated to 3.4.7 and I re-tested. Unfortunately, I’m still seeing the same issue with the sidecars not terminating in a timely manner. In my team’s workflows, I saw pods with the main containers completing after 20m and continuing until hitting our 1h deadline while the sidecars failed to stop. When re-testing with the minimal reproduction above, I see the same results as before:
…
sidecar-test-vsf8r-sidecar-test-3583951812 3/5 NotReady 0 14m
sidecar-test-vsf8r-sidecar-test-3593789946 3/5 NotReady 0 13m
sidecar-test-vsf8r-sidecar-test-359583538 3/5 NotReady 0 13m
sidecar-test-vsf8r-sidecar-test-364746458 3/5 NotReady 0 12m
sidecar-test-vsf8r-sidecar-test-3674744714 3/5 NotReady 0 13m
sidecar-test-vsf8r-sidecar-test-3714523468 3/5 NotReady 0 13m
sidecar-test-vsf8r-sidecar-test-3745816844 3/5 NotReady 0 14m
sidecar-test-vsf8r-sidecar-test-3748422672 3/5 NotReady 0 14m
sidecar-test-vsf8r-sidecar-test-3749200898 3/5 NotReady 0 13m
sidecar-test-vsf8r-sidecar-test-3758682656 3/5 NotReady 0 14m
sidecar-test-vsf8r-sidecar-test-3773040474 3/5 NotReady 0 12m
sidecar-test-vsf8r-sidecar-test-3804027916 3/5 NotReady 0 13m
sidecar-test-vsf8r-sidecar-test-3866083034 3/5 NotReady 0 13m
sidecar-test-vsf8r-sidecar-test-386948370 3/5 NotReady 0 13m
sidecar-test-vsf8r-sidecar-test-3873634500 3/5 NotReady 0 13m
sidecar-test-vsf8r-sidecar-test-3876696458 3/5 NotReady 0 13m
sidecar-test-vsf8r-sidecar-test-3880561570 3/5 NotReady 0 14m
sidecar-test-vsf8r-sidecar-test-3926852610 3/5 NotReady 0 12m
…
main:
Container ID: containerd://2f38531a9cd7e0d9213240be0899692cef963d1c7ced83cc926e7baff156f4fb
Image: debian:bullseye-slim
Image ID: docker.io/library/debian@sha256:f4da3f9b18fc242b739807a0fb3e77747f644f2fb3f67f4403fafce2286b431a
Port: <none>
Host Port: <none>
Command:
/var/run/argo/argoexec
emissary
--loglevel
info
--log-format
text
--
bash
-c
Args:
sleep 10
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 03 May 2023 11:38:35 -0600
Finished: Wed, 03 May 2023 11:38:45 -0600
Ready: False
…
sidecar-1:
Container ID: containerd://afedb72e64ef9ec369e3bf2a89382216e7b83d030bb28310b2d9c41096220a4f
Image: debian:bullseye-slim
Image ID: docker.io/library/debian@sha256:f4da3f9b18fc242b739807a0fb3e77747f644f2fb3f67f4403fafce2286b431a
Port: <none>
Host Port: <none>
Command:
/var/run/argo/argoexec
emissary
--loglevel
info
--log-format
text
--
bash
-c
Args:
while :
do
sleep 5
done
State: Running
Started: Wed, 03 May 2023 11:38:35 -0600
Ready: True
…
sidecar-2:
Container ID: containerd://38d014aef53551e1e1d58b310e0640783ab42347282d789a517a68d7fa292ac7
Image: debian:bullseye-slim
Image ID: docker.io/library/debian@sha256:f4da3f9b18fc242b739807a0fb3e77747f644f2fb3f67f4403fafce2286b431a
Port: <none>
Host Port: <none>
Command:
/var/run/argo/argoexec
emissary
--loglevel
info
--log-format
text
--
bash
-c
Args:
while :
do
sleep 5
done
State: Running
Started: Wed, 03 May 2023 11:38:35 -0600
Ready: True
…
sidecar-3:
Container ID: containerd://924cd7afe76ec251b3100862b9b5a8e85d822636c86ab1a1d24e234f402e1577
Image: debian:bullseye-slim
Image ID: docker.io/library/debian@sha256:f4da3f9b18fc242b739807a0fb3e77747f644f2fb3f67f4403fafce2286b431a
Port: <none>
Host Port: <none>
Command:
/var/run/argo/argoexec
emissary
--loglevel
info
--log-format
text
--
bash
-c
Args:
while :
do
sleep 5
done
State: Running
Started: Wed, 03 May 2023 11:38:35 -0600
Ready: True
FYI: @JPZ13 @caelan-io @jmeridth
Also running into this issue on 3.3.10, 3.4.0, 3.4.7, and 3.4.8. (I've run into multiple permissions-related issues when installing from fresh, so I wonder if this is a similar issue.)
@bencompton's workaround works, but is obviously not ideal (thanks a lot though!), and lends credence to this being a permissions issue. Are any of the maintainers running into this issue as well? It's very easily reproducible for me.
I've noticed the pods are being created outside of the argo namespace. This created an issue for me when setting up an artifact repo, as the documentation created credentials in the argo namespace, but they weren't accessible. There have also been a few other similar issues. (I'm new to K8s, so this may be perfectly normal and the documentation was wrong.) Though that wouldn't explain why it works outside of DAG/Steps.
This works, as there's no DAG or Steps:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sidecar-
spec:
  entrypoint: sidecar-example
  templates:
    - name: sidecar-example
      container:
        image: alpine:latest
        command: [sh, -c]
        args: ["
          apk update &&
          apk add curl &&
          until curl -XPOST 'http://127.0.0.1:8086/query' --data-urlencode 'q=CREATE DATABASE mydb' ; do sleep .5; done &&
          for i in $(seq 1 20);
          do curl -XPOST 'http://127.0.0.1:8086/write?db=mydb' -d \"cpu,host=server01,region=uswest load=$i\" ;
          sleep .5 ;
          done
        "]
      sidecars:
        - name: influxdb
          image: influxdb:1.2
          command: [influxd]
The following does not work
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: workflow-example-
spec:
  entrypoint: workflow
  templates:
    - name: workflow
      dag:
        tasks:
          - name: sidecar-test
            inline:
              container:
                image: alpine:latest
                command: [sh, -c]
                args: ["
                  apk update &&
                  apk add curl &&
                  until curl -XPOST 'http://127.0.0.1:8086/query' --data-urlencode 'q=CREATE DATABASE mydb' ; do sleep .5; done &&
                  for i in $(seq 1 20);
                  do curl -XPOST 'http://127.0.0.1:8086/write?db=mydb' -d \"cpu,host=server01,region=uswest load=$i\" ;
                  sleep .5 ;
                  done
                "]
              sidecars:
                - name: influxdb
                  image: influxdb:1.2
                  command: [influxd]
This does work, due to the workaround
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: workflow-example-
spec:
  entrypoint: workflow
  templates:
    - name: workflow
      dag:
        tasks:
          - name: sidecar-test
            inline:
              container:
                image: alpine:latest
                command: [sh, -c]
                args: ["
                  apk update &&
                  apk add curl &&
                  until curl -XPOST 'http://127.0.0.1:8086/query' --data-urlencode 'q=CREATE DATABASE mydb' ; do sleep .5; done &&
                  for i in $(seq 1 20);
                  do curl -XPOST 'http://127.0.0.1:8086/write?db=mydb' -d \"cpu,host=server01,region=uswest load=$i\" ;
                  sleep .5 ;
                  done
                  && echo MTU= | base64 -d > /var/run/argo/ctr/influxdb/signal # WITHOUT THIS COMMAND, THE SIDECAR REMAINS RUNNING
                "]
              sidecars:
                - name: influxdb
                  image: influxdb:1.2
                  command: [influxd]
Thank you for posting updates on this issue and confirming the workaround works @bencompton @McPonolith
We have several other bug fixes ahead of this in the priority queue. If anyone has further suggestions for a solution, please comment and/or take this on in the meantime.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Not stale. This is a real bug. Contributions welcomed!
We are facing this issue as well. It looks like the change of "Kill Sidecar" from distributed to centralized may make it easier to hit this issue at scale.
I believe I have also encountered this issue with an auto-injected custom Istio sidecar we use; call it custom-istio.
When I try the suggested solution:
sleep 10
echo MTU= | base64 -d > /var/run/argo/ctr/custom-istio/signal
I get an error that the /var/run/argo/ctr/custom-istio/ folder does not exist. Is this expected?
EDIT: Ah, injected means Argo is not aware of the sidecar (it is not explicit in the YAML), which means this workaround most likely will not work.
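That matches what the missing folder implies: emissary only creates `/var/run/argo/ctr/<name>/` for containers it manages, i.e. those declared in the Workflow spec, so injected sidecars have no signal directory. A small defensive sketch (the `signal_if_managed` helper is my own, and `custom-istio` is a placeholder name):

```shell
# Attempt the signal-file workaround only when emissary actually manages the
# container; for injected sidecars the directory does not exist and the
# write would fail.
signal_if_managed() {
  name="$1"
  ctr_dir="${2:-/var/run/argo/ctr}"
  if [ -d "$ctr_dir/$name" ]; then
    printf '15' > "$ctr_dir/$name/signal"  # 15 == SIGTERM
    return 0
  fi
  echo "no signal dir for '$name'; likely an injected sidecar" >&2
  return 1
}
```

Injected sidecars like Istio's would need a different mechanism (e.g. the proxy's own shutdown endpoint) rather than emissary's signal files.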
Pre-requisites
:latest
What happened/what you expected to happen?
What happened:
When running 10 instances of a workflow that spins up 200 parallel pods with sidecars (2000 total pods), some of the pods don't complete until several minutes after the main container completes (witnessed up to 30+ minute delay). Instead, the main and wait containers complete, but the sidecars continue running afterwards.
Expectation:
The sidecars should be terminated (or at least receive a terminate signal) within seconds after the main container completes.
Version
3.4.5
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
workflow-controller.log
Logs from in your workflow's wait container
Additional context
Environment: AWS EKS, running Karpenter with c6i.32xlarge instances.
When the issue occurs, pods look like this:
describe-pod.log
Notes
Can reproduce this issue by running a single instance of the workflow. I tested a single instance in a completely separate, smaller cluster and noted that some pods have a duration of < 1m while others run over 3m, with the main container usually completing within seconds and the sidecars still running for minutes afterwards.
Just had a single instance of this workflow take 21m when running in the original, larger cluster. These absurdly long runtimes seem to occur after running 10 concurrent instances (runtime is usually ~1m). The pods were all running within 30s, and the main containers were completing quickly.