argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Sidecars not terminated in a timely manner at scale when main container completes #10612

Open bencompton opened 1 year ago

bencompton commented 1 year ago

What happened/what you expected to happen?

What happened:

When running 10 instances of a workflow that spins up 200 parallel pods with sidecars (2000 total pods), some of the pods don't complete until several minutes after the main container completes (witnessed up to 30+ minute delay). Instead, the main and wait containers complete, but the sidecars continue running afterwards.

Expectation:

The sidecars should be terminated (or at least receive a terminate signal) within seconds after the main container completes.

Version

3.4.5

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sidecar-test-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: sidecar-test
            template: sidecar-test
            withSequence:
              count: "200"
    - name: sidecar-test
      container:
        image: debian:bullseye-slim
        resources:
          requests:
            memory: 500Mi
            cpu: 100m
        command: [bash, -c]
        args:
          - |
            sleep 10
      sidecars:
        - name: sidecar-1
          image: debian:bullseye-slim
          resources:
            requests:
              memory: 500Mi
              cpu: 100m
          command: [bash, -c]
          args:
            - |
              while :
              do
                sleep 5
              done
        - name: sidecar-2
          image: debian:bullseye-slim
          resources:
            requests:
              memory: 500Mi
              cpu: 100m
          command: [bash, -c]
          args:
            - |
              while :
              do
                sleep 5
              done
        - name: sidecar-3
          image: debian:bullseye-slim
          resources:
            requests:
              memory: 500Mi
              cpu: 100m
          command: [bash, -c]
          args:
            - |
              while :
              do
                sleep 5
              done

Logs from the workflow controller

workflow-controller.log

Logs from your workflow's wait container

time="2023-02-27T18:00:21 UTC" level=info msg="Starting deadline monitor"
time="2023-02-27T18:01:22 UTC" level=info msg="Main container completed" error="<nil>"
time="2023-02-27T18:01:22 UTC" level=info msg="Deadline monitor stopped"
time="2023-02-27T18:01:22 UTC" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2023-02-27T18:01:22 UTC" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-02-27T18:01:22 UTC" level=info msg="No output parameters"
time="2023-02-27T18:01:22 UTC" level=info msg="No output artifacts"
time="2023-02-27T18:01:22 UTC" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: sidecar-test-h5f6f/sidecar-test-h5f6f-sidecar-test-1545037156/main.log"
time="2023-02-27T18:01:22 UTC" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-02-27T18:01:22 UTC" level=info msg="Saving file to s3" bucket=ci-platform-us01-ci-platform-prod-artifacts endpoint=s3.amazonaws.com key=sidecar-test-h5f6f/sidecar-tes
t-h5f6f-sidecar-test-1545037156/main.log path=/tmp/argo/outputs/logs/main.log
time="2023-02-27T18:01:22 UTC" level=info msg="Save artifact" artifactName=main-logs duration=95.712696ms error="<nil>" key=sidecar-test-h5f6f/sidecar-test-h5f6f-sidecar-test-15
45037156/main.log
time="2023-02-27T18:01:22 UTC" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-02-27T18:01:22 UTC" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-02-27T18:01:22 UTC" level=info msg="Create workflowtaskresults 201"
time="2023-02-27T18:01:22 UTC" level=info msg="Alloc=11112 TotalAlloc=18408 Sys=35282 NumGC=4 Goroutines=11"

Additional context

Environment: AWS EKS, running Karpenter with c6i.32xlarge instances

When the issue occurs, pods look like this:

...
sidecar-test-njbkt-sidecar-test-1683720170                        3/5     NotReady      0               15m
sidecar-test-njbkt-sidecar-test-1749094478                        3/5     NotReady      0               16m
sidecar-test-njbkt-sidecar-test-2256319574                        3/5     NotReady      0               15m
sidecar-test-njbkt-sidecar-test-3413532318                        3/5     NotReady      0               15m
sidecar-test-njbkt-sidecar-test-616410606                         3/5     NotReady      0               16m
sidecar-test-njbkt-sidecar-test-839388366                         3/5     NotReady      0               15m
sidecar-test-r9wpq-sidecar-test-1116727110                        3/5     NotReady      0               15m
sidecar-test-r9wpq-sidecar-test-1866103240                        3/5     NotReady      0               16m
sidecar-test-r9wpq-sidecar-test-1911252918                        3/5     NotReady      0               15m
sidecar-test-r9wpq-sidecar-test-2550845942                        3/5     NotReady      0               15m
sidecar-test-r9wpq-sidecar-test-269508782                         3/5     NotReady      0               15m
sidecar-test-r9wpq-sidecar-test-2869810302                        3/5     NotReady      0               15m
sidecar-test-r9wpq-sidecar-test-3276810206                        3/5     NotReady      0               16m
...
Containers:
  wait:
...
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 28 Feb 2023 08:47:04 -0700
      Finished:     Tue, 28 Feb 2023 08:47:15 -0700
    Ready:          False
...
  main:
...   
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 28 Feb 2023 08:47:04 -0700
      Finished:     Tue, 28 Feb 2023 08:47:14 -0700
    Ready:          False
...
  sidecar-1:
...   
    State:          Running
      Started:      Tue, 28 Feb 2023 08:47:05 -0700
    Ready:          True
    Restart Count:  0
...
  sidecar-2:
...   
    State:          Running
      Started:      Tue, 28 Feb 2023 08:47:06 -0700
    Ready:          True
...
  sidecar-3:
...      
    State:          Running
      Started:      Tue, 28 Feb 2023 08:47:07 -0700
    Ready:          True

describe-pod.log

Notes

bencompton commented 1 year ago

I discovered that having the main container write signal files for the sidecars with SIGTERM messages is a way to work around this issue (echo MTU= | base64 -d emits 15, the numeric value of SIGTERM). I updated the main container with the following code to mimic what the wait container does:

sleep 10

echo MTU= | base64 -d > /var/run/argo/ctr/sidecar-1/signal
echo MTU= | base64 -d > /var/run/argo/ctr/sidecar-2/signal
echo MTU= | base64 -d > /var/run/argo/ctr/sidecar-3/signal

With this workaround, 10 instances of the example workflow finish within a few minutes instead of 20m, with most pods having a duration of < 20s and a few having a duration slightly over 1m.

It would appear that without this hack, something is preventing the wait container from writing the signal files in a timely manner after the main container completes.
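
For anyone with more or differently named sidecars, here is a minimal sketch of how this workaround might be generalized. It assumes (this is an assumption, not confirmed behavior) that the emissary executor creates a /var/run/argo/ctr/<container-name>/ directory for every container in the pod and treats the signal file's contents as the bare signal number:

# Hypothetical generalization of the workaround above: once the main work is
# done, write SIGTERM (15) into every sidecar's signal file.
for dir in /var/run/argo/ctr/*/; do
  name=$(basename "$dir")
  case "$name" in
    main|wait) continue ;;        # assumption: skip the main and wait containers' own directories
  esac
  printf '15' > "${dir}signal"    # same bytes as: echo MTU= | base64 -d
done

This is only a sketch of the same hack, not a fix; the underlying problem remains that the wait container is not writing these signal files promptly at scale.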

caelan-io commented 1 year ago

Thanks a lot for posting this issue w/ logs, a repro workflow, and even proposing a workaround for the time being. ❤️

We're doing some digging of our own into workflow hanging that might be related to this (see: https://github.com/argoproj/argo-workflows/issues/10491). We'll post an update on here if we run into a cause or fix.

caelan-io commented 1 year ago

Also, FYI for SIGTERM related issues: https://github.com/argoproj/argo-workflows/issues/10518 PR: https://github.com/argoproj/argo-workflows/pull/10523

bencompton commented 1 year ago

Thanks for the info @caelan-io! Hmm, I wonder if #10523 fixes this issue.

caelan-io commented 1 year ago

If you weren't having this issue until 3.4.5, then that will likely fix it. If you're able to test that PR or master with your repro workflow, please let us know the results.

JPZ13 commented 1 year ago

Hey @bencompton - have you had a chance to test out if #10523 fixes this issue? If it does, we'll go ahead and close it and see when we can get another patch release out

bencompton commented 1 year ago

My team just updated to 3.4.7 and I re-tested. Unfortunately, I’m still seeing the same issue with the sidecars not terminating in a timely manner. In my team’s workflows, I saw pods whose main containers completed after 20m but kept running until hitting our 1h deadline because the sidecars failed to stop. When re-testing with the minimal reproduction above, I see the same results as before:

…
sidecar-test-vsf8r-sidecar-test-3583951812                       3/5     NotReady    0          14m
sidecar-test-vsf8r-sidecar-test-3593789946                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-359583538                        3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-364746458                        3/5     NotReady    0          12m
sidecar-test-vsf8r-sidecar-test-3674744714                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-3714523468                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-3745816844                       3/5     NotReady    0          14m
sidecar-test-vsf8r-sidecar-test-3748422672                       3/5     NotReady    0          14m
sidecar-test-vsf8r-sidecar-test-3749200898                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-3758682656                       3/5     NotReady    0          14m
sidecar-test-vsf8r-sidecar-test-3773040474                       3/5     NotReady    0          12m
sidecar-test-vsf8r-sidecar-test-3804027916                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-3866083034                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-386948370                        3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-3873634500                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-3876696458                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-3880561570                       3/5     NotReady    0          14m
sidecar-test-vsf8r-sidecar-test-3926852610                       3/5     NotReady    0          12m
…
  main:
    Container ID:  containerd://2f38531a9cd7e0d9213240be0899692cef963d1c7ced83cc926e7baff156f4fb
    Image:         debian:bullseye-slim
    Image ID:      docker.io/library/debian@sha256:f4da3f9b18fc242b739807a0fb3e77747f644f2fb3f67f4403fafce2286b431a
    Port:          <none>
    Host Port:     <none>
    Command:
      /var/run/argo/argoexec
      emissary
      --loglevel
      info
      --log-format
      text
      --
      bash
      -c
    Args:
      sleep 10

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 03 May 2023 11:38:35 -0600
      Finished:     Wed, 03 May 2023 11:38:45 -0600
    Ready:          False
…
  sidecar-1:
    Container ID:  containerd://afedb72e64ef9ec369e3bf2a89382216e7b83d030bb28310b2d9c41096220a4f
    Image:         debian:bullseye-slim
    Image ID:      docker.io/library/debian@sha256:f4da3f9b18fc242b739807a0fb3e77747f644f2fb3f67f4403fafce2286b431a
    Port:          <none>
    Host Port:     <none>
    Command:
      /var/run/argo/argoexec
      emissary
      --loglevel
      info
      --log-format
      text
      --
      bash
      -c
    Args:
      while :
      do
        sleep 5
      done

    State:          Running
      Started:      Wed, 03 May 2023 11:38:35 -0600
    Ready:          True
…
  sidecar-2:
    Container ID:  containerd://38d014aef53551e1e1d58b310e0640783ab42347282d789a517a68d7fa292ac7
    Image:         debian:bullseye-slim
    Image ID:      docker.io/library/debian@sha256:f4da3f9b18fc242b739807a0fb3e77747f644f2fb3f67f4403fafce2286b431a
    Port:          <none>
    Host Port:     <none>
    Command:
      /var/run/argo/argoexec
      emissary
      --loglevel
      info
      --log-format
      text
      --
      bash
      -c
    Args:
      while :
      do
        sleep 5
      done

    State:          Running
      Started:      Wed, 03 May 2023 11:38:35 -0600
    Ready:          True
…
  sidecar-3:
    Container ID:  containerd://924cd7afe76ec251b3100862b9b5a8e85d822636c86ab1a1d24e234f402e1577
    Image:         debian:bullseye-slim
    Image ID:      docker.io/library/debian@sha256:f4da3f9b18fc242b739807a0fb3e77747f644f2fb3f67f4403fafce2286b431a
    Port:          <none>
    Host Port:     <none>
    Command:
      /var/run/argo/argoexec
      emissary
      --loglevel
      info
      --log-format
      text
      --
      bash
      -c
    Args:
      while :
      do
        sleep 5
      done

    State:          Running
      Started:      Wed, 03 May 2023 11:38:35 -0600
    Ready:          True

FYI: @JPZ13 @caelan-io @jmeridth

McPonolith commented 1 year ago

Also running into this issue on 3.3.10, 3.4.0, 3.4.7, and 3.4.8. (I've run into multiple permissions-related issues when installing from fresh, so I wonder if this is a similar issue.)

@bencompton's workaround works, but it's obviously not ideal (thanks a lot though!), and it lends credence to this being a permission issue. Are any of the maintainers running into this issue as well? It's very easily reproducible for me.

I've noticed the pods are being created outside of the argo namespace. This created an issue for me when setting up an artifact repo, as the documentation creates credentials in the argo namespace, but they weren't accessible. There have also been a few other similar issues. (I'm new to K8s, so this may be perfectly normal and the documentation was wrong.) Though that wouldn't explain why it works outside of DAG/Steps.

This works, as there's no DAG or Steps:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sidecar-
spec:
  entrypoint: sidecar-example
  templates:
  - name: sidecar-example
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["
        apk update &&
        apk add curl &&
        until curl -XPOST 'http://127.0.0.1:8086/query' --data-urlencode 'q=CREATE DATABASE mydb' ; do sleep .5; done &&
        for i in $(seq 1 20); 
          do curl -XPOST 'http://127.0.0.1:8086/write?db=mydb' -d \"cpu,host=server01,region=uswest load=$i\" ; 
          sleep .5 ; 
        done
      "]
    sidecars:
    - name: influxdb
      image: influxdb:1.2
      command: [influxd]

The following does not work:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: workflow-example-
spec:
  entrypoint: workflow
  templates:
  - name: workflow
    dag:
      tasks:
      - name: sidecar-test
        inline:
          container:
            image: alpine:latest
            command: [sh, -c]
            args: ["
              apk update &&
              apk add curl &&
              until curl -XPOST 'http://127.0.0.1:8086/query' --data-urlencode 'q=CREATE DATABASE mydb' ; do sleep .5; done &&
              for i in $(seq 1 20); 
                do curl -XPOST 'http://127.0.0.1:8086/write?db=mydb' -d \"cpu,host=server01,region=uswest load=$i\" ; 
                sleep .5 ; 
              done
            "]
          sidecars:
            - name: influxdb
              image: influxdb:1.2
              command: [influxd]

This does work, due to the workaround:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: workflow-example-
spec:
  entrypoint: workflow
  templates:
  - name: workflow
    dag:
      tasks:
      - name: sidecar-test
        inline:
          container:
            image: alpine:latest
            command: [sh, -c]
            args: ["
              apk update &&
              apk add curl &&
              until curl -XPOST 'http://127.0.0.1:8086/query' --data-urlencode 'q=CREATE DATABASE mydb' ; do sleep .5; done &&
              for i in $(seq 1 20); 
                do curl -XPOST 'http://127.0.0.1:8086/write?db=mydb' -d \"cpu,host=server01,region=uswest load=$i\" ; 
                sleep .5 ; 
              done
              && echo MTU= | base64 -d > /var/run/argo/ctr/influxdb/signal # WITHOUT THIS COMMAND, THE SIDECAR REMAINS RUNNING
            "]
          sidecars:
            - name: influxdb
              image: influxdb:1.2
              command: [influxd]

caelan-io commented 1 year ago

Thank you for posting updates on this issue and confirming the workaround works @bencompton @McPonolith

We have several other bug fixes ahead of this in the priority queue. If anyone has further suggestions for a solution, please comment and/or take this on in the meantime.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

terrytangyuan commented 8 months ago

Not stale. This is a real bug. Contributions welcomed!

henrywangx commented 1 month ago

We are facing this issue as well. It looks like "Kill Sidecar" was changed from distributed to centralized, which may make it easier to hit this issue at scale.

https://github.com/argoproj/argo-workflows/pull/8475/files

gmweaver commented 1 month ago

I believe I have also encountered this issue with an auto-injected custom istio sidecar we use, call it custom-istio.

When I try the suggested solution:

sleep 10
echo MTU= | base64 -d > /var/run/argo/ctr/custom-istio/signal

I get an error that the /var/run/argo/ctr/custom-istio/ folder does not exist. Is this expected?

EDIT: Ah, injected means Argo is not aware of the sidecar (it's not explicit in the YAML), which means this workaround most likely will not work.