argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Succeeded workflows are not cleaned even if TTL is set 0 #10947

Open CiprianAnton opened 1 year ago

CiprianAnton commented 1 year ago

Pre-requisites

What happened/what you expected to happen?

I've noticed this behavior in both v3.4.5 and v3.4.7. After the Argo controller restarts, there is a point (approximately 20 minutes in) where the controller temporarily stops cleaning up workflows. Workflows stay in the Succeeded state for approximately 15 minutes, after which cleanup resumes.

After this timeframe, everything seems to go back to normal, with succeeded workflows being cleaned up immediately. I've noticed this happens only once after the controller is restarted.

The configuration we use for TTL:

    ttlStrategy:
      secondsAfterFailure: 86400
      secondsAfterSuccess: 0

I also have a graph that shows the evolution of workflows in the cluster (attached as an image).

We don't use artifactGC, so I've ruled out https://github.com/argoproj/argo-workflows/issues/10840

Version

v3.4.7

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
      You can also run it in Python: https://couler-proj.github.io/couler/examples/#hello-world
spec:
  entrypoint: whalesay
  ttlStrategy:
    secondsAfterSuccess: 0
    secondsAfterFailure: 86400
  securityContext:
    runAsNonRoot: true
    runAsUser: 8737 #; any non-root user
  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]

PowerShell script that creates workflows

$maximumNumberOfWorkflowsToSchedule = 10
$numberOfWorkflowsToScheduleAtOnce = 4
$namespace = "default"

while ($true)
{
    $currentWorkflows = &kubectl get workflows --no-headers -n $namespace
    $numberOfCurrentWorkflows = ($currentWorkflows | Measure-Object -Line).Lines
    Write-Host "Number of workflows in cluster: $numberOfCurrentWorkflows"

    if ($numberOfCurrentWorkflows -le $maximumNumberOfWorkflowsToSchedule)
    {
        for ($i = 0; $i -lt $numberOfWorkflowsToScheduleAtOnce; $i++)
        {
            &argo submit -n $namespace ./hello-world.yaml
        }
    }
    else
    {
        Write-Host "Too many workflows in cluster. Check succeeded workflows are cleaned up."
    }
    Start-Sleep -Seconds 5
}

Logs from the workflow controller

I have some logs:

Line 11652: time="2023-04-20T11:40:06.810Z" level=info msg="Queueing Succeeded workflow default/94f2ce40-879f-41eb-a767-9b7cb93b541b-79xxp for delete in 0s due to TTL"
Line 16234: time="2023-04-20T11:56:39.390Z" level=info msg="Queueing Succeeded workflow default/f05c891b-3532-4197-9910-c6e7bc17c4e0-42l4v for delete in -15m52s due to TTL"
Line 16236: time="2023-04-20T11:56:39.391Z" level=info msg="Queueing Succeeded workflow default/7a41f47a-b60b-48f5-b7e4-9dc43ab5d7c6-5mzcs for delete in -19m5s due to TTL"
Line 16238: time="2023-04-20T11:56:39.392Z" level=info msg="Queueing Succeeded workflow default/ff42bc93-5145-4d84-9c78-a8a3f89108a5-2rkqd for delete in -18m52s due to TTL"
Line 16240: time="2023-04-20T11:56:39.393Z" level=info msg="Queueing Succeeded workflow default/ab5f6918-21da-4d5f-80f8-d4b18efa4df2-4fmvk for delete in -19m12s due to TTL"
Line 16242: time="2023-04-20T11:56:39.394Z" level=info msg="Queueing Succeeded workflow default/ccc7633b-2784-4f5a-91e1-fd279d4cfe97-psr8n for delete in -18m7s due to TTL"
Line 16244: time="2023-04-20T11:56:39.395Z" level=info msg="Queueing Succeeded workflow default/f4b9cc14-2c28-422d-bfbe-1ab4c41727a3-vbdkl for delete in -17m3s due to TTL"
Line 16246: time="2023-04-20T11:56:39.396Z" level=info msg="Queueing Succeeded workflow default/cc55b8c5-efdb-40d7-ae32-64d32c210336-k9ldv for delete in -16m3s due to TTL"
Line 16248: time="2023-04-20T11:56:39.396Z" level=info msg="Queueing Succeeded workflow default/b4f01c20-bfb7-4def-b7d4-5418c343df5d-gnpth for delete in -18m3s due to TTL"
Line 16250: time="2023-04-20T11:56:39.397Z" level=info msg="Queueing Succeeded workflow default/1cdbdfff-454b-43e0-b08a-1eb6268a4ff1-td8d6 for delete in -18m2s due to TTL"
Line 16252: time="2023-04-20T11:56:39.398Z" level=info msg="Queueing Succeeded workflow default/becf0184-066f-47c9-9f86-110b66165190-mxt96 for delete in -18m10s due to TTL"
Line 16254: time="2023-04-20T11:56:39.399Z" level=info msg="Queueing Succeeded workflow default/877c8e87-5898-4651-87b4-9f66442b7075-2v54f for delete in -17m5s due to TTL"
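For context on what the negative durations mean, here is a minimal sketch, not the actual controller code (expireIn and enqueueForDelete are hypothetical names): the remaining TTL is computed from the workflow's finish time and the key is put on a client-go delaying workqueue, where AddAfter with a zero or negative delay adds the item immediately. A line like "for delete in -17m5s due to TTL" therefore suggests the workflow only reached the TTL controller about 17 minutes after it should have been deleted; the delay is before enqueueing, not inside the queue.

package main

import (
    "log"
    "time"

    "k8s.io/client-go/util/workqueue"
)

// expireIn computes how long until a finished workflow's TTL expires;
// a negative result means the workflow should already have been deleted.
func expireIn(finishedAt time.Time, secondsAfterSuccess int32, now time.Time) time.Duration {
    return finishedAt.Add(time.Duration(secondsAfterSuccess) * time.Second).Sub(now)
}

// enqueueForDelete puts the workflow key on a delaying workqueue; AddAfter
// with a zero or negative delay adds the item to the queue immediately.
func enqueueForDelete(q workqueue.DelayingInterface, key string, d time.Duration) {
    log.Printf("Queueing Succeeded workflow %s for delete in %v due to TTL", key, d)
    q.AddAfter(key, d)
}

func main() {
    q := workqueue.NewDelayingQueue()
    defer q.ShutDown()

    // A workflow that finished 17 minutes ago with secondsAfterSuccess: 0
    // yields a delay of roughly -17m, mirroring the log lines above.
    finishedAt := time.Now().Add(-17 * time.Minute)
    d := expireIn(finishedAt, 0, time.Now())
    enqueueForDelete(q, "default/hello-world-abc12", d)
}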

Logs from in your workflow's wait container

Don't have.
sarabala1979 commented 1 year ago

@CiprianAnton Can you check the k8s API log to make sure the delete call from the workflow controller succeeded?

CiprianAnton commented 1 year ago

After Queueing Succeeded workflow default/877c8e87-5898-4651-87b4-9f66442b7075-2v54f for delete in -17m5s due to TTL appears in the controller logs, the workflow gets deleted from the cluster. kubectl and the Argo Server won't show it again.

CiprianAnton commented 1 year ago

I can consistently reproduce this on multiple clusters. I'm attaching a workflow based on the hello-world example and a PowerShell script that schedules workflows continuously.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
      You can also run it in Python: https://couler-proj.github.io/couler/examples/#hello-world
spec:
  entrypoint: whalesay
  ttlStrategy:
    secondsAfterSuccess: 0
    secondsAfterFailure: 86400
  securityContext:
    runAsNonRoot: true
    runAsUser: 8737 #; any non-root user
  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]

PowerShell script

$maximumNumberOfWorkflowsToSchedule = 10
$numberOfWorkflowsToScheduleAtOnce = 4
$namespace = "default"

while ($true)
{
    $currentWorkflows = &kubectl get workflows --no-headers -n $namespace
    $numberOfCurrentWorkflows = ($currentWorkflows | Measure-Object -Line).Lines
    Write-Host "Number of workflows in cluster: $numberOfCurrentWorkflows"

    if ($numberOfCurrentWorkflows -le $maximumNumberOfWorkflowsToSchedule)
    {
        for ($i = 0; $i -lt $numberOfWorkflowsToScheduleAtOnce; $i++)
        {
            &argo submit -n $namespace ./hello-world.yaml
        }
    }
    else
    {
        Write-Host "Too many workflows in cluster. Check succeeded workflows are cleaned up."
    }
    Start-Sleep -Seconds 5
}

This should reproduce approximately 20 minutes after the Argo controller has started. Remember to restart the controller in order to reproduce it.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

CiprianAnton commented 1 year ago

Update: the issue reproduces for failed workflows as well; I don't think the state matters. Pod cleanup is also affected. Based on log lines like Queueing Succeeded workflow default/877c8e87-5898-4651-87b4-9f66442b7075-2v54f for delete in -17m5s due to TTL, I think the problem is on the enqueue side: workflows are not scheduled to be cleaned up.

I'm inclined to believe the issue comes from the k8s client. There is also a hardcoded workflowResyncPeriod of 20 minutes, which would explain how the issue self-heals over time.
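
To illustrate what I mean, a minimal client-go sketch (not Argo's actual controller code; it assumes a local kubeconfig path and uses Pods instead of Workflows): a shared informer created with a resync period re-delivers every cached object to its event handlers on each resync, even if nothing changed on the API server. If the controller missed enqueueing some finished workflows after startup, the 20-minute resync would hand them back to the handlers, which would explain the self-healing.

package main

import (
    "fmt"
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Hypothetical kubeconfig path, just for the sketch.
    config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(config)

    // 20-minute resync: every object in the informer cache is periodically
    // re-delivered to UpdateFunc, even if it did not change on the server.
    factory := informers.NewSharedInformerFactory(clientset, 20*time.Minute)
    podInformer := factory.Core().V1().Pods().Informer()

    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        UpdateFunc: func(oldObj, newObj interface{}) {
            pod := newObj.(*corev1.Pod)
            // On a resync, oldObj and newObj are the same cached object; a
            // controller that re-enqueues here gets a second chance at items
            // it missed earlier.
            fmt.Printf("seen (or re-seen on resync): %s/%s\n", pod.Namespace, pod.Name)
        },
    })

    stop := make(chan struct{})
    factory.Start(stop)
    factory.WaitForCacheSync(stop)
    <-stop // run until the process is killed
}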

I'm not familiar with the codebase, so this is a question for others who might know (like @terrytangyuan).

The problem also reproduces on 3.4.11. Comparing Argo v3.3.8 with 3.4.11, the k8s.io/client-go dependency changed from v0.23.3 to v0.24.3.

CiprianAnton commented 1 year ago

It might also be worth upgrading the client-go package to a newer version.

CiprianAnton commented 12 months ago

I can confirm the issue was introduced in https://github.com/argoproj/argo-workflows/commit/39b7f91392c4c0a0a7c167b5ad7c89b1382df68d, when k8s.io/client-go was upgraded from v0.23.5 to v0.24.3.

terrytangyuan commented 11 months ago

It's been there for a while, so we may just wait for the fix in https://github.com/kubernetes/kubernetes/issues/127964

tooptoop4 commented 2 weeks ago

@CiprianAnton did you find a workaround (downgrade the k8s client, shorten the 20m workflowResyncPeriod, increase RECENTLY_STARTED_POD_DURATION, set INFORMER_WRITE_BACK to false, something else)? I think I'm facing a similar issue in https://github.com/argoproj/argo-workflows/issues/13671. I also get The sharedIndexInformer has started, run more than once is not allowed in my logs.

CiprianAnton commented 2 weeks ago

@tooptoop4 The workaround I used was to just ignore those succeeded workflows; after 20 minutes it self-heals. This issue comes from the k8s Go client and happens once after the pod restarts.