CiprianAnton opened this issue 1 year ago
@CiprianAnton Can you check the k8s API server log to make sure the delete call from the workflow controller succeeded?
After `Queueing Succeeded workflow default/877c8e87-5898-4651-87b4-9f66442b7075-2v54f for delete in -17m5s due to TTL` appears in the controller logs, the workflow gets deleted from the cluster. kubectl and the Argo Server won't show it again.
I can reproduce this consistently, on multiple clusters. I'm attaching a workflow based on the hello-world example and a PowerShell script that schedules workflows continuously.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
      You can also run it in Python: https://couler-proj.github.io/couler/examples/#hello-world
spec:
  entrypoint: whalesay
  ttlStrategy:
    secondsAfterSuccess: 0
    secondsAfterFailure: 86400
  securityContext:
    runAsNonRoot: true
    runAsUser: 8737 # any non-root user
  templates:
    - name: whalesay
      container:
        image: docker/whalesay:latest
        command: [cowsay]
        args: ["hello world"]
PowerShell script
$maximumNumberOfWorkflowsToSchedule = 10
$numberOfWorkflowsToScheduleAtOnce = 4
$namespace = "default"

while ($true)
{
    $currentWorkflows = &kubectl get workflows --no-headers -n $namespace
    $numberOfCurrentWorkflows = ($currentWorkflows | Measure-Object -Line).Lines
    Write-Host "Number of workflows in cluster: $numberOfCurrentWorkflows"

    if ($numberOfCurrentWorkflows -le $maximumNumberOfWorkflowsToSchedule)
    {
        for ($i = 0; $i -lt $numberOfWorkflowsToScheduleAtOnce; $i++)
        {
            &argo submit -n $namespace ./hello-world.yaml
        }
    }
    else
    {
        Write-Host "Too many workflows in cluster. Check succeeded workflows are cleaned up."
    }

    Start-Sleep -Seconds 5
}
Approximately 20 minutes after the Argo controller starts, this should reproduce. Remember to restart the controller in order to reproduce it.
Update: the issue reproduces for failed workflows as well; I don't think the state matters. Pod cleanup is also affected. Based on log lines like `Queueing Succeeded workflow default/877c8e87-5898-4651-87b4-9f66442b7075-2v54f for delete in -17m5s due to TTL`, I think the problem is on the enqueue side: workflows are not scheduled to be cleaned up.
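For context, here is a minimal sketch (my own illustration, not the actual Argo TTL controller code) of how a TTL-based delete is typically scheduled with client-go's delaying workqueue. A negative delay such as -17m5s just means the TTL already expired, so the item should be enqueued for deletion immediately; the workflow key below is made up.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Hypothetical TTL bookkeeping: the workflow finished 17 minutes ago and
	// secondsAfterSuccess is 0, so its expiry time is already in the past.
	finishedAt := time.Now().Add(-17 * time.Minute)
	expiry := finishedAt            // + 0s TTL
	delay := time.Until(expiry)     // negative, roughly -17m

	queue := workqueue.NewDelayingQueue()
	defer queue.ShutDown()

	// AddAfter with a non-positive duration adds the item immediately, so a
	// "for delete in -17m5s" log line should translate into an immediate enqueue.
	fmt.Printf("queueing workflow for delete in %v\n", delay)
	queue.AddAfter("default/hello-world-abc12", delay)

	item, _ := queue.Get()
	fmt.Println("dequeued for deletion:", item)
	queue.Done(item)
}
```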
I'm inclined to believe the issue comes from the k8s client. There is also a hardcoded workflowResyncPeriod of 20 minutes, which would explain why the issue self-heals over time.
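To illustrate the resync mechanism (a sketch of my own, using a pod informer rather than Argo's workflow informer): the resync period is the second argument to the informer factory, and every object already in the cache is re-delivered to the update handler on each resync, so a 20-minute resync would eventually re-surface workflows whose earlier events were missed.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Sketch only: build a client from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Second argument is the resync period: cached objects are periodically
	// re-delivered to UpdateFunc even if nothing changed in the API server.
	factory := informers.NewSharedInformerFactory(client, 20*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			pod := newObj.(*corev1.Pod)
			// On each resync this fires for every cached object, so anything a
			// controller failed to enqueue earlier gets another chance here.
			fmt.Println("update/resync:", pod.Namespace+"/"+pod.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Keep running long enough to observe at least one resync.
	time.Sleep(21 * time.Minute)
}
```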
I'm not familiar with the codebase, so a question for others who might know (like @terrytangyuan): `wfInformer.Run` is being called twice on the same object, once from `controller.go` and once from `gc_controller.go`. Is this use case valid? A warning is being reported: `W1018 10:40:37.539223 1 shared_informer.go:401] The sharedIndexInformer has started, run more than once is not allowed`
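To make the double-Run question concrete, here is a self-contained sketch (again my own, with a pod informer standing in for `wfInformer`) of what calling `Run` twice on the same `sharedIndexInformer` does in client-go v0.24+: the second call is rejected and only produces the warning above.

```go
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// One shared informer instance, built directly as a stand-in for wfInformer.
	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, fields.Everything())
	informer := cache.NewSharedIndexInformer(lw, &corev1.Pod{}, 0, cache.Indexers{})

	stop := make(chan struct{})
	defer close(stop)

	// First Run starts the informer normally.
	go informer.Run(stop)

	// Second Run on the same instance: since client-go v0.24 it returns early and
	// logs "The sharedIndexInformer has started, run more than once is not allowed".
	go informer.Run(stop)

	time.Sleep(10 * time.Second)
}
```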
The problem also reproduces on 3.4.11. Comparing Argo v3.3.8 with v3.4.11, `k8s.io/client-go` changed from v0.23.3 to v0.24.3. It might also be worth upgrading the client-go package to a newer version.
I can confirm the issue was introduced in https://github.com/argoproj/argo-workflows/commit/39b7f91392c4c0a0a7c167b5ad7c89b1382df68d, when k8s.io/client-go was upgraded from v0.23.5 to v0.24.3.
It's been there for a while, so we may just wait for the fix in https://github.com/kubernetes/kubernetes/issues/127964
@CiprianAnton did you find a workaround (downgrade the k8s client, shorten the 20m workflowResyncPeriod, increase `RECENTLY_STARTED_POD_DURATION`, set `INFORMER_WRITE_BACK` to false, something else)? I think I'm facing a similar issue in https://github.com/argoproj/argo-workflows/issues/13671; I get `The sharedIndexInformer has started, run more than once is not allowed` in my logs too.
@tooptoop4 The workaround I used was to just ignore those succeeded workflows; after 20 minutes it self-heals. This issue comes from the k8s Go client and happens once after the controller pod restarts.
Pre-requisites
What happened/what you expected to happen?
I've noticed this behavior in both v3.4.5 and v3.4.7. After the Argo controller restarts, there is a point (approximately 20 minutes in) where the controller temporarily stops cleaning up workflows. Workflows stay in the Succeeded state for approximately 15 minutes, after which cleanup resumes.
After this timeframe everything seems to go back to normal, with succeeded workflows being cleaned up immediately. I've noticed this happens only once after the controller is restarted.
The configuration we use for TTL:
I also have a graph that shows the evolution of workflows in the cluster:
We don't use artifactGC, so I've ruled out https://github.com/argoproj/argo-workflows/issues/10840
Version
v3.4.7
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
PowerShell script that creates workflows
Logs from the workflow controller
I have some logs:
Logs from in your workflow's wait container