argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Etcd gets full with ~500 workflows. #12802

Open leryn1122 opened 3 months ago

leryn1122 commented 3 months ago

Pre-requisites

What happened/what did you expect to happen?

We run ~500 workflows and ~500 pods concurrently as offline tasks in our production environment. etcd rapidly filled up, reaching 8G in size. As a result, etcd and the apiserver became unavailable, and the Argo workflow controller kept restarting. Based on monitoring and metrics, our team concluded that etcd and the apiserver can become unavailable when running and pending workflows flood into etcd.

For now, the team’s solutions are:

We expect that Argo workflows do not flood etcd or impact the stability of the whole cluster.

Version

v3.4.10

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Limited by NDA.

Logs from the workflow controller

time="2024-03-06T01:59:48.872Z" level=info msg="Mark node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62[0].xxxxxxx[1].subtask(20:raw/587/2023/12/6/1734100703048318977/ros2/20231206104000_20231206104046_5m/raw_587_20231206104000_20231206104046.db3)[2].xxxxxxx[0].xxxxxxx[0].xxxxxxx(0) as Pending, due to: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=822, limited: count/pods=600" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.875Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2759181252 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3992814895]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.872Z" level=info msg="Transient error: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=822, limited: count/pods=600"
time="2024-03-06T01:59:48.187Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="slice-587-20231201141424-20240128125944-20240201143506-fen7hx62[0].xxxxxxx[1].subtask(21:raw/587/2024/1/26/1751125289026650113/ros2/20240126140500_20240126141000_5m/raw_587_20240126140500_20240126141000.db3)[2].xxxxxxx[0].xxxxxxx[1].vision2d-lidar-fusion-match(0)" nodePhase=Pending recentlyStarted=false workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.185Z" level=info msg="Processing workflow" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.195Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-814885821 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2761991884]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.875Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-765220508 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3024955863]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.206Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-239402416 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-193874387]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.873Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3590128111 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.187Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="slice-587-20231201141424-20240128125944-20240201143506-fen7hx62[0].xxxxxxx[1].subtask(20:raw/587/2023/12/6/1734100703048318977/ros2/20231206104000_20231206104046_5m/raw_587_20231206104000_20231206104046.db3)[2].xxxxxxx[0].xxxxxxx[0].xxxxxxx(0)" nodePhase=Pending recentlyStarted=false workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.186Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.206Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1845491224 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1265119851]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.872Z" level=info msg="node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944 message: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=822, limited: count/pods=600" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.872Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-285032592 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.873Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3624891064 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.187Z" level=info msg="node unchanged" nodeID=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2773247126
time="2024-03-06T01:59:49.647Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.624Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1031877465 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.623Z" level=info msg="Transient error: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1351688902\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=821, limited: count/pods=600"
time="2024-03-06T01:59:49.648Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.632Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.633Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62

Logs from your workflow's wait container

N/A
agilgur5 commented 3 months ago

We run ~500 workflows and ~500 pods concurrently

So ~2500 concurrent Pods total?

We expect that Argo workflows do not flood etcd or impact the stability of the whole cluster.

Running many Workflows, or large Workflows, may indeed stress the k8s API and etcd. That's not really an Argo limitation; that's how k8s works with shared control plane resources.

There are a few features you may want to use that are well documented:

EDIT: Some less documented options include:
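
For orientation, a minimal sketch of the kind of controller-level settings being pointed at here, as they would appear in the workflow-controller-configmap (key names come from the documented configmap; the values are illustrative, not recommendations):

# workflow-controller-configmap data.config (illustrative excerpt)
parallelism: 100               # cluster-wide cap on concurrently running Workflows
namespaceParallelism: 50       # per-namespace cap on concurrently running Workflows
workflowDefaults:              # defaults merged into every submitted Workflow
  spec:
    ttlStrategy:
      secondsAfterCompletion: 3600   # delete the live Workflow object an hour after it finishes
    podGC:
      strategy: OnPodCompletion      # delete Pods as soon as they complete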

leryn1122 commented 3 months ago

Status:

Recent Efforts:

agilgur5 commented 3 months ago
  • Workflows behave differently depending on the business use case. Some of them finish in several seconds, while others run for a few hours.

Yea Workflows in general have a lot of diverse use-cases, so capacity planning can be challenging. Configurations that are ideal for short Workflows are not necessarily ideal for long Workflows, etc.

Archiving should be asynchronous, I think

Archiving is asynchronous. The entire Controller is async, it's all goroutines.

Archiving goes slowly, or even gets stuck, after argo_archived_workflows reached ~250G.

This sounds like it might be getting CPU starved? Without detailed metrics etc it's pretty hard to dive into details.

It also sounds a bit like #11948, which was fixed in 3.4.14 and later. Not entirely the same though from the description (you have an etcd OOM vs a Controller OOM and your archive is growing vs your live Workflows).

  • [x] I can confirm the issue exists when I tested with :latest

v3.4.10

You also checked this box, but are not on latest. Please fill out the issue template accurately, those questions are asked for very good reasons.

We wrote a cronjob that daily deletes rows from argo_archived_workflows for workflows finished weeks ago, to prevent archiving from getting stuck.

You can use archiveTTL for this as a built-in option.
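
For reference, a minimal sketch of where that option lives in the workflow-controller-configmap (the value is just an example); the controller itself then garbage-collects archived rows older than the TTL:

persistence:
  archive: true
  archiveTTL: 7d    # archived workflows older than this are deleted by the controller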

It works but does not save etcd from stress.

If you're creating as many Workflows as you're deleting, that sounds possible. Again, you didn't provide metrics, but those would be ideal to track when doing any sort of performance tuning.

A rapid/sudden change leads to etcd becoming unstable, in my experience.

Are you on a managed k8s control plane provider? E.g. EKS, AKS, GKE, etc? Those try to auto-scale and have pre-set limits, so that can certainly happen. If you're using a self-managed control plane (e.g. kOps), you can vertically scale etcd and the rest of the k8s control plane (as well as horizontally scale to an extent, as etcd is fully consistent and so will eventually hit an upper bound).

We've urged developers to reduce the size of workflow templates from 200K to smaller ones.

I listed this in my previous comment -- nodeStatusOffload can help with this.

leryn1122 commented 3 months ago

Sorry, I was limited by the NDA; I am going to share more details now. Configuration and thresholds have varied over the past months.

You can use archiveTTL for this as a built-in option.

Current archiveTTL is 7d.

Standalone MySQL instance quota: 60-80G memory and a local NVMe disk.

When the size of archived workflows within 30-45 days reached ~250G, queries and writes on the argo_archived_workflows table became slow. A single SQL statement deleting workflows took 2-3 minutes. We tried table indexes and MySQL hints, but with no evident effect. So I rebuilt MySQL and added the hack cronjob mentioned before, and now it runs stably.

Are you on a managed k8s control plane provider? E.g. EKS, AKS, GKE, etc? Those try to auto-scale and have pre-set limits, so that can certainly happen. If you're using a self-managed control plane (e.g. kOps), you can vertically scale etcd and the rest of the k8s control plane (as well as horizontally scale to an extent, as etcd is fully consistent and so will eventually hit an upper bound).

Self-managed cluster:

We now run etcd compaction and defragmentation regularly, triggered by the DB size metrics. That is a hack.
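
For what it's worth, the compaction half of that can be made automatic by etcd itself; a minimal sketch of the relevant flags on a self-managed etcd (values are illustrative, and defragmentation still has to be run separately, e.g. etcdctl defrag on a schedule):

# etcd args (illustrative excerpt from a static pod manifest)
- --auto-compaction-mode=periodic
- --auto-compaction-retention=30m       # keep only the last 30 minutes of key history
- --quota-backend-bytes=8589934592      # 8 GiB backend quota (etcd's default is 2 GiB)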

I listed this in my previous comment -- nodeStatusOffload can help with this.

It is enabled. Here is the related config I can expose:

Persistence:

connectionPool:
  maxIdleConns: 100
  maxOpenConns: 0
  connMaxLifetime: 0s
nodeStatusOffLoad: true
archive: true
archiveTTL: 7d

Workflow defaults:

spec:
  ttlStrategy:
    secondsAfterCompletion: 0
    secondsAfterSuccess: 0
    secondsAfterFailure: 0
  podGC:
    strategy: OnPodCompletion
  parallelism: 3

Workflow controller args

args:
  - '--configmap'
  - workflow-controller-configmap
  - '--executor-image'
  - 'xxxxx/argoexec:v3.4.10'
  - '--namespaced'
  - '--workflow-ttl-workers=8'      # 4->8
  - '--pod-cleanup-workers=32'  # 4->32
  - '--workflow-workers=64'        # 32->64
  - '--qps=50'
  - '--kube-api-burst=90'  # 60->90
  - '--kube-api-qps=60'    # 40->60

Executor config

imagePullPolicy: IfNotPresent
resources:
  requests:
    cpu: 10m
    memory: 64Mi
  limits:
    cpu: 1000m
    memory: 512Mi

Below are some redacted etcd and Argo metrics screenshots: the first shows the etcd DB size varying rapidly, and the second shows the count of workflows and pods in the argo namespace over the same period.

[Screenshot 1: etcd DB size over time] [Screenshot 2: workflow and pod counts in the argo namespace]

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

tooptoop4 commented 1 month ago

@leryn1122 can u include graph of apiserver_storage_objects{resource="events"} ?

i'm facing same issue and raised https://github.com/argoproj/argo-workflows/issues/13042 + https://github.com/argoproj/argo-workflows/issues/13089

i wonder if ARGO_PROGRESS_PATCH_TICK_DURATION as 0 will help to make less PATCH events too

are u setting https://github.com/argoproj/argo-workflows/blob/026b14ea418ccd98025a1343fca463ca58b1bef0/docs/workflow-controller-configmap.yaml#L36-L42 to false too?
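
For context, both of those are set via the workflow-controller-configmap; a minimal sketch, assuming the documented executor.env and nodeEvents keys (values illustrative):

executor:
  env:
    - name: ARGO_PROGRESS_PATCH_TICK_DURATION
      value: "0"              # per the suggestion above, 0 should stop the periodic progress PATCHes
nodeEvents:
  enabled: false              # stop emitting a Kubernetes Event for every node phase change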

leryn1122 commented 2 weeks ago

@leryn1122 can u include graph of apiserver_storage_objects{resource="events"} ?

i'm facing same issue and raised #13042 + #13089

i wonder if ARGO_PROGRESS_PATCH_TICK_DURATION as 0 will help to make less PATCH events too

are u setting

https://github.com/argoproj/argo-workflows/blob/026b14ea418ccd98025a1343fca463ca58b1bef0/docs/workflow-controller-configmap.yaml#L36-L42

to false too?

  1. apiserver_storage_objects{resource="events"} ranges around 90k ~ 15k, with a maximum of 30k+; the current cluster is only used to run Argo workflows.
  2. nodeEvents is enabled.
  3. I wrote an etcd-jdbc tool, which shows that:
    • workflows.argoproj.io objects are patched frequently as workflow status changes, so their etcd revisions grow rapidly, e.g. a single workflow has 370+ versions.
    • The count of workflowtaskresult.argoproj.io objects also increases rapidly; on a test Argo cluster I was tuning, it reached 35k+ entries.

Possible solutions (they work for my team for now, but are not guaranteed to be a good fit for you):

agilgur5 commented 2 weeks ago

Oh I forgot to mention earlier, there is also the environment variable ALWAYS_OFFLOAD_NODE_STATUS that could help in this scenario as well
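
A minimal sketch of where that would go, as an environment variable on the workflow-controller Deployment's container (it relies on persistence/nodeStatusOffLoad being configured, which it already is here):

# workflow-controller Deployment container (illustrative excerpt)
env:
  - name: ALWAYS_OFFLOAD_NODE_STATUS
    value: "true"    # always offload node status to the persistence DB, not only when the Workflow object grows too large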

tooptoop4 commented 2 weeks ago

@leryn1122 can u see what exactly is being changed on workflows.argoproj.io/workflowtaskresult.argoproj.io ? also is it every 10 seconds?