argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Etcd gets full with ~500 workflows. #12802

Open leryn1122 opened 3 months ago

leryn1122 commented 3 months ago

Pre-requisites

What happened/what did you expect to happen?

We run ~500 workflows and ~500 pods concurrently as offline tasks in our production environment. etcd rapidly filled up, reaching 8G in size. As a result, etcd and the apiserver became unavailable, and the Argo workflow controller kept restarting. Based on monitoring and metrics, our team concluded that etcd and the apiserver can become unavailable when running and pending workflows flood into etcd.

For now, the team’s solutions are:

We expect that Argo workflows do not flood etcd or impact the stability of the whole cluster.

Version

v3.4.10

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Limited by NDA.

Logs from the workflow controller

time="2024-03-06T01:59:48.872Z" level=info msg="Mark node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62[0].xxxxxxx[1].subtask(20:raw/587/2023/12/6/1734100703048318977/ros2/20231206104000_20231206104046_5m/raw_587_20231206104000_20231206104046.db3)[2].xxxxxxx[0].xxxxxxx[0].xxxxxxx(0) as Pending, due to: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=822, limited: count/pods=600" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.875Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2759181252 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3992814895]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.872Z" level=info msg="Transient error: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=822, limited: count/pods=600"
time="2024-03-06T01:59:48.187Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="slice-587-20231201141424-20240128125944-20240201143506-fen7hx62[0].xxxxxxx[1].subtask(21:raw/587/2024/1/26/1751125289026650113/ros2/20240126140500_20240126141000_5m/raw_587_20240126140500_20240126141000.db3)[2].xxxxxxx[0].xxxxxxx[1].vision2d-lidar-fusion-match(0)" nodePhase=Pending recentlyStarted=false workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.185Z" level=info msg="Processing workflow" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.195Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-814885821 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2761991884]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.875Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-765220508 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3024955863]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.206Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-239402416 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-193874387]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.873Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3590128111 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.187Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="slice-587-20231201141424-20240128125944-20240201143506-fen7hx62[0].xxxxxxx[1].subtask(20:raw/587/2023/12/6/1734100703048318977/ros2/20231206104000_20231206104046_5m/raw_587_20231206104000_20231206104046.db3)[2].xxxxxxx[0].xxxxxxx[0].xxxxxxx(0)" nodePhase=Pending recentlyStarted=false workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.186Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.206Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1845491224 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1265119851]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.872Z" level=info msg="node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944 message: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=822, limited: count/pods=600" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.872Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-285032592 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.873Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3624891064 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.187Z" level=info msg="node unchanged" nodeID=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2773247126
time="2024-03-06T01:59:49.647Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.624Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1031877465 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.623Z" level=info msg="Transient error: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1351688902\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=821, limited: count/pods=600"
time="2024-03-06T01:59:49.648Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.632Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.633Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62

Logs from your workflow's wait container

N/A
agilgur5 commented 3 months ago

We run ~500 workflows and ~500 pods concurrently

So ~2500 concurrent Pods total?

We expect that Argo workflows do not flood etcd or impact the stability of the whole cluster.

Running many Workflows, or large Workflows, may indeed stress the k8s API and etcd. That's not really an Argo limitation; that's how k8s works with shared control plane resources.

There are a few features you may want to use that are well documented:

EDIT: Some less documented options include:
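
For orientation, a minimal sketch of the kind of controller-level settings being pointed at here, as they would appear in the workflow-controller-configmap (key names come from the documented configmap; the values are illustrative, not recommendations):

# workflow-controller-configmap data.config (illustrative excerpt)
parallelism: 100               # cluster-wide cap on concurrently running Workflows
namespaceParallelism: 50       # per-namespace cap on concurrently running Workflows
workflowDefaults:              # defaults merged into every submitted Workflow
  spec:
    ttlStrategy:
      secondsAfterCompletion: 3600   # delete the live Workflow object an hour after it finishes
    podGC:
      strategy: OnPodCompletion      # delete Pods as soon as they complete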

leryn1122 commented 3 months ago

Status:

Recent Efforts:

agilgur5 commented 3 months ago
  • Workflows behave differently depending on the business use case. Some of them finish in several seconds, while others run for a few hours.

Yea Workflows in general have a lot of diverse use-cases, so capacity planning can be challenging. Configurations that are ideal for short Workflows are not necessarily ideal for long Workflows, etc.

Archiving should be asynchronous, I think

Archiving is asynchronous. The entire Controller is async, it's all goroutines.

Archiving goes slowly, or even gets stuck, after argo_archived_workflows reached ~250G.

This sounds like it might be getting CPU starved? Without detailed metrics etc it's pretty hard to dive into details.

It also sounds a bit like #11948, which was fixed in 3.4.14 and later. Not entirely the same though from the description (you have an etcd OOM vs a Controller OOM and your archive is growing vs your live Workflows).

  • [x] I can confirm the issue exists when I tested with :latest

v3.4.10

You also checked this box, but are not on latest. Please fill out the issue template accurately, those questions are asked for very good reasons.

We wrote a cronjob that daily deletes rows from argo_archived_workflows for workflows finished weeks ago, to prevent archiving from getting stuck.

You can use archiveTTL for this as a built-in option.
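
For reference, a minimal sketch of where that option lives in the workflow-controller-configmap (the value is just an example); the controller itself then garbage-collects archived rows older than the TTL:

persistence:
  archive: true
  archiveTTL: 7d    # archived workflows older than this are deleted by the controller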

It works but does not save etcd from stress.

If you're creating as many Workflows as you're deleting, that sounds possible. Again, you didn't provide metrics, but those would be ideal to track when doing any sort of performance tuning.

A rapid/sudden change leads to etcd becoming unstable, in my experience.

Are you on a managed k8s control plane provider? E.g. EKS, AKS, GKE, etc? Those try to auto-scale and have pre-set limits, so that can certainly happen. If you're using a self-managed control plane (e.g. kOps), you can vertically scale etcd and the rest of the k8s control plane (as well as horizontally scale to an extent, as etcd is fully consistent and so will eventually hit an upper bound).

We've urged developers to reduce the size of workflow templates from 200K to smaller ones.

I listed this in my previous comment -- nodeStatusOffload can help with this.

leryn1122 commented 3 months ago

Sorry, I was limited by the NDA; I am going to share more details now. Configuration and thresholds have varied over the past months.

You can use archiveTTL for this as a built-in option.

Current archiveTTL is 7d.

Standalone MySQL instance quota: 60-80G memory and a local NVMe disk.

When the size of archived workflows within 30-45 days reached ~250G, queries and writes on the argo_archived_workflows table became slow. A single SQL statement deleting workflows took 2-3 minutes. We tried table indexes and MySQL hints, but with no evident effect. So I rebuilt MySQL and added the hack cronjob mentioned before, and now it runs stably.

Are you on a managed k8s control plane provider? E.g. EKS, AKS, GKE, etc? Those try to auto-scale and have pre-set limits, so that can certainly happen. If you're using a self-managed control plane (e.g. kOps), you can vertically scale etcd and the rest of the k8s control plane (as well as horizontally scale to an extent, as etcd is fully consistent and so will eventually hit an upper bound).

Self-managed cluster:

We now run etcd compaction and defragmentation regularly, triggered by the DB size metrics. That is a hack.
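
For what it's worth, the compaction half of that can be made automatic by etcd itself; a minimal sketch of the relevant flags on a self-managed etcd (values are illustrative, and defragmentation still has to be run separately, e.g. etcdctl defrag on a schedule):

# etcd args (illustrative excerpt from a static pod manifest)
- --auto-compaction-mode=periodic
- --auto-compaction-retention=30m       # keep only the last 30 minutes of key history
- --quota-backend-bytes=8589934592      # 8 GiB backend quota (etcd's default is 2 GiB)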

I listed this in my previous comment -- nodeStatusOffload can help with this.

It is enabled. Here is the related config I can expose:

Persistence:

connectionPool:
  maxIdleConns: 100
  maxOpenConns: 0
  connMaxLifetime: 0s
nodeStatusOffLoad: true
archive: true
archiveTTL: 7d

Workflow defaults:

spec:
  ttlStrategy:
    secondsAfterCompletion: 0
    secondsAfterSuccess: 0
    secondsAfterFailure: 0
  podGC:
    strategy: OnPodCompletion
  parallelism: 3

Workflow controller args

args:
  - '--configmap'
  - workflow-controller-configmap
  - '--executor-image'
  - 'xxxxx/argoexec:v3.4.10'
  - '--namespaced'
  - '--workflow-ttl-workers=8'      # 4->8
  - '--pod-cleanup-workers=32'  # 4->32
  - '--workflow-workers=64'        # 32->64
  - '--qps=50'
  - '--kube-api-burst=90'  # 60->90
  - '--kube-api-qps=60'    # 40->60

Executor config

imagePullPolicy: IfNotPresent
resources:
  requests:
    cpu: 10m
    memory: 64Mi
  limits:
    cpu: 1000m
    memory: 512Mi

Below are some redacted etcd and Argo metrics screenshots: the first shows the etcd DB size varying rapidly, and the second shows the count of workflows and pods in the argo namespace over the same period.

[Screenshot 1: etcd DB size over time] [Screenshot 2: workflow and pod counts in the argo namespace]

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

tooptoop4 commented 1 month ago

@leryn1122 can u include graph of apiserver_storage_objects{resource="events"} ?

i'm facing same issue and raised https://github.com/argoproj/argo-workflows/issues/13042 + https://github.com/argoproj/argo-workflows/issues/13089

i wonder if ARGO_PROGRESS_PATCH_TICK_DURATION as 0 will help to make less PATCH events too

are u setting https://github.com/argoproj/argo-workflows/blob/026b14ea418ccd98025a1343fca463ca58b1bef0/docs/workflow-controller-configmap.yaml#L36-L42 to false too?
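
For context, both of those are set via the workflow-controller-configmap; a minimal sketch, assuming the documented executor.env and nodeEvents keys (values illustrative):

executor:
  env:
    - name: ARGO_PROGRESS_PATCH_TICK_DURATION
      value: "0"              # per the suggestion above, 0 should stop the periodic progress PATCHes
nodeEvents:
  enabled: false              # stop emitting a Kubernetes Event for every node phase change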

leryn1122 commented 2 weeks ago

@leryn1122 can u include graph of apiserver_storage_objects{resource="events"} ?

i'm facing same issue and raised #13042 + #13089

i wonder if ARGO_PROGRESS_PATCH_TICK_DURATION as 0 will help to make less PATCH events too

are u setting

https://github.com/argoproj/argo-workflows/blob/026b14ea418ccd98025a1343fca463ca58b1bef0/docs/workflow-controller-configmap.yaml#L36-L42

to false too?

  1. apiserver_storage_objects{resource="events"} ranges around 90k ~ 15k, with a maximum of 30k+; the current cluster is only used to run Argo workflows.
  2. nodeEvents is enabled.
  3. I wrote an etcd-jdbc tool, which shows that:
    • workflows.argoproj.io objects are patched frequently as workflow status changes, so their etcd revisions grow rapidly, e.g. a single workflow has 370+ versions.
    • The count of workflowtaskresult.argoproj.io objects also increases rapidly; on a test Argo cluster I was tuning, it reached 35k+ entries.

Possible solutions (they work for my team for now, but are not guaranteed to be a good fit for you):

agilgur5 commented 2 weeks ago

Oh I forgot to mention earlier, there is also the environment variable ALWAYS_OFFLOAD_NODE_STATUS that could help in this scenario as well
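
A minimal sketch of where that would go, as an environment variable on the workflow-controller Deployment's container (it relies on persistence/nodeStatusOffLoad being configured, which it already is here):

# workflow-controller Deployment container (illustrative excerpt)
env:
  - name: ALWAYS_OFFLOAD_NODE_STATUS
    value: "true"    # always offload node status to the persistence DB, not only when the Workflow object grows too large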

tooptoop4 commented 2 weeks ago

@leryn1122 can u see what exactly is being changed on workflows.argoproj.io/workflowtaskresult.argoproj.io ? also is it every 10 seconds?