leryn1122 opened this issue 8 months ago
> We run ~500 workflows and ~500 pods concurrently

So ~2500 concurrent Pods total?
> It is expected that argo workflows do not flood into etcd or impact the stability of the whole cluster.

For many Workflows and large Workflows, it may indeed stress the k8s API and etcd. That's not really an Argo limitation; that's how k8s works with shared control plane resources.
There are a few features you may want to use that are well documented:

- `DEFAULT_REQUEUE_TIME`, per https://argo-workflows.readthedocs.io/en/latest/running-at-massive-scale/#overwhelmed-kubernetes-api
- `--qps` and `--burst`, per https://argo-workflows.readthedocs.io/en/latest/scaling/#k8s-api-client-side-rate-limiting
- `--workflow-ttl-workers` and `--pod-cleanup-workers`, per https://argo-workflows.readthedocs.io/en/latest/scaling/#adding-goroutines-to-increase-concurrencyparallelism
- mutexes and semaphores to limit the number of concurrent Workflows and tasks, per https://argo-workflows.readthedocs.io/en/latest/synchronization/
- `nodeStatusOffload` to move the `status` subresource data out of etcd and into a separate DB, per https://argo-workflows.readthedocs.io/en/latest/offloading-large-workflows/
- `ALWAYS_OFFLOAD_NODE_STATUS=true` to do this for all Workflows, not just those >1MB, per https://argo-workflows.readthedocs.io/en/latest/environment-variables/

EDIT: Some less documented options include:

- `nodeEvents` in the ConfigMap to save space in etcd (at the cost of less available tracking): https://github.com/argoproj/argo-workflows/blob/026b14ea418ccd98025a1343fca463ca58b1bef0/docs/workflow-controller-configmap.yaml#L36-L42
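To make the first set concrete, here is a minimal, illustrative sketch of where these knobs live, assuming the standard workflow-controller Deployment; the flag and env names come from the docs linked above, and the values are examples rather than recommendations:

```yaml
# Illustrative workflow-controller Deployment excerpt; tune values for your own cluster
env:
  - name: DEFAULT_REQUEUE_TIME      # raise to reduce requeue churn against the k8s API
    value: 1m
args:
  - --qps=30                        # client-side rate limit for k8s API requests
  - --burst=60
  - --workflow-ttl-workers=8        # goroutines handling TTL-based Workflow deletion
  - --pod-cleanup-workers=32        # goroutines handling Pod GC
```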
Status:

Recent Efforts:

- `argo_archived_workflows` reached ~250G. We wrote a cronjob to delete rows from `argo_archived_workflows` daily for workflows that finished weeks ago, to prevent archiving from getting stuck (see the sketch after this list).
- `--workflow-ttl-workers` and `--pod-cleanup-workers`: we attempted to modify these. It works but does not save etcd from stress.
- Workflows behave differently depending on the business case. Some finish in several seconds, while some run for a few hours.
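For context, here is a hedged sketch of the kind of daily cleanup CronJob described above, assuming a MySQL archive with the standard `finishedat` column; the Secret name, image, and retention window are hypothetical:

```yaml
# Hypothetical example of the cleanup job described above; adjust names, credentials, and retention
apiVersion: batch/v1
kind: CronJob
metadata:
  name: argo-archive-cleanup
spec:
  schedule: "0 3 * * *"                     # once a day
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: mysql:8.0
              envFrom:
                - secretRef:
                    name: argo-archive-db   # hypothetical Secret holding MYSQL_HOST/USER/PASSWORD/DB
              command: ["sh", "-c"]
              args:
                - >
                  mysql -h "$MYSQL_HOST" -u "$MYSQL_USER" -p"$MYSQL_PASSWORD" "$MYSQL_DB"
                  -e "DELETE FROM argo_archived_workflows
                  WHERE finishedat < NOW() - INTERVAL 21 DAY
                  LIMIT 10000;"
```

As noted further down in the thread, `archiveTTL` is the built-in alternative to this kind of job.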
Yea Workflows in general have a lot of diverse use-cases, so capacity planning can be challenging. Configurations that are ideal for short Workflows are not necessarily ideal for long Workflows, etc.
> Arching should be asynchronized I think [sic]
Archiving is asynchronous. The entire Controller is async, it's all goroutines.
> Archiving goes slowly, or even gets stuck, after `argo_archived_workflows` reached ~250G.
This sounds like it might be getting CPU starved? Without detailed metrics etc it's pretty hard to dive into details.
It also sounds a bit like #11948, which was fixed in 3.4.14 and later. Not entirely the same though from the description (you have an etcd OOM vs a Controller OOM and your archive is growing vs your live Workflows).
> - [x] I can confirm the issue exists when I tested with `:latest`
>
> v3.4.10

You also checked this box, but are not on `:latest`. Please fill out the issue template accurately; those questions are asked for very good reasons.
> We wrote a cronjob to delete rows from `argo_archived_workflows` daily for workflows that finished weeks ago, to prevent archiving from getting stuck.

You can use `archiveTTL` for this as a built-in option.
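A minimal sketch of that setting, assuming the standard `persistence` block of the workflow-controller ConfigMap (the retention value is only illustrative):

```yaml
persistence:
  archive: true
  archiveTTL: 7d   # archived rows older than this are garbage-collected by the controller
```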
> It works but does not save etcd from stress.
If you're creating as many Workflows as you're deleting, that sounds possible. Again, you didn't provide metrics, but those would be ideal to track when doing any sort of performance tuning.
> A rapid/sudden change leads to etcd becoming unstable, in my experience.
Are you on a managed k8s control plane provider? E.g. EKS, AKS, GKE, etc? Those try to auto-scale and have pre-set limits, so that can certainly happen. If you're using a self-managed control plane (e.g. kOps), you can vertically scale etcd and the rest of the k8s control plane (as well as horizontally scale to an extent, as etcd is fully consistent and so will eventually hit an upper bound).
> We've urged developpers to reduces the size of workflow template from 200K to smaller ones. [sic]

I listed this in my previous comment -- `nodeStatusOffload` can help with this.
Sorry, I was limited by an NDA; I am going to share more details now. Configuration and thresholds have varied over the past months.
> You can use `archiveTTL` for this as a built-in option.

The current `archiveTTL` is 7d.
Standalone MySQL instance quota: 60-80G memory and a local NVMe disk.
When the size of archived workflows for the last 30-45 days reached ~250G, queries and writes on the `argo_archived_workflows` table became slow; a single SQL statement deleting workflows took 2-3 minutes. We tried tuning the table index and MySQL hints, but there was no evident effect. So I rebuilt MySQL and added the hacking cronjob mentioned before, and now it runs stably.
> Are you on a managed k8s control plane provider? E.g. EKS, AKS, GKE, etc? Those try to auto-scale and have pre-set limits, so that can certainly happen. If you're using a self-managed control plane (e.g. kOps), you can vertically scale etcd and the rest of the k8s control plane (as well as horizontally scale to an extent, as etcd is fully consistent and so will eventually hit an upper bound).

Self-managed cluster.
We have now enabled regular etcd compaction and compression, triggered by the DB size metrics. That is a hack.
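For reference, etcd also supports periodic auto-compaction natively, which may be less hacky than triggering it from metrics; a hedged sketch of the relevant flags in an etcd static pod manifest (values illustrative, and note that compaction reclaims old revisions but the DB file only shrinks after a separate defragmentation):

```yaml
# Illustrative etcd static pod excerpt (not from this cluster); flags per upstream etcd
spec:
  containers:
    - name: etcd
      command:
        - etcd
        - --auto-compaction-mode=periodic
        - --auto-compaction-retention=1h      # compact key revisions older than 1 hour
        - --quota-backend-bytes=8589934592    # 8 GiB backend quota, matching the 8G limit mentioned in this issue
```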
> I listed this in my previous comment -- `nodeStatusOffload` can help with this.
It is enabled. Related config I can expose:

Persistence:

```yaml
connectionPool:
  maxIdleConns: 100
  maxOpenConns: 0
  connMaxLifetime: 0s
nodeStatusOffLoad: true
archive: true
archiveTTL: 7d
```
Workflow defaults:

```yaml
spec:
  ttlStrategy:
    secondsAfterCompletion: 0
    secondsAfterSuccess: 0
    secondsAfterFailure: 0
  podGC:
    strategy: OnPodCompletion
  parallelism: 3
```

Workflow controller args:

```yaml
args:
  - '--configmap'
  - workflow-controller-configmap
  - '--executor-image'
  - 'xxxxx/argoexec:v3.4.10'
  - '--namespaced'
  - '--workflow-ttl-workers=8'   # 4 -> 8
  - '--pod-cleanup-workers=32'   # 4 -> 32
  - '--workflow-workers=64'      # 32 -> 64
  - '--qps=50'
  - '--kube-api-burst=90'        # 60 -> 90
  - '--kube-api-qps=60'          # 40 -> 60
```
Executor config:

```yaml
imagePullPolicy: IfNotPresent
resources:
  requests:
    cpu: 10m
    memory: 64Mi
  limits:
    cpu: 1000m
    memory: 512Mi
```
Here are some redacted etcd and Argo metrics screenshots: the first shows the etcd DB size varying rapidly, and the second shows the count of Workflows and Pods in the argo namespace over the same period.
This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.
@leryn1122 can u include a graph of `apiserver_storage_objects{resource="events"}`?

i'm facing the same issue and raised https://github.com/argoproj/argo-workflows/issues/13042 + https://github.com/argoproj/argo-workflows/issues/13089

i wonder if setting `ARGO_PROGRESS_PATCH_TICK_DURATION` to 0 will help produce fewer PATCH events too

are u setting https://github.com/argoproj/argo-workflows/blob/026b14ea418ccd98025a1343fca463ca58b1bef0/docs/workflow-controller-configmap.yaml#L36-L42 to false too?
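For reference, the ConfigMap lines linked above correspond to the `nodeEvents` setting; a minimal sketch of turning it off:

```yaml
# workflow-controller-configmap excerpt: stop emitting node Events to save etcd space
nodeEvents:
  enabled: false
```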
> @leryn1122 can u include a graph of `apiserver_storage_objects{resource="events"}`?
>
> i'm facing the same issue and raised #13042 + #13089
>
> i wonder if setting `ARGO_PROGRESS_PATCH_TICK_DURATION` to 0 will help produce fewer PATCH events too
>
> are u setting https://github.com/argoproj/argo-workflows/blob/026b14ea418ccd98025a1343fca463ca58b1bef0/docs/workflow-controller-configmap.yaml#L36-L42 to false too?
- `apiserver_storage_objects{resource="events"}` ranges from 90k ~ 15k, with a maximum of 30k+, while the current cluster is only used to run argo workflows.
- `nodeEvents` is enabled.
- `workflows.argoproj.io` objects are patched frequently as workflow status changes, so their etcd revisions grow rapidly, e.g. a single workflow had 370+ versions.
- `workflowtaskresult.argoproj.io` objects also increase rapidly; a test argo cluster where I was tuning had 35k+ entries.

Possible solutions: it works for my team for now; it is not guaranteed to be a nice solution for you.
Oh, I forgot to mention earlier: there is also the environment variable `ALWAYS_OFFLOAD_NODE_STATUS` that could help in this scenario as well.
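A minimal sketch of setting it, assuming the standard workflow-controller Deployment manifest:

```yaml
# workflow-controller Deployment excerpt
env:
  - name: ALWAYS_OFFLOAD_NODE_STATUS   # offload node status to the persistence DB for every Workflow
    value: "true"
```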
@leryn1122 can u see what exactly is being changed on `workflows.argoproj.io`/`workflowtaskresult.argoproj.io`? also is it every 10 seconds?
Pre-requisites

- [x] I can confirm the issue exists when I tested with `:latest`
What happened/what did you expect to happen?
We run ~500 workflows and ~500 pods concurrently as offline tasks in our prod environment. etcd filled up rapidly at its 8G size limit, which made etcd and the apiserver unavailable and caused the argo workflow controller to auto-restart frequently. Based on monitoring and metrics, our team concluded that etcd and the apiserver may become unavailable when running and pending workflows flood into etcd.
For now, the team’s solutions are:
It is expected that argo workflows do not flood into etcd or impact the stability of the whole cluster.
Version
v3.4.10
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container