Summary

I am currently evaluating argo-workflows as a go-to solution for scheduling tasks at my company. So far we really like it feature-wise and we think it is a really good fit 👍
The problem is that the number of tasks is expected to be around 100k per workflow, and so far I haven't managed to persuade Argo to handle that.
From what I've observed, there is a limit imposed by the maximum size of an entity inside the etcd database, which is around 1.5 MB. In my testing this can be reproduced with the following workflow:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i
spec:
  podGC:
    strategy: OnPodSuccess
    deleteDelayDuration: 0s
  entrypoint: e
  templates:
  - name: c
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: e1
    steps:
    #@ for i in range(100):
    - - name: #@ "message" + str(i)
        template: c
        arguments:
          parameters:
          - name: message
            value: #@ "istep-" + str(i)
    #@ end
  - name: e
    dag:
      tasks:
      #@ for i in range(1000):
      - name: #@ "Step" + str(i)
        template: e1
      #@ end
You can apply it with `ytt -f <manifest_name> | kubectl create -f - -n <argo_namespace>`. This manifest gets stuck at around the 19177/20177 mark.
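For reference, a minimal sketch of how I read that stuck point; as far as I can tell the N/M numbers come from the workflow's `status.progress` field (namespace placeholder and workflow name are the ones used above):

```sh
# Read the workflow's progress counter (the "19177/20177"-style value).
kubectl -n <argo_namespace> get workflow \
  this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i \
  -o jsonpath='{.status.progress}{"\n"}'
```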
When I look at the content of the Workflow manifest, it carries the state of each job inside: every job is listed individually in the workflow's status.
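The growth can also be measured directly. A minimal check, assuming the same placeholder namespace (as far as I can tell the per-job entries live under `status.nodes`):

```sh
# Byte size of the serialized Workflow object, which is roughly what has
# to fit under etcd's ~1.5 MB value limit.
kubectl -n <argo_namespace> get workflow \
  this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i \
  -o json | wc -c
```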
The size of the workflow manifest also roughly correlates with the etcd limit.
Also, when I decrease the size of the name prefix I am able to schedule more jobs (around 80k with a single-character prefix).
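Some back-of-envelope arithmetic of mine is consistent with this. It assumes the ~1.5 MiB etcd value limit is the binding constraint; the derived per-node figures are estimates, not measurements:

```sh
# Divide the etcd value limit by the observed task counts to estimate the
# effective bytes stored per node for each name-prefix length.
echo $(( 1572864 / 19177 ))   # ~82 bytes/node with the long prefix above
echo $(( 1572864 / 80000 ))   # ~19 bytes/node with a single-character prefix
```

The difference is roughly what you would expect if every node entry repeats the workflow name as part of its own name.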
What I am proposing is support for scheduling workflows of this size, for example via node status offloading (`ALWAYS_OFFLOAD_NODE_STATUS`). Here is my current configuration for argo-workflows: https://github.com/Hnatekmar/kubernetes/blob/a09391109103d5ff9036eed85fd05577fff1c654/manifests/applications/argo-workflows.yaml
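For context, this is roughly how node status offloading is configured today as I understand it from the docs: `nodeStatusOffLoad` in the workflow-controller ConfigMap plus a database, with `ALWAYS_OFFLOAD_NODE_STATUS=true` settable as a controller environment variable to force it for every workflow. The database details below are placeholders, not my actual setup:

```yaml
# Sketch of the relevant workflow-controller ConfigMap section
# (placeholder database settings; adapt to your environment).
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  persistence: |
    nodeStatusOffLoad: true
    postgresql:
      host: postgres
      port: 5432
      database: argo
      tableName: argo_workflows
      userNameSecret:
        name: argo-postgres-config
        key: username
      passwordSecret:
        name: argo-postgres-config
        key: password
```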
Use Cases
When scheduling 100k or more jobs
Message from the maintainers:
Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.