argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Offload TaskResultsCompletionStatus from etcd to db or use compression to allow large workflows (~100k) #13783

Open Hnatekmar opened 1 month ago

Hnatekmar commented 1 month ago

Summary

I am currently evaluating argo-workflows as a go-to solution for scheduling tasks at my company. So far we really like it feature-wise and we think it is a really good fit 👍
The problem is that the number of tasks is expected to be around 100k per workflow, and so far I haven't managed to persuade Argo to do that.

From what I've observed, there is a limitation imposed by the maximum size of an object in etcd, which is around 1.5 MB. In my testing this can be reproduced with the following workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i
spec:
  podGC:
    strategy: OnPodSuccess
    deleteDelayDuration: 0s
  entrypoint: e
  templates:
  - name: c
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: e1
    steps:
      #@ for i in range(100):
      - - name: #@ "message" + str(i)
          template: c
          arguments:
            parameters:
              - name: message
                value: #@ "istep-" + str(i)
      #@ end
  - name: e
    dag:
      tasks:
    #@ for i in range(1000):
        - name: #@ "Step" + str(i)
          template: e1
    #@ end

You can use it with ytt -f <manifest_name> | kubectl create -f - -n <argo_namespace>. This workflow gets stuck at around the 19177/20177 mark.

When I look at the content of the Workflow manifest, it holds the state of each job, listed like this:

      this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i-4292625616: true
      this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i-4293120823: true
      this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i-4293149953: true
      this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i-4293305504: true
      this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i-4294093307: true
      this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i-4294368260: true
      this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i-4294498843: true

The size of the workflow manifest also roughly matches the etcd limit:

$ kubectl get workflow -n argo-workflows -o yaml  |  wc -c
 1694314

Also, when I decrease the size of the prefix I am able to schedule more jobs (around 80k with a single-character prefix).
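
A back-of-the-envelope check of these numbers, as a minimal Python sketch. It assumes each entry of the map is serialized as compact JSON (`"<workflow-name>-<hash>":true,`), that the hash has 10 digits as in the listing above, and that the map alone has to fit into the roughly 1.5 MB limit. The helper names are hypothetical, and the real object carries other fields too, so treat the result as an order-of-magnitude estimate.

```python
# Rough estimate only. Assumptions (not taken from Argo's source code):
#   - each entry is stored as compact JSON: "<name>-<hash>":true,
#   - the hash suffix is 10 digits long, as in the listing above
#   - the limit is the ~1.5 MB figure quoted in this issue
ETCD_LIMIT_BYTES = int(1.5 * 1024 * 1024)


def entry_bytes(workflow_name: str, hash_digits: int = 10) -> int:
    """Bytes used by one map entry for the given workflow name."""
    key_len = len(workflow_name) + 1 + hash_digits  # "<name>-<hash>"
    return key_len + len('"":true,')  # two quotes, colon, value, comma


def max_entries(workflow_name: str) -> int:
    """How many entries fit if the map alone fills the limit."""
    return ETCD_LIMIT_BYTES // entry_bytes(workflow_name)


long_name = "this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i"
print(entry_bytes(long_name), max_entries(long_name))  # 79 19909
print(entry_bytes("a"), max_entries("a"))              # 20 78643
```

Under these assumptions the long name allows roughly 19.9k entries and a one-character name roughly 78.6k, which is in the same range as the ~19k and ~80k observed above.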

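What I am proposing is to offload TaskResultsCompletionStatus from etcd to a database, or to compress it, so that large workflows (~100k tasks) become possible.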

Here is my current configuration for argo-workflows: https://github.com/Hnatekmar/kubernetes/blob/a09391109103d5ff9036eed85fd05577fff1c654/manifests/applications/argo-workflows.yaml
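
To gauge the compression option named in the title, here is a small Python sketch that builds a synthetic map with 20,000 entries shaped like the ones listed above and gzips its JSON form. It is only an illustration: the `crc32` call is a stand-in for producing varied 32-bit suffixes and is not how Argo derives node IDs, and gzip is an assumed choice of codec, not something the controller is known to apply to this field.

```python
import gzip
import json
import zlib

name = "this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i"

# Synthetic stand-in for the completion-status map: "<name>-<number>" -> true.
# crc32 only provides varied 32-bit suffixes; it is NOT Argo's hashing scheme.
status = {f"{name}-{zlib.crc32(str(i).encode())}": True for i in range(20000)}

# Compact JSON, then gzip (assumed codec for this illustration).
raw = json.dumps(status, separators=(",", ":")).encode()
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")

# The round trip must be lossless for this to be usable.
restored = json.loads(gzip.decompress(compressed))
```

Because every key shares the same long prefix, the map is highly repetitive and should shrink considerably. The exact ratio depends on the real key distribution, so it is worth measuring on an actual workflow object.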

Use Cases

When scheduling 100k or more jobs


Message from the maintainers:

Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.