Summary

I am currently evaluating argo-workflows as a go-to solution for scheduling tasks at my company. So far we really like it feature-wise and we think it is a really good fit 👍
The problem is that the number of tasks is expected to be around 100k per workflow, and so far I haven't managed to persuade Argo to handle that.

From what I've observed, there is a limitation imposed by the maximum size of an object stored in etcd, which is around 1.5 MB. In my testing this can be reproduced with the following workflow (a ytt template):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i
spec:
  podGC:
    strategy: OnPodSuccess
    deleteDelayDuration: 0s
  entrypoint: e
  templates:
  - name: c
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: e1
    steps:
      #@ for i in range(100):
      - - name: #@ "message" + str(i)
          template: c
          arguments:
            parameters:
              - name: message
                value: #@ "istep-" + str(i)
      #@ end
  - name: e
    dag:
      tasks:
    #@ for i in range(1000):
        - name: #@ "Step" + str(i)
          template: e1
    #@ end

You can render and submit it with ytt -f <manifest_name> | kubectl create -f - -n <argo_namespace>. The workflow gets stuck at around the 19177/20177 mark.

When I look at the content of the Workflow manifest, it holds the state of each job, listed like this:

      this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i-4292625616: true
      this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i-4293120823: true
      this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i-4293149953: true
      this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i-4293305504: true
      this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i-4294093307: true
      this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i-4294368260: true
      this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i-4294498843: true

The size of the workflow manifest also roughly matches the etcd limit:

$ kubectl get workflow -n argo-workflows -o yaml  |  wc -c
 1694314

Also, when I shorten the prefix I am able to schedule more jobs (around 80k with a single-character prefix).
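The two observations above (stuck near 20k with the long name, around 80k with a one-character name) can be sanity-checked with some rough arithmetic. The sketch below is only an estimate under stated assumptions: that the object is stored as compact JSON, that each node ID is the workflow name plus "-" and a hash of up to 10 digits, that the limit is exactly 1.5 MiB, and that this map is the only thing taking up space. The helper names entry_bytes and max_entries are made up for illustration.

```python
# Back-of-the-envelope estimate; every constant here is an assumption.
ETCD_LIMIT_BYTES = int(1.5 * 1024 * 1024)  # assumed ~1.5 MiB object size limit
HASH_SUFFIX_LEN = 10  # assumed: node IDs end in "-" plus up to 10 digits


def entry_bytes(workflow_name: str) -> int:
    """Bytes used by one compact JSON map entry of the form "<name>-<hash>":true,"""
    key_len = len(workflow_name) + 1 + HASH_SUFFIX_LEN
    return key_len + len('"":true,')  # quotes, colon, value and trailing comma


def max_entries(workflow_name: str, limit: int = ETCD_LIMIT_BYTES) -> int:
    """Upper bound on map entries if the map were the only content of the object."""
    return limit // entry_bytes(workflow_name)


long_name = "this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i"
print(max_entries(long_name))  # just under 20k entries
print(max_entries("a"))        # just under 80k entries
```

Under these assumptions the per-entry cost is dominated by the repeated workflow name, which would explain why a shorter name lets more jobs through.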

What I am proposing is:

Change the format of TaskResultsCompletionStatus (https://github.com/argoproj/argo-workflows/blob/c9b1477fd575bf06bed43ca2139f74aa3af4285c/pkg/apis/workflow/v1alpha1/workflow_types.go#L1955) to a single key holding a compressed, base64-encoded string.
Or offload this to the db when offloading is enabled (I don't think anyone will try workflows of this size without ALWAYS_OFFLOAD_NODE_STATUS).
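To give a feel for what the first option could save, here is a minimal sketch that packs a map of node IDs into a single gzip-compressed, base64-encoded string and compares sizes. It is not how Argo implements anything. The functions pack and unpack are hypothetical, the node IDs are randomly generated stand-ins, and real data may compress differently.

```python
import base64
import gzip
import json
import random


def pack(status: dict[str, bool]) -> str:
    """Serialize the map as compact JSON, gzip it, and base64-encode the result."""
    raw = json.dumps(status, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def unpack(blob: str) -> dict[str, bool]:
    """Reverse of pack(): base64-decode, gunzip, and parse the JSON."""
    return json.loads(gzip.decompress(base64.b64decode(blob)))


# Stand-in data: 20,000 fake node IDs with random numeric suffixes (assumption).
rng = random.Random(0)
name = "this-is-extremly-long-prefix-so-i-will-spam-etcd-with-this-i"
status = {f"{name}-{rng.randrange(2**32)}": True for _ in range(20_000)}

plain = json.dumps(status, separators=(",", ":"))
packed = pack(status)

assert unpack(packed) == status  # the round trip preserves the map
print(len(plain), len(packed))   # uncompressed vs packed size in characters
```

Because every key repeats the same long prefix, the packed string should come out far smaller than the plain map, though the hash suffixes themselves are close to random and set a floor on how small it can get.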

Here is my current configuration for argo-workflows https://github.com/Hnatekmar/kubernetes/blob/a09391109103d5ff9036eed85fd05577fff1c654/manifests/applications/argo-workflows.yaml

Use Cases

When scheduling 100k or more jobs

Message from the maintainers:

Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.

argoproj / argo-workflows

Offload TaskResultsCompletionStatus from etcd to db or use compression to allow large workflows (~100k) #13783
