argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Retry doesn't respect dag parallelism #10137

Open washcycle opened 1 year ago

washcycle commented 1 year ago

What happened/what you expected to happen?

When a DAG with parallelism set fails and the user retries it in the web GUI, the parallelism limit is no longer respected.

I expect retries to respect the DAG's parallelism limit.

A non-ideal workaround is to also specify parallelism at the workflow spec level:

spec:
  entrypoint: A
  parallelism: 2
  templates:
  - name: A
    parallelism: 2
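For context, spec.parallelism caps the total number of pods running concurrently across the whole workflow, whereas template-level parallelism only limits the children of that one template; the workaround presumably helps because the workflow-level cap is still enforced on the retried nodes even when the template-level limit is not.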

Version

3.4.3

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallelism-nested-dag-              
spec:
  entrypoint: A
  templates:
  - name: A
    parallelism: 2
    dag:
      tasks:
      - name: b1
        template: B
        arguments:
          parameters:
          - name: msg
            value: "1"
      - name: b2
        template: B
        depends: "b1"
        arguments:
          parameters:
          - name: msg
            value: "2"
      - name: b3
        template: B
        depends: "b1"
        arguments:
          parameters:
          - name: msg
            value: "3"
      - name: b4
        template: B
        depends: "b1"
        arguments:
          parameters:
          - name: msg
            value: "4"
      - name: b5
        template: B
        depends: "b2 && b3 && b4"
        arguments:
          parameters:
          - name: msg
            value: "5"

  - name: B
    inputs:
      parameters:
      - name: msg
    dag:
      tasks:
      - name: c1
        template: one-job
        arguments:
          parameters:
          - name: msg
            value: "{{inputs.parameters.msg}} c1"
      - name: c2
        template: one-job
        depends: "c1"
        arguments:
          parameters:
          - name: msg
            value: "{{inputs.parameters.msg}} c2"
      - name: c3
        template: bad-job
        depends: "c1"
        arguments:
          parameters:
          - name: msg
            value: "{{inputs.parameters.msg}} c3"
      - name: c4
        template: bad-job
        depends: "c1"
        arguments:
          parameters:
          - name: msg
value: "{{inputs.parameters.msg}} c4"
      - name: c5
        template: bad-job
        depends: "c1"
        arguments:
          parameters:
          - name: msg
value: "{{inputs.parameters.msg}} c5"
      - name: c6
        template: bad-job
        depends: "c1"
        arguments:
          parameters:
          - name: msg
value: "{{inputs.parameters.msg}} c6"

  - name: one-job
    inputs:
      parameters:
      - name: msg
    container:
      image: alpine
      command: ['/bin/sh', '-c']
      args: ["echo {{inputs.parameters.msg}}; sleep 10"]
    metadata:
      labels: 
        "aadpodidbinding": "airflow-prod-identity"         

  - name: bad-job
    inputs:
      parameters:
      - name: msg
    container:
      image: python
      command: ["python", "-c"]
      # always fail with exit code 1
      args: ["import random; import sys; exit_code = 1; sys.exit(exit_code)"]
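
To reproduce without the GUI, a minimal sketch (assuming the argo CLI is installed and pointed at the right namespace, and the manifest above is saved as parallelism-nested-dag.yaml; the file name is only illustrative, and argo retry is assumed to exercise the same retry path as the web GUI button):

# submit and wait for the bad-job tasks to fail
argo submit --watch parallelism-nested-dag.yaml

# retry the failed workflow, then observe how many pods start at once
argo retry parallelism-nested-dag-<generated-suffix>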

Logs from the workflow controller

I1129 16:43:13.691421       1 round_trippers.go:553] GET https://10.0.0.1:443/apis/argoproj.io/v1alpha1/namespaces/argo/workflows/parallelism-bug-dag-s7pff 200 OK in 5 milliseconds
I1129 16:43:13.693389       1 round_trippers.go:553] GET https://10.0.0.1:443/apis/argoproj.io/v1alpha1/namespaces/argo/workflows?fieldSelector=metadata.name%3Dparallelism-bug-dag-s7pff&labelSelector=%21workflows.argoproj.io%2Fcontroller-instanceid&watch=true 200 OK in 1 milliseconds
time="2022-11-29T16:43:13.694Z" level=debug msg="Sending workflow event" phase=Running type=ADDED workflow=parallelism-bug-dag-s7pff
time="2022-11-29T16:43:17.142Z" level=debug msg="Sending workflow event" phase=Running type=MODIFIED workflow=parallelism-bug-dag-s7pff
time="2022-11-29T16:43:27.161Z" level=debug msg="Sending workflow event" phase=Failed type=MODIFIED workflow=parallelism-bug-dag-s7pff
time="2022-11-29T16:43:28.495Z" level=debug msg="Sending workflow event" phase=Failed type=MODIFIED workflow=parallelism-bug-dag-s7pff
time="2022-11-29T16:44:07.480Z" level=debug msg="Sending workflow event" phase=Running type=MODIFIED workflow=parallelism-bug-dag-s7pff
time="2022-11-29T16:44:07.598Z" level=debug msg="Sending workflow event" phase=Running type=MODIFIED workflow=parallelism-bug-dag-s7pff
time="2022-11-29T16:44:17.500Z" level=debug msg="Sending workflow event" phase=Running type=MODIFIED workflow=parallelism-bug-dag-s7pff
time="2022-11-29T16:44:27.555Z" level=debug msg="Sending workflow event" phase=Failed type=MODIFIED workflow=parallelism-bug-dag-s7pff
time="2022-11-29T16:44:28.921Z" level=debug msg="Sending workflow event" phase=Failed type=MODIFIED workflow=parallelism-bug-dag-s7pff

Logs from the workflow's wait container

time="2022-11-29T16:44:10.935Z" level=info msg="Starting Workflow Executor" version=v3.4.3
time="2022-11-29T16:44:11.009Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2022-11-29T16:44:11.009Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo podName=parallelism-bug-dag-s7pff-bad-job-2700523482 template="{\"name\":\"bad-job\",\"inputs\":{\"parameters\":[{\"name\":\"msg\",\"value\":\"1 c3\"}]},\"outputs\":{},\"container\":{\"name\":\"\",\"image\":\"python\",\"command\":[\"python\",\"-c\"],\"args\":[\"import random; import sys; exit_code = 1; sys.exit(exit_code)\"],\"resources\":{}},\"archiveLocation\":{\"archiveLogs\":true,\"azure\":{\"endpoint\":\"https://xxxxxxx.blob.core.windows.net\",\"container\":\"logs\",\"useSDKCreds\":true,\"blob\":\"parallelism-bug-dag-s7pff/parallelism-bug-dag-s7pff-bad-job-2700523482\"}}}" version="&Version{Version:v3.4.3,BuildDate:2022-10-31T05:40:15Z,GitCommit:eddb1b78407adc72c08b4ed6be8f52f2a1f1316a,GitTag:v3.4.3,GitTreeState:clean,GoVersion:go1.18.7,Compiler:gc,Platform:linux/amd64,}"
time="2022-11-29T16:44:11.009Z" level=info msg="Starting deadline monitor"
time="2022-11-29T16:44:13.010Z" level=info msg="Main container completed" error="<nil>"
time="2022-11-29T16:44:13.010Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2022-11-29T16:44:13.010Z" level=info msg="No output parameters"
time="2022-11-29T16:44:13.010Z" level=info msg="No output artifacts"
time="2022-11-29T16:44:13.010Z" level=info msg="Saving to Azure Blob Storage" blob=parallelism-bug-dag-s7pff/parallelism-bug-dag-s7pff-bad-job-2700523482/main.log container=logs endpoint="https://xxxxxx.blob.core.windows.net"
time="2022-11-29T16:44:13.130Z" level=info msg="Save artifact" artifactName=main-logs duration=120.662454ms error="<nil>" key=parallelism-bug-dag-s7pff/parallelism-bug-dag-s7pff-bad-job-2700523482/main.log
time="2022-11-29T16:44:13.130Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2022-11-29T16:44:13.130Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2022-11-29T16:44:13.159Z" level=info msg="Create workflowtaskresults 201"
time="2022-11-29T16:44:13.160Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2022-11-29T16:44:13.160Z" level=info msg="Deadline monitor stopped"
time="2022-11-29T16:44:13.160Z" level=info msg="Alloc=9471 TotalAlloc=15381 Sys=23762 NumGC=4 Goroutines=11"
sarabala1979 commented 1 year ago

@washcycle Can you provide more details about the scenario? A GUI retry restarts the workflow from the failed step or task. Parallelism controls the number of tasks that run in parallel during workflow execution.

washcycle commented 1 year ago

What happens is that the template's parallelism setting seems to be ignored when using the GUI retry. If the workflow had five failed tasks, instead of retrying two at a time it retries all five at the same time.
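
A quick way to observe this is to watch the pods during the retry; a sketch, assuming the standard workflows.argoproj.io/workflow label that Argo applies to the pods it creates:

kubectl get pods -l workflows.argoproj.io/workflow=<workflow-name> --watch

With the parallelism limit honoured, only two of the failed tasks should be running at any time; after a GUI retry they all enter Pending/Running at once.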

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
