argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15.08k stars 3.2k forks source link

Proposal: Count-based Enhanced Depends Logic #3171

Open terrytangyuan opened 4 years ago

terrytangyuan commented 4 years ago

Summary

Support count-based enhanced depends logic in addition to existing depends logic based on dependent tasks and their statuses as outlined in the Enhanced Depends Logic document that has been introduced after v2.9.

Motivation

Currently, there exists a new called depends that allows users to specify dependent tasks, their statuses, as well as any complex boolean logic. The field is a string field and the syntax is expression-like with operands having the form <task-name>.<task-result>. Examples include task-1.Suceeded, task-2.Failed, and task-3.Damenoed. More details on this existing functionality can be found in the Enhanced Depends Logic document.

However, this requires users to specify task names explicitly which may be difficult or unnecessary to obtain, for example:

For example, some users might want to trigger the downstream task when N out of M dependent tasks have succeeded statuses. They may use Argo for use cases where specific statuses of the dependent tasks are not important.

Proposal

Extend the current depends field to be able to take complex comparison logic with operands having the form <task-result> without task names.

For example, the following depends string indicates that we only execute downstream tasks when at least 3 tasks succeeded and no task failed:

depends: "Succeeded > 3 && Failed == 0"

Another example is to have a built-in N in the implementation that represents the total number of dependent tasks so users do not need to know the total number of dependent tasks beforehand.

A simple case is only executing downstream tasks when all dependent tasks succeeded:

depends: "Succeeded == N"

Users should be able to do simple calculations like the following:

depends: "Succeeded > N - 10 && Failed < N"

The full list of available task results and is as follows:

Task Result Description
Succeeded Task Succeeded
Failed Task Failed
Errored Task Errored
Skipped Task Skipped
Completed Task Succeded or Failed
Daemoned Task is Daemoned and is not Pending

Full boolean logic is available:

The full list of available math operators that we plan to support initially include:

Compatibility

Because of the added control in depends is based on count. dag.task.continueOn is not available when using it. It is not possible to use both dependencies and depends in the same task group. Furthermore, users should only be able to use one of the two types of enhanced depends logic: count-based or task-name-based.


Message from the maintainers:

If you wish to see this enhancement implemented please add a πŸ‘ reaction to this issue! We often sort issues this way to know what to prioritize.

terrytangyuan commented 4 years ago

Opening this issue to discuss the idea before diving into the implementation. I'd like to take a stab at implementing this if others agree this is a good enhancement.

simster7 commented 4 years ago

Interesting suggestion, but I see a couple of problems with the current proposal.

First, the implementation of DAG execution logic is heavily dependent on having clearly-defined dependencies between tasks. This proposal – by its own definition – would not have well-defined dependencies between tasks which would mean that at the very least a large refactor of this core code would be needed, which is something we should approach with caution.

Second, any implementation would have to carefully consider the impotency requirement of the operator. Again, the lack of a clearly defined dependency graph would mean that the execution order of the graph would be determined arbitrarily at runtime. This stochasticity could introduce ambiguity to the DAG execution logic and would require a well-defined decision mechanism to solve. I'm not saying it's not possible, but it is a significant hurdle that needs to be cleared.

Some comments to your points:

There are situations where the tasks are dynamically generated and task names cannot be easily obtained.

This is mostly a non-issue and is largely by design. While it is true that some task names are dynamically generated – mainly expansion tasks when using withParams, withItems, etc – we currently do not support users specifying dependency on those tasks, but only to their parent (i.e. generator) tasks. I don't see why we should for the same problem that you tried to highlight and the points I made above: we need to have a well-defined DAG and we cannot have one if we don't know the names of the tasks before execution.

A better solution to this problem would be to introduce a new Argo Variable that keeps track of the number of succeeded tasks in a DAG. It could be then used to make execution control within when expressions in subsequent tasks or templates.

merlintang commented 4 years ago

does this related to trigger rule in the AirFlow, I know the trigger would be easy to used to decide the runtime of DAG.

terrytangyuan commented 4 years ago

@merlintang Yes that is part of my original motivation.

@simster7 Thank you for your thoughtful response! I agree that if we take the large refactoring and the impotency requirement into account, this may not be a good approach to pursue.

Controlling this in when expressions sounds like a good idea. Do you think a separate proposal is required or should we continue our discussion here if we want to take that approach? I am not familiar with that part of the code yet but will look into more in detail. Any pointers would be helpful as well.

simster7 commented 4 years ago

@terrytangyuan We can continue our discussion here. Here is how I envision the spec:

  - name: echo-thrice
    steps:
    - - name: expand-task
        template: expand-task
        arguments:
          parameters:
          - name: message
            value: "{{item}}"
        withItems: [1,2,3,4,5,6,7,8]
    - - name: may-run
        template: may-run
        when: "{{steps.expand-task.subTasks.Suceeded}} > 3"

A better name and syntax for {{steps.expand-task.subTasks.Suceeded}} is needed. Any suggestions?

You can look at https://github.com/argoproj/argo/pull/2911 for an example on how to make this change.

Looking forward to your PR

Ark-kun commented 4 years ago

TBH, I think adding this feature to Argo's core language would overcomplicate it even more and make it more fragile.

Maybe each of your tasks should put record it's completion in some storage/db and the service that watches/manages that storage/db can trigger the second pipeline stage once the dependencies are satisfied.

simster7 commented 4 years ago

I think adding this feature to Argo's core language would overcomplicate it even more and make it more fragile.

By "this feature" do you mean the original proposal (i.e. depends: "Succeeded > 3") or the Argo Variable approach (i.e. when: "{{steps.expand-task.subTasks.Suceeded}} > 3")?

If you mean the Argo Variable approach, could you please share your thoughts on why you think it would overcomplicate it/make it fragile?

Thanks @Ark-kun

terrytangyuan commented 4 years ago

@Ark-kun Could you elaborate a bit? IMO this should be something offered and handled by Argo. Putting the logic in DB/storage is not non-trivial in large systems and will likely introduce bugs/overheads.

simster7 commented 4 years ago

@terrytangyuan Are you still working on this?

terrytangyuan commented 4 years ago

@simster7 Thanks for checking. One concern with your suggested approach is that we could potentially introduce a new set of variables for various phases since we are not just dealing with successes.

jrowinski3d commented 2 years ago

I would like to gauge interest in this proposal. I have posted my general question here looking for this. https://github.com/argoproj/argo-workflows/discussions/7722#discussioncomment-2098901

davelajoie commented 2 years ago

Hello Jas and team, I agree with Jas (#7722). We will be building dags with 700+ nodes completely as dags, where fan-in and fan-out occurs multiple time in the dag. using ref count approach could definitely solve that.

loheagn commented 2 years ago

Hi, @terrytangyuan, I'd like to apply for this issue for the gsoc. I have sent you an email and look forward to your reply :)

terrytangyuan commented 2 years ago

Hi, @terrytangyuan, I'd like to apply for this issue for the gsoc. I have sent you an email and look forward to your reply :)

Thanks for reaching out but please read this document carefully: https://github.com/argoproj/argo-workflows/blob/master/docs/mentoring.md#how-to-participate-google-summer-of-code

loheagn commented 2 years ago

Hi, @terrytangyuan, I'd like to apply for this issue for the gsoc. I have sent you an email and look forward to your reply :)

Thanks for reaching out but please read this document carefully: https://github.com/argoproj/argo-workflows/blob/master/docs/mentoring.md#how-to-participate-google-summer-of-code

Thanks! I will read it.