argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15.04k stars 3.2k forks source link

Indirect infinite recursion doesn't get prevented #11499

Closed Joibel closed 1 year ago

Joibel commented 1 year ago

Pre-requisites

What happened/what you expected to happen?

With a workflow which recurses the controller will attempt to recurse forever. Take the recursion example from the documentaton and make it always coinflip for an example. It feels like this should be prevented by having a configurable maximum recursion depth.

Version

Current master

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: recursive2-
spec:
  entrypoint: coinflip
  templates:
  - name: coinflip
    steps:
    # flip a coin
    - - name: flip-coin
        template: flip-coin
    # evaluate the result in parallel
    - - name: heads
        template: coinflip              # call heads template if "heads"
        when: "{{steps.flip-coin.outputs.result}} == heads"
      - name: tails                     # keep flipping coins if "tails"
        template: coinflip
        when: "{{steps.flip-coin.outputs.result}} == tails"

  - name: coinflip
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import random
        result = "heads" if random.randint(0,1) == 0 else "tails"
        print(result)

  - name: heads
    container:
      image: alpine:3.6
      command: [sh, -c]
      args: ["echo \"it was heads\""]

Logs from the workflow controller

Not interesting

Logs from in your workflow's wait container

Not relevant
terrytangyuan commented 1 year ago

Perhaps some additional checks can be specified in the when condition?

Joibel commented 1 year ago

Perhaps some additional checks can be specified in the when condition?

The actual workflow here isn't the problem, it's an obvious case which will recurse indefinitely. I'm really looking at trying to stop much more complex workflows where the recursion isn't obvious from inspection, but the outcome of chewing up resources is undesirable.

4180 has thought about this problem before and a fix wasn't agreed on, but the basic problem does remain and does happen in the real world.

jessesuen commented 1 year ago

~I'm pretty sure there is a depth check to prevent recursing down templates forever. It might not be configurable but I recall it being added.~

EDIT: nevermind I was misremembering WorkflowTemplate resolution for having infinite loop resolution.

terrytangyuan commented 1 year ago

@alexec Any context around closing https://github.com/argoproj/argo-workflows/pull/4193 and https://github.com/argoproj/argo-workflows/issues/4180? Was it a matter of prioritization? Now that we have more contributors willing to contribute, we may want to revisit.

terrytangyuan commented 1 year ago

@jessesuen Yes the existing check is only for nested workflow templates.

Joibel commented 1 year ago

@terrytangyuan, @alexc - Would a resurrection of an updated #4193 be acceptable as a resolution to this? I can't find the check for nested workflow templates - can someone point me at it?

JPZ13 commented 1 year ago

@terrytangyuan @alexec - did y'all have a chance to see @Joibel's message above? I'd also nominate this to be a P2 or a P1 given that #4180 was a P1 and a P2 before being closed

terrytangyuan commented 1 year ago

Yeah we should at least provide a configuration to prevent this. I don't know why the PR got closed. Alex would have more context.

I can't find the check for nested workflow templates - can someone point me at it?

@Joibel You removed it in https://github.com/argoproj/argo-workflows/pull/10785. What was the reason behind it?

Joibel commented 1 year ago

Yeah we should at least provide a configuration to prevent this. I don't know why the PR got closed. Alex would have more context.

I can't find the check for nested workflow templates - can someone point me at it?

@Joibel You removed it in #10785. What was the reason behind it?

I removed it because it didn't do anything any more, and I didn't realise that we still wanted it.

depth was always zero as it was passed as a literal 0 only.

Joibel commented 1 year ago

I'd be happy to revive that implementation and fix it if that's preferred over #11646

agilgur5 commented 1 year ago

Closing this out now that #11646 has been merged