argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15.05k stars 3.2k forks

Artifact "Cycle" Crashes UI #13395

Open andrewm-aero opened 3 months ago

andrewm-aero commented 3 months ago

Pre-requisites

What happened? What did you expect to happen?

When viewing a workflow in the UI that has an "artifact cycle" (that is, an input artifact that is later overwritten as an output artifact of a subsequent step/task), the UI crashes with the following error:

Maximum call stack size exceeded
 Reload this page to try again.

Stack Trace
RangeError: Maximum call stack size exceeded
    at Map.forEach (<anonymous>)
    at e.outgoingEdges (https://<redacted>/main.1e7905924d7399ee8ad2.js:2:1859902)
    at e.visit (https://<redacted>/main.1e7905924d7399ee8ad2.js:2:1841652)
    at https://<redacted>/main.1e7905924d7399ee8ad2.js:2:1841699
    at Array.forEach (<anonymous>)
    at e.visit (https://<redacted>/main.1e7905924d7399ee8ad2.js:2:1841669)
    at https://<redacted>/main.1e7905924d7399ee8ad2.js:2:1841699
    at Array.forEach (<anonymous>)
    at e.visit (https://<redacted>/main.1e7905924d7399ee8ad2.js:2:1841669)
    at https://<redacted>/main.1e7905924d7399ee8ad2.js:2:1841699
Component Stack

    in Fn
    in t
    in Ri
    in div
    in div
    in div
    in div
    in div
    in Mr
    in us
    in t
    in t
    in component
    in t
    in t
    in t
    in div
    in div
    in Pt
    in t
    in t
    in rl
    in al

This is only speculation, but I suspect that whatever code (correctly) assumes the node graph is acyclic also (incorrectly) assumes the artifact graph is acyclic.
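That mechanism is easy to demonstrate in isolation. The sketch below is not the actual UI code; the `visit`/`outgoingEdges` names are borrowed from the minified stack trace, and the two-node graph stands in for the artifact cycle:

```typescript
// Minimal repro of "Maximum call stack size exceeded" on a cyclic graph:
// a recursive visit with no visited-set guard recurses forever on a cycle.
const outgoingEdges = new Map<string, string[]>([
    ['step-a', ['step-d']],
    ['step-d', ['step-a']], // the artifact edge that closes the loop
]);

function visitUnguarded(node: string): void {
    for (const next of outgoingEdges.get(node) ?? []) {
        visitUnguarded(next); // never terminates on a cycle -> stack overflow
    }
}

let crashed = false;
try {
    visitUnguarded('step-a');
} catch (e) {
    // Node/browsers surface stack exhaustion as a catchable RangeError
    crashed = e instanceof RangeError;
}
console.log(crashed); // true
```

The same RangeError that the browser reports is what the try/catch observes here, which is consistent with the UI's error boundary catching it and showing the "Reload this page" screen.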

For context, this pattern is useful for recording a "high watermark": the workflow terminates or skips steps if, e.g., no new data is available for processing, but updates the same object with a new watermark after processing new data. This is particularly useful when combined with a CronWorkflow that periodically checks for and processes new data.

The attached workflow crashes only when the "use faster, but less pretty ..." and "toggle artifacts" options are selected, and was tested with the initial contents of the S3 artifact set to {"this": "is a test file"}.

Version(s)

v3.5.8

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: stack-overflow-test-
spec:
  podGC:
    strategy: OnWorkflowCompletion
  entrypoint: main
  templates:
    # Main function
    - name: main
      steps:
      - - name: step-a
          template: step-a
          arguments:
            artifacts:
              - name: test
                archive:
                  none: {}
                s3:
                  accessKeySecret:
                    key: AWS_ACCESS_KEY_ID
                    name: <redacted>
                  bucket: <redacted>
                  endpoint: <redacted>
                  insecure: false
                  key: stack-overflow-test
                  region: <redacted>
                  secretKeySecret:
                    key: AWS_SECRET_ACCESS_KEY
                    name: <redacted>

      - - name: step-b
          template: step-b
          arguments:
            parameters:
            - name: test
              value: "{{=sprig.fromJson(steps['step-a'].outputs.parameters['test'])['this']}}"

      - - name: step-c
          when: "{{=steps['step-b'].outputs.parameters['test'] != ''}}"
          template: step-c
          arguments:
            parameters:
            - name: test
              value: "{{=sprig.fromJson(steps['step-b'].outputs.parameters['test'])['that']}}"

      - - name: step-d
          when: "{{=steps['step-b'].outputs.parameters['test'] != ''}}"
          arguments:
            artifacts:
            - name: test
              from: '{{steps.step-b.outputs.artifacts.test}}' 
          inline: 
            inputs:
              artifacts:
              - name: test
                path: /tmp/test
                archive:
                  none: {}
            outputs:
              artifacts:
              - name: test
                path: /tmp/test
                archive:
                  none: {}
                s3:
                  accessKeySecret:
                    key: AWS_ACCESS_KEY_ID
                    name: <redacted>
                  bucket: <redacted>
                  endpoint: <redacted>
                  insecure: false
                  key: stack-overflow-test
                  region: <redacted>
                  secretKeySecret:
                    key: AWS_SECRET_ACCESS_KEY
                    name: <redacted>
            container:
              image: docker.io/argoproj/argosay:v2
              command: [/argosay, echo, 'This is a test']

    - name: step-a
      inputs:
        artifacts:
        - name: test
          path: /tmp/test
          archive:
            none: {}
      outputs:
        artifacts:
        - name: test
          path: /tmp/test
          archive:
            none: {}
        parameters:
        - name: test
          valueFrom:
            path: /tmp/test
      container:
        image: docker.io/argoproj/argosay:v2
        command: [/argosay, echo, 'This is a test']
    - name: step-b
      inputs:
        parameters:
        - name: test
      outputs:
        artifacts:
        - name: test
          path: /tmp/test
          archive:
            none: {}
        parameters:
        - name: test
          valueFrom:
            path: /tmp/test
      script:
        image: docker.io/library/alpine:3
        command: [/bin/sh, -xec]
        env:
        - name: TEST
          value: '{{inputs.parameters.test}}'
        args:
        - |
          echo '{"that": "'"${TEST}"'"}' > /tmp/test
    - name: step-c
      inputs:
        parameters:
        - name: test

      container:
        image: docker.io/argoproj/argosay:v2
        command: [/argosay, echo, 'This is a test']

Logs from the workflow controller

N/A

Logs from your workflow's wait container

N/A

agilgur5 commented 3 months ago

Yea, "Maximum call stack size exceeded" is highly likely to be due to an infinite loop.

You may or may not be able to work around this with the alternate visualization algorithm in the UI, usable when you toggle the "lightning bolt" button.

> I will be working on producing a minimal example tomorrow.

That would also be helpful to understand what scenario is going on, as I didn't quite follow the description. "Overwritten" as in used the same location? Since artifacts otherwise don't overlap.

andrewm-aero commented 3 months ago

> Yea, "Maximum call stack size exceeded" is highly likely to be due to an infinite loop.
>
> You may or may not be able to work around this with the alternate visualization algorithm in the UI, usable when you toggle the "lightning bolt" button.

Unfortunately, that isn't an option: the page crashes immediately, before even displaying the toolbar. Toggling the icon in another workflow and then returning to the affected workflow's page produces the same error.

> > I will be working on producing a minimal example tomorrow.
>
> That would also be helpful to understand what scenario is going on, as I didn't quite follow the description. "Overwritten" as in used the same location? Since artifacts otherwise don't overlap.

Yes. Consider the scenario where step A takes its input artifact from a specific key in a specific bucket, and a later step B stores an output artifact in the same bucket at the same key. The actual workflows have several steps in between, but that's the gist of it.

andrewm-aero commented 3 months ago

After some work, I was able to get a version of the workflow, with no identifying info, that reproduces the behavior. One thing I noticed, but didn't expect: simply having the cycle did not cause the effect, and if any of the `{"archive": {"none": {}}}` entries is removed, the crash doesn't happen.

agilgur5 commented 3 months ago

Thanks for providing the repro!

> One thing I noticed, but didn't expect, was that simply having the cycle did not cause the effect, and if any of the {"archive": {"none": {}}}'s are removed, the crash doesn't happen.

Yea, I was thinking about this, and the artifacts shouldn't be used during graph iteration at all. I have a feeling that either the status got corrupted somehow, or the deserialization somehow corrupted the object in-memory, causing a circular reference somewhere. The deserialization code hasn't changed in years, so my gut thought is that it's the status (and there have been some wacky bugs found in artifact status processing recently, like #12845). But TBD
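If anyone wants to test the corrupted-object theory, a circular reference in the deserialized status can be probed directly. This is a hypothetical debugging helper, not part of the Argo UI:

```typescript
// Walks a plain object graph and reports whether any path revisits an
// object that is still on the current path, i.e. a true circular reference.
function hasCircularReference(root: unknown): boolean {
    const onPath = new WeakSet<object>();
    function walk(value: unknown): boolean {
        if (typeof value !== 'object' || value === null) {
            return false; // primitives can't participate in a cycle
        }
        if (onPath.has(value)) {
            return true; // revisited an ancestor: circular reference
        }
        onPath.add(value);
        const circular = Object.values(value).some(walk);
        onPath.delete(value); // backtrack so siblings can share references
        return circular;
    }
    return walk(root);
}

// A status-like shape: string node IDs alone can't form an object cycle.
const status: any = {nodes: {a: {children: ['b']}, b: {children: []}}};
console.log(hasCircularReference(status)); // false
// Wiring actual object references to each other creates one.
status.nodes.a.parent = status.nodes.b;
status.nodes.b.parent = status.nodes.a;
console.log(hasCircularReference(status)); // true
```

Note that `JSON.parse` can never produce such a cycle on its own, since JSON text has no way to encode one, so if a cycle exists it would have to be introduced after deserialization.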

tooptoop4 commented 3 days ago

somewhere in:

- https://github.com/argoproj/argo-workflows/blob/5d893b161bbe4833d578be9f6c0322849215c23f/ui/src/shared/components/graph/graph-panel.tsx#L143
- https://github.com/argoproj/argo-workflows/blob/5d893b161bbe4833d578be9f6c0322849215c23f/ui/src/shared/components/graph/layout.ts#L8
- https://github.com/argoproj/argo-workflows/blob/5d893b161bbe4833d578be9f6c0322849215c23f/ui/src/shared/components/graph/fast-layout.ts#L7
- https://github.com/argoproj/argo-workflows/blob/5d893b161bbe4833d578be9f6c0322849215c23f/ui/src/shared/components/graph/coffman-graham-sorter.ts#L15
- https://github.com/argoproj/argo-workflows/blob/5d893b161bbe4833d578be9f6c0322849215c23f/ui/src/shared/components/graph/dfs-sorter.ts#L22
- https://github.com/argoproj/argo-workflows/blob/5d893b161bbe4833d578be9f6c0322849215c23f/ui/src/shared/components/graph/types.ts#L35
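For what it's worth, the usual fix in recursive layout code like this is a visited set, so the traversal terminates even when the input graph unexpectedly contains a cycle. A sketch only, with hypothetical names rather than the actual dfs-sorter.ts internals:

```typescript
// Depth-first post-order sort that tolerates cycles: nodes already seen are
// skipped instead of revisited, so a cyclic edge cannot blow the stack.
function sortDepthFirst(edges: Map<string, string[]>, roots: string[]): string[] {
    const visited = new Set<string>();
    const order: string[] = [];
    function visit(node: string): void {
        if (visited.has(node)) {
            return; // break cycles: each node is expanded at most once
        }
        visited.add(node);
        for (const next of edges.get(node) ?? []) {
            visit(next);
        }
        order.push(node); // post-order: children before parents
    }
    roots.forEach(visit);
    return order;
}

// A graph with the reported shape: the d -> a edge closes a loop.
const edges = new Map<string, string[]>([
    ['a', ['b']],
    ['b', ['c', 'd']],
    ['d', ['a']],
]);
console.log(sortDepthFirst(edges, ['a'])); // terminates: [ 'c', 'd', 'b', 'a' ]
```

With the guard in place the cyclic edge is simply dropped from the ordering, which is presumably the behavior the UI wants for display purposes even if the controller considers the artifact cycle legal.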