argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Accessing step `.main-logs` in WorkflowTemplate no longer works after upgrade v3.4.11 to v3.4.12 #12065

Open ndejong opened 11 months ago

ndejong commented 11 months ago

What happened/what you expected to happen?

We have workflow templates whose artifact input arguments reference the .main-logs output artifact from a previous step. This arrangement worked well until a recent upgrade from v3.4.11 to v3.4.12 (the same failure also occurs when upgrading to v3.5.x).

The HTTP 400 error response body looks like this:

{
  "code":3,
  "message":"templates.entrypoint-handler.steps failed to resolve {{ steps.foobar.outputs.artifacts.main-logs }}"
}

The supplied sample workflow works fine in v3.4.11 and earlier but fails in v3.4.12 (and v3.5.x).

Version

v3.4.12

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate

metadata:
  name: bugreport-20231023a

spec:
  workflowMetadata: {}
  entrypoint: entrypoint-handler

  templates:
    - name: entrypoint-handler
      steps:

        - - name: foobar
            template: foobar
            arguments:
              parameters:
                - { name: target, value: "hello world" }

        - - name: workflow-processor
            templateRef:
              name: bugreport-20231023a-templates
              template: workflow-processor
            arguments:
              artifacts:
                - { name: artifact_engine_log, from: "{{ steps.foobar.outputs.artifacts.main-logs }}" }

    - name: foobar
      inputs:
        parameters:
          - { name: target }
      container:
        image: docker/whalesay
        command: [ "cowsay" ]
        args: [ "{{ inputs.parameters.target }}" ]

---

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate

metadata:
  name: bugreport-20231023a-templates

spec:

  templates:
    - name: workflow-processor

      inputs:
        artifacts:
          - { name: artifact_engine_log, path: "/data/artifact_engine_log", optional: true }

      container:
        image: alpine:latest
        command: ["cat"]
        args: [ "/data/artifact_engine_log" ]

Logs from the workflow controller

time="2023-10-24T22:56:42.450Z" level=info duration=1.173461ms method=GET path=index.html size=473 status=0
time="2023-10-24T22:56:47.672Z" level=info msg="finished unary call with code InvalidArgument" error="rpc error: code = InvalidArgument desc = templates.entrypoint-handler.steps failed to resolve {{ steps.foobar.outputs.artifacts.main-logs }}" grpc.code=InvalidArgument grpc.method=SubmitWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2023-10-24T22:56:47Z" grpc.time_ms=72.616 span.kind=server system=grpc
time="2023-10-24T22:56:47.676Z" level=info duration=84.197433ms method=POST path=/api/v1/workflows/redacted-workflows/submit size=122 status=400

Logs from in your workflow's wait container

NA
tooptoop4 commented 11 months ago

maybe caused by https://github.com/argoproj/argo-workflows/pull/11781 ?

terrytangyuan commented 11 months ago

cc @shmruin Would you like to take a look?

agilgur5 commented 11 months ago

Possible workaround may be to use an expr expression, i.e. {{=steps.foobar.outputs.artifacts['main-logs']}}
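Applied to the reproduction above, the second step's artifact argument would then look like this (an untested sketch of the suggested workaround, not a confirmed fix):

```yaml
# Sketch: reference the artifact via an expr expression using bracket
# syntax, which tolerates the hyphen in "main-logs".
- - name: workflow-processor
    templateRef:
      name: bugreport-20231023a-templates
      template: workflow-processor
    arguments:
      artifacts:
        - name: artifact_engine_log
          from: "{{=steps.foobar.outputs.artifacts['main-logs']}}"
```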

shmruin commented 11 months ago

Yes, this is because of #11781. The newly added resolveAllVariables check is what introduces the problem.

The original code never flagged this case: in checkValidWorkflowVariablePrefix (which checks global variable prefixes), the untrimmed tag ` steps.` did not match the `steps.` prefix, so resolveAllVariables simply returned nil without error.

But with the trimming introduced by #11781, the tag is now actually checked, and steps.foobar.outputs.artifacts.main-logs does not pass. I confirmed this with some small debug prints, as below.

Bad Request: templates.entrypoint-handler.steps failed to resolve {{ steps.foobar.outputs.artifacts.main-logs }}, scope is map[steps.foobar.exitCode:true steps.foobar.finishedAt:true steps.foobar.hostNodeName:true steps.foobar.id:true steps.foobar.outputs.result:true steps.foobar.startedAt:true steps.foobar.status:true steps.workflow-processor.status:true workflow.annotations:placeholder-5 workflow.annotations.json:placeholder-6 workflow.labels:placeholder-7 workflow.labels.json:placeholder-8 workflow.mainEntrypoint:placeholder-2 workflow.name:placeholder-0 workflow.namespace:placeholder-1 workflow.serviceAccountName:placeholder-3 workflow.status:placeholder-9 workflow.uid:placeholder-4], globalParam is map[workflow.annotations:placeholder-5 workflow.annotations.json:placeholder-6 workflow.labels:placeholder-7 workflow.labels.json:placeholder-8 workflow.mainEntrypoint:placeholder-2 workflow.name:placeholder-0 workflow.namespace:placeholder-1 workflow.serviceAccountName:placeholder-3 workflow.status:placeholder-9 workflow.uid:placeholder-4], tag is ' steps.foobar.outputs.artifacts.main-logs ', trimmedTag is 'steps.foobar.outputs.artifacts.main-logs'

It looks like ...artifacts does not exist in scope (though workflow.outputs.artifacts. does).

So steps.foobar.outputs.artifacts.main-logs is in neither scope nor globalParams; it falls through the if-else cases with no match, which produces the error.

Does this mean the case checks in resolveAllVariables should be updated? I'd like to know which validation case something like steps.foobar.outputs.artifacts.main-logs should fall under.

agilgur5 commented 11 months ago

Thanks for the in-depth investigation @shmruin!

Does this mean the case checks in resolveAllVariables should be updated? I'd like to know which validation case something like steps.foobar.outputs.artifacts.main-logs should fall under.

It seems like the actual execution has been allowing this for some time, so it indeed sounds like resolveAllVariables needs updating. Basically the validation did not match the actual execution. There might be some other cases like this one that were missed due to the lack of trimming 😕 Basically one bug was hiding a few more validation bugs underneath.

shmruin commented 11 months ago

It looks like the 'artifacts' keyword of outputs is quite special. Looking at the log again, it is not in 'scope' but should still work as a keyword. Maybe I can generalize the 'workflow.outputs.artifacts.' prefix case to something like '*.outputs.artifacts'.

shmruin commented 11 months ago

@agilgur5, is main-logs treated as an artifact even without an explicit outputs declaration? I found that this error happens because the foobar step has no outputs.artifacts keyword. If it is written as below, then the template works.

    - name: foobar
      inputs:
        parameters:
          - { name: target }
      container:
        image: docker/whalesay
        command: [ "cowsay" ]
        args: [ "{{ inputs.parameters.target }}" ]
      outputs:
        artifacts:
          - name: main-logs
            path: SOME_MAIN_LOG_PATH

See addOutputsToScope.

If that is the only case, then I think we can make main-logs pass validation even when no outputs keyword is used.

agilgur5 commented 11 months ago

@agilgur5, is main-logs treated as an artifact even without an explicit outputs declaration?

oh I actually missed that, good catch!

If so, this isn't documented anywhere I could find. That may be how archive logs function? @ndejong are you using archive logs to store your logs as artifacts?

ndejong commented 11 months ago

Thanks for the time and attention on this -

@shmruin - unless I'm misunderstanding something here, I don't see how this works:

      outputs:
        artifacts:
          - name: main-logs
            path: SOME_MAIN_LOG_PATH

By using the .main-logs approach, as in our use case, we are accessing the terminal output of the Argo Workflow step itself, not an artifact written inside the container as this seems to suggest.

@agilgur5 - yes, we are picking up other artifacts from both steps in our workflows; our ability to access .main-logs in the later step(s) allows us to make decisions based on the terminal output(s) from the earlier step.

Is there some other way to access the Argo Workflow terminal output (i.e. .main-logs) of a step in a later step that we are not understanding here?

agilgur5 commented 11 months ago

@agilgur5 - yes we are picking up other artifacts from both steps in our workflows, our ability to access the .main-logs in the later step(s) allows us to make decisions based on the terminal output(s) from the earlier step.

Is this specifically via the archive logs feature? If so, this isn't mentioned in your issue or explicitly used in the Workflows you provided -- could you also provide your workflow-controller-configmap and/or your artifact-repositories ConfigMap? I imagine that you have an archive logs configuration there

Is there some other way to access the Argo Workflow terminal output (i.e. .main-logs) of a step in a later step that we are not understanding here?

The documented way of doing this is accessing through outputs.result.
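A minimal sketch of that documented pattern (template and parameter names here are illustrative): a step's stdout is exposed as outputs.result and passed to the next step as a parameter.

```yaml
templates:
  - name: entrypoint
    steps:
      - - name: step1
          template: produce-output
      - - name: step2
          template: consume-output
          arguments:
            parameters:
              # outputs.result captures the main container's stdout
              - name: text_from_step1
                value: "{{steps.step1.outputs.result}}"
```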

As far as I can currently tell, main-log seems to be undocumented behavior of the archive logs feature that you're making use of. There is no explicit reference to that in the entire codebase -- I believe that just happens to be what your archive logs artifact is named in your global or namespace ConfigMap rather than a specific keyword that Argo uses

ndejong commented 11 months ago

Is this specifically via the archive logs feature? If so, this isn't mentioned in your issue or explicitly used in the Workflows you provided -- could you also provide your workflow-controller-configmap and/or your artifact-repositories ConfigMap? I imagine that you have an archive logs configuration there

Yes, I see - it had not occurred to me that this was a possibility here. We use a local minio instance to provide temporary artifact storage in between steps, and indeed the defaults section of our workflow-controller-configmap contains an archiveLogs: true.

Our workflow-controller-configmap, trimmed down:

workflowDefaults:
  spec:
    activeDeadlineSeconds: 900
    archiveLogs: true
    securityContext:
      runAsGroup: 2888
      runAsNonRoot: true
      runAsUser: 2888
    serviceAccountName: argo-workflow-sa
    ttlStrategy:
      secondsAfterCompletion: 10800

... and our artifact-repositories

s3:
  bucket: artifacts
  endpoint: minio.argo.svc.cluster.local:9000
  accessKeySecret:
    name: minio-credentials
    key: accesskey
  secretKeySecret:
    name: minio-credentials
    key: secretkey

There are some things that don't quite align though -

I should also describe:

We discovered this technique of accessing main-logs after noticing it was referenced in the INPUTS/OUTPUTS tab of the related step in the Argo Workflows user interface, and followed that through until we discovered it was available as an artifact - we were not aware this is undocumented, it just made sense at the time.

The reason we use this main-logs artifact approach is that the parameter-based outputs.result approach fails when terminal outputs contain anything binary.

See the updated example below (on v3.4.11) that fails if you uncomment the two parameters items - it seems that the outputs.result parameter-passing approach fails with the message `cannot finish template replacement because the result was invalid JSON`.


apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate

metadata:
  name: bugreport-20231023a

spec:
  entrypoint: entrypoint-handler

  templates:
    - name: entrypoint-handler
      steps:

        - - name: step1
            template: step1-template

        - - name: step2
            template: step2-template
            arguments:
              artifacts: [ { name: artifact_from_step1, from: "{{ steps.step1.outputs.artifacts.main-logs }}" } ]
              # parameters: [ { name: parameter_from_step1, value: "{{ steps.step1.outputs.result }}" } ]

    - name: step1-template
      container:
        image: debian:stable-slim
        command: ["/bin/bash", "-c"]
        args:
          - |
            echo "Binary output to STDOUT"
            head -c256 /dev/urandom

            echo "Binary output to STDERR"
            >&2 head -c256 /dev/urandom

    - name: step2-template
      inputs:
        artifacts: [ { name: artifact_from_step1, path: "/tmp/artifact_from_step1" } ]
        # parameters: [ { name: parameter_from_step1 } ]
      container:
        image: debian:stable-slim
        command: ["/bin/bash", "-c"]
        args:
          - |
            cat "/tmp/artifact_from_step1"

To boil all that down into a summary:

agilgur5 commented 11 months ago

Thanks for digging in and providing that detail @ndejong!

Since you confirmed it wasn't in your codebase but that you did have archiveLogs enabled, I searched a bit harder and eventually found the line in Argo's codebase where this is created. It's dynamically created as containerName + "-logs", so it did not quite pop up in a search for main-logs (and the tests had explicit name fields too) 😅. init, main, and wait are the default names for Argo's Executor containers (as you may have noticed).
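In other words, the name is derived rather than hard-coded; roughly like this (an illustrative sketch, not the actual Argo source):

```go
package main

import "fmt"

// Sketch of how the archive-logs artifact name is derived: the artifact
// is named containerName + "-logs", which is why the default "main"
// container produces an artifact called "main-logs".
func logsArtifactName(containerName string) string {
	return containerName + "-logs"
}

func main() {
	// Argo's default executor container names.
	for _, c := range []string{"init", "main", "wait"} {
		fmt.Println(logsArtifactName(c))
	}
}
```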

@shmruin we may need to add GetMainContainerNames() + -logs to the validation logic when archiveLogs is turned on. I'm not sure if that requires any run-time knowledge or can be statically defined though.

This is still relying on undocumented behavior, so I am a little hesitant to include this, but it is currently a regression since that behavior was allowed before (despite being undocumented). @terrytangyuan any thoughts on allowing the undocumented behavior? I'm leaning toward allowing it

ndejong commented 11 months ago

Is this "just" a matter of documentation or does it run deeper than that? (sorry, I'm not across the Argo codebase)

As described above, we discovered the documented parameter-based outputs.result approach is not appropriate for our use case, whereas our funky "reach-back" artifact-based logs approach addressed our problem and reduced the scope for unexpected injection hazards in subsequent steps.

Would be happy to create documentation if that's all that's required.

agilgur5 commented 11 months ago

Is this "just" a matter of documentation or does it run deeper than that? (sorry, I'm not across the Argo codebase)

it would be deeper than that. it may have never really been intended to be exposed to users, I'm not sure. archiveLogs creates an artifact under-the-hood -- and you happened to figure out and use that internal functionality.

it being "undocumented behavior" is less that it lacks documentation and more that it may never have been a documented "feature" to begin with. as in, it may have just been intended as an internal, and not a user-facing feature.

you're currently using it as a feature, so there's a decision to be made here -- should we support it because it happened to be possible before? or should we not support it, since it's an internal that was never actually part of the spec -- meaning your prior usage was unintended behavior?

That's why I tagged Terry as one of the current Leads of Workflows. I lean toward allowing that behavior, but would defer to Terry on a decision there.