argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
14.53k stars 3.12k forks source link

expression `{{= workflow?.outputs?.parameters?.my_param ?? 'na' }}` always return `na` #11549

Open DocX opened 10 months ago

DocX commented 10 months ago

Pre-requisites

What happened/what you expected to happen?

Running workflow with global output parameter, that is used in metrics label value with expression containing "??" always evaluates as if the parameter value is nil. The reason for using the condition is to support step failing, when the outputs are not generated and in that case I want to fallback to "na".

Running the workflow below, after completed, the value of .spec.metrics is

{
  "prometheus": [
    {
      "name": "test_metric",
      "labels": [
        {
          "key": "param",
          "value": "na"
        }
      ],
      "help": "Test metric",
      "counter": {
        "value": "1"
      }
    }
  ]
}

Expected value is:

{
  "prometheus": [
    {
      "name": "test_metric",
      "labels": [
        {
          "key": "param",
          "value": "hello world"
        }
      ],
      "help": "Test metric",
      "counter": {
        "value": "1"
      }
    }
  ]
}

To check, the value of .status.outputs is correct as expected:

{
  "parameters": [
    {
      "name": "my_global_param",
      "value": "hello world"
    }
  ]
}

When setting the expression to "{{= workflow?.outputs?.parameters?.my_global_param }}" (i.e. removing the safe conditional), the evaluated value in the metric label is correct.

Version

v3.4.9

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: global-outputs-
spec:
  entrypoint: generate-globals
  serviceAccountName: argo

  templates:
  - name: generate-globals
    steps:
    - - name: generate
        template: global-output
  - name: global-output
    container:
      image: alpine:3.7
      command: [sh, -c]
      args: ["sleep 1; echo -n hello world > /tmp/hello_world.txt"]
    outputs:
      parameters:
      - name: hello-param
        valueFrom:
          path: /tmp/hello_world.txt
        globalName: my_global_param

  metrics:
    prometheus:
      - name: test_metric
        help: Test metric
        counter:
          value: "1"
        labels:
        - key: param
          value: "{{= workflow?.outputs?.parameters?.my_global_param ?? 'na' }}"

Logs from the workflow controller

time="2023-08-09T12:00:14.976Z" level=info msg="Processing workflow" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:14.984Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:14.987Z" level=info msg="Steps node global-outputs-llzsn initialized Running" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:14.987Z" level=info msg="StepGroup node global-outputs-llzsn-482890107 initialized Running" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:14.993Z" level=info msg="Pod node global-outputs-llzsn-2540520120 initialized Pending" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:15.025Z" level=info msg="Created pod: global-outputs-llzsn[0].generate (global-outputs-llzsn-global-output-2540520120)" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:15.026Z" level=info msg="Workflow step group node global-outputs-llzsn-482890107 not yet completed" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:15.026Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:15.026Z" level=info msg=reconcileAgentPod namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:15.042Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=1328 workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.030Z" level=info msg="Processing workflow" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.031Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.035Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.037Z" level=info msg="node changed" namespace=argo new.message= new.phase=Succeeded new.progress=0/1 nodeID=global-outputs-llzsn-2540520120 old.message= old.phase=Pending old.progress=0/1 workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.037Z" level=info msg="setting workflow.outputs.parameters.my_global_param: 'hello world'" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.039Z" level=info msg="Step group node global-outputs-llzsn-482890107 successful" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.040Z" level=info msg="node global-outputs-llzsn-482890107 phase Running -> Succeeded" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.040Z" level=info msg="node global-outputs-llzsn-482890107 finished: 2023-08-09 12:00:25.040122772 +0000 UTC" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.040Z" level=info msg="Outbound nodes of global-outputs-llzsn-2540520120 is [global-outputs-llzsn-2540520120]" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.040Z" level=info msg="Outbound nodes of global-outputs-llzsn is [global-outputs-llzsn-2540520120]" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.040Z" level=info msg="node global-outputs-llzsn phase Running -> Succeeded" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.040Z" level=info msg="node global-outputs-llzsn finished: 2023-08-09 12:00:25.040335284 +0000 UTC" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.040Z" level=info msg="Checking daemoned children of global-outputs-llzsn" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.040Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.040Z" level=info msg=reconcileAgentPod namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.040Z" level=info msg="Updated phase Running -> Succeeded" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.040Z" level=info msg="Marking workflow completed" namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.040Z" level=info msg="Checking daemoned children of " namespace=argo workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.046Z" level=info msg="cleaning up pod" action=deletePod key=argo/global-outputs-llzsn-1340600742-agent/deletePod
time="2023-08-09T12:00:25.056Z" level=info msg="Workflow update successful" namespace=argo phase=Succeeded resourceVersion=1366 workflow=global-outputs-llzsn
time="2023-08-09T12:00:25.068Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/global-outputs-llzsn-global-output-2540520120/labelPodCompleted

Logs from in your workflow's wait container

time="2023-08-09T12:00:20.879Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-08-09T12:00:20.879Z" level=info msg="Saving output parameters"
time="2023-08-09T12:00:20.879Z" level=info msg="Saving path output parameter: hello-param"
time="2023-08-09T12:00:20.879Z" level=info msg="Copying /tmp/hello_world.txt from base image layer"
time="2023-08-09T12:00:20.879Z" level=info msg="Successfully saved output parameter: hello-param"
time="2023-08-09T12:00:20.879Z" level=info msg="No output artifacts"
time="2023-08-09T12:00:20.898Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:argo:argo\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"argo\""
time="2023-08-09T12:00:20.919Z" level=info msg="Alloc=8343 TotalAlloc=16613 Sys=28797 NumGC=5 Goroutines=8"
time="2023-08-09T12:00:20.919Z" level=info msg="Deadline monitor stopped"
time="2023-08-09T12:00:20.919Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
terrytangyuan commented 10 months ago

Would adding default parameter value work?

stale[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

Jooong commented 8 months ago

Hi, it's also the same story with pod.name. For example, I tried to embed the following content to workflow template to produce prometheus metrics for every pod. Since pod.name is not resolved in non-pod type templates such as steps, I tried to deviate this by using expression ?. and ??. However, this always returns 'na' like the expression mentioned from this issue. Do you have any suggestion for this?

templateDefaults:
  metrics:
    prometheus:
      - name: exec_duration_gauge
        labels:
          - key: workflow_name
            value: "{{workflow.name}}"
          - key: workflow_namespace
            value: "{{workflow.namespace}}"
          - key: pod_name
            value: "{{=pod?.name ?? 'na'}}"
        help: "Duration gauge by pod name"
        gauge:
          realtime: true
          value: "{{duration}}"
agilgur5 commented 8 months ago

Hmm, I wonder if the parser is getting confused by the safe conditional -- since it can't directly look up pod?.name like it can with pod.name. Though IIRC, the entire variable, such as pod, is just passed to expr.

Does re-writing it without the safe conditional short-hand, e.g. as (pod && pod.name) || 'na', work fine?

Jooong commented 8 months ago

Hmm, I wonder if the parser is getting confused by the safe conditional -- since it can't directly look up pod?.name like it can with pod.name. Though IIRC, the entire variable, such as pod, is just passed to expr.

Does re-writing it without the safe conditional short-hand, e.g. as (pod && pod.name) || 'na', work fine?

Thanks for your suggestion, but it seems all node fails with this message.

error: cannot finish template replacement because the result was invalid JSON

There are more messages stated from summary > conditions in the web UI.

MetricsError: unable to substitute parameters for metric 'exec_duration_gauge': failed to evaluate expression: invalid operation: || (mismatched types bool and string) (1:19) | (pod && pod.name) || 'na' | ..................^

Tried {{=pod == nil ? 'na' : pod.name}} as well, but it's always evaluted as 'na'

agilgur5 commented 8 months ago

Thanks for checking that!

That's strange that pod == nil would be returning true... only other guess I might have is that nil is a type, so type(pod) == nil ? 'na' : pod.name might be different?

Jooong commented 8 months ago

Thanks for checking that!

That's strange that pod == nil would be returning true... only other guess I might have is that nil is a type, so type(pod) == nil ? 'na' : pod.name might be different?

I'm afraid this doesn't work either. Attaching the error message.

MetricsError: unable to substitute parameters for metric 'exec_duration_gauge': failed to evaluate expression: reflect: call of reflect.Value.Call on zero Value (1:1) | type(pod) == nil ? "na" : pod.name | ^

I'm not really familiar with the source code, but I guess the string pod.name in the pod template is directly replaced to the actual pod name without definition of pod here?

agilgur5 commented 7 months ago

I'm afraid this doesn't work either. Attaching the error message.

Thanks for testing this out and confirming!

I'm not really familiar with the source code

I know there's a good bit of expr code in scope.go and validate.go, but it actually appears to be scattered in many places. And that's just a search for "expr", which is not quite everything either; ProcessArgs is a common helper too for instance and does not use expr directly (it uses template.Replace, which uses expr in expressionReplace).

It honestly might be ripe for a refactor to consolidate more of the expression code into one place so that there are less bugs and it's easier to modify holistically. (Not to mention some existing large confusions with the different templating engines: #5142, #9529, #7831, etc). I've been meaning to jump into Argo's templating and start addressing lots of issues, but haven't had a chance to do so yet.

but I guess the string pod.name in the pod template is directly replaced to the actual pod name without definition of pod here?

Oh nice find! I think you might be correct on that one, which would explain quite a bit. I think this is the same for workflow.parameters (that variable is used in the setGlobalParameters function) and I think workflow.outputs.parameters as well. workflow itself seems to never be defined.

I think this is because the workflow and pod structs (for example) have a lot more data in them than what is used in expressions and there may not necessarily be a 1-to-1 match in naming either. EDIT: thinking about it a bit more, it might make sense to make a secondary facade struct solely for expressions that could make logic like this issue more straightforward / have less gotchas. Although I remember there being some rationale for it with workflow.parameters.json, c.f. #9742 and this line mentioned there... so easier said than done 😕