argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15.08k stars 3.2k forks source link

use label variables in metrics #13141

Open stephen-dahl opened 5 months ago

stephen-dahl commented 5 months ago

Pre-requisites

What happened/what did you expect to happen?

https://cloud-native.slack.com/archives/C01QW9QSSSK/p1717447990226269

I tried using "{{=workflow.labels['workflows.argoproj.io/cron-workflow']}}" as a metric label inside a workflow created by a cron and received error MetricsError: unable to substitute parameters for metric 'exec_count': failed to evaluate expression "workflow.labels['workflows.argoproj.io/cron-workflow']"

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: foo
spec:
  ...
  workflowSpec:
    metrics:
      prometheus:
        - name: exec_count
          help: "Count of workflow execution by result status - workflow level"
          labels:
            - key: name
              value: "{{=workflow.labels['workflows.argoproj.io/cron-workflow']}}"
            - key: status
              value: "{{workflow.status}}"
          counter:
            value: "1"

Logs from the workflow controller

time="2024-06-04T18:05:19.505Z" level=info msg="Processing workflow" Phase= ResourceVersion=50667015 namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.512Z" level=info msg="Task-result reconciliation" namespace=cron numObjs=0 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.512Z" level=info msg="Updated phase  -> Running" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.512Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.513Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.513Z" level=info msg="DAG node oltp-backup-to-olap-ingest-xnlls initialized Running" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.513Z" level=info msg="All of node oltp-backup-to-olap-ingest-xnlls.backup dependencies [] completed" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.513Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.513Z" level=info msg="Pod node oltp-backup-to-olap-ingest-xnlls-411105871 initialized Pending" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.555Z" level=info msg="Created pod: oltp-backup-to-olap-ingest-xnlls.backup (oltp-backup-to-olap-ingest-xnlls--411105871)" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.555Z" level=info msg="All of node oltp-backup-to-olap-ingest-xnlls.delete-old-backups dependencies [] completed" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.555Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.555Z" level=info msg="Pod node oltp-backup-to-olap-ingest-xnlls-2719863104 initialized Pending" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.593Z" level=info msg="Created pod: oltp-backup-to-olap-ingest-xnlls.delete-old-backups (oltp-backup-to-olap-ingest-xnlls--2719863104)" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.593Z" level=info msg="TaskSet Reconciliation" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.593Z" level=info msg=reconcileAgentPod namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.593Z" level=info msg="Workflow to be dehydrated" Workflow Size=3567
time="2024-06-04T18:05:19.610Z" level=info msg="Workflow update successful" namespace=cron phase=Running resourceVersion=50667028 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:23.201Z" level=info msg="Processing cron/oltp-backup-to-olap-ingest" cronWorkflow=cron/oltp-backup-to-olap-ingest
time="2024-06-04T18:05:23.202Z" level=info msg="CronWorkflow cron/oltp-backup-to-olap-ingest added" cronWorkflow=cron/oltp-backup-to-olap-ingest
time="2024-06-04T18:05:29.556Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=50667028 namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:29.557Z" level=info msg="Task-result reconciliation" namespace=cron numObjs=0 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:29.557Z" level=info msg="node changed" namespace=cron new.message="Unschedulable: 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.." new.phase=Pending new.progress=0/1 nodeID=oltp-backup-to-olap-ingest-xnlls-2719863104 old.message= old.phase=Pending old.progress=0/1 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:29.557Z" level=info msg="node changed" namespace=cron new.message="Unschedulable: 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.." new.phase=Pending new.progress=0/1 nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 old.message= old.phase=Pending old.progress=0/1 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:29.557Z" level=info msg="TaskSet Reconciliation" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:29.557Z" level=info msg=reconcileAgentPod namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:29.557Z" level=info msg="Workflow to be dehydrated" Workflow Size=4189
time="2024-06-04T18:05:29.580Z" level=info msg="Workflow update successful" namespace=cron phase=Running resourceVersion=50667105 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:39.576Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=50667105 namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:39.577Z" level=info msg="Task-result reconciliation" namespace=cron numObjs=0 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:39.577Z" level=info msg="node unchanged" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-2719863104 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:39.577Z" level=info msg="node unchanged" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:39.578Z" level=info msg="TaskSet Reconciliation" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:39.578Z" level=info msg=reconcileAgentPod namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.289Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=50667105 namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.290Z" level=info msg="Task-result reconciliation" namespace=cron numObjs=2 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.290Z" level=info msg="task-result changed" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.290Z" level=info msg="task-result changed" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-2719863104 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.290Z" level=info msg="node changed" namespace=cron new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=oltp-backup-to-olap-ingest-xnlls-2719863104 old.message="Unschedulable: 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.." old.phase=Pending old.progress=0/1 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.290Z" level=info msg="node changed" namespace=cron new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 old.message="Unschedulable: 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.." old.phase=Pending old.progress=0/1 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.291Z" level=info msg="TaskSet Reconciliation" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.291Z" level=info msg=reconcileAgentPod namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.291Z" level=info msg="Workflow to be dehydrated" Workflow Size=4052
time="2024-06-04T18:06:06.320Z" level=info msg="Workflow update successful" namespace=cron phase=Running resourceVersion=50667577 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.762Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=50667577 namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.763Z" level=info msg="Task-result reconciliation" namespace=cron numObjs=2 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.763Z" level=info msg="task-result changed" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.763Z" level=info msg="task-result changed" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-2719863104 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.763Z" level=info msg="node changed" namespace=cron new.message= new.phase=Succeeded new.progress=0/1 nodeID=oltp-backup-to-olap-ingest-xnlls-2719863104 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.763Z" level=info msg="node changed" namespace=cron new.message= new.phase=Running new.progress=0/1 nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.764Z" level=info msg="TaskSet Reconciliation" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.764Z" level=info msg=reconcileAgentPod namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.764Z" level=info msg="Workflow to be dehydrated" Workflow Size=4356
time="2024-06-04T18:06:22.785Z" level=info msg="Workflow update successful" namespace=cron phase=Running resourceVersion=50667715 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.786Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=50667715 namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.786Z" level=info msg="Task-result reconciliation" namespace=cron numObjs=2 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.786Z" level=info msg="task-result changed" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.786Z" level=info msg="task-result changed" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-2719863104 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="Pod failed: Error (exit code 254)" displayName=backup namespace=cron pod=oltp-backup-to-olap-ingest-xnlls--411105871 templateName= workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="node changed" namespace=cron new.message="Error (exit code 254)" new.phase=Failed new.progress=0/1 nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 old.message= old.phase=Running old.progress=0/1 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="Outbound nodes of oltp-backup-to-olap-ingest-xnlls set to [oltp-backup-to-olap-ingest-xnlls-411105871 oltp-backup-to-olap-ingest-xnlls-2719863104]" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="node oltp-backup-to-olap-ingest-xnlls phase Running -> Failed" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="node oltp-backup-to-olap-ingest-xnlls finished: 2024-06-04 18:06:32.78750981 +0000 UTC" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="TaskSet Reconciliation" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg=reconcileAgentPod namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="Updated phase Running -> Failed" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="Marking workflow completed" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=error msg="unable to substitute parameters for metric 'exec_count': failed to evaluate expression \"workflow.labels['workflows.argoproj.io/cron-workflow']\"" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="Workflow to be dehydrated" Workflow Size=4724
time="2024-06-04T18:06:32.793Z" level=info msg="cleaning up pod" action=deletePod key=cron/oltp-backup-to-olap-ingest-xnlls-1340600742-agent/deletePod
time="2024-06-04T18:06:32.802Z" level=info msg="Queueing Failed workflow cron/oltp-backup-to-olap-ingest-xnlls for delete in 24h0m0s due to TTL"
time="2024-06-04T18:06:32.803Z" level=info msg="Workflow update successful" namespace=cron phase=Failed resourceVersion=50667795 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:33.222Z" level=info msg="Deleted Workflow 'oltp-backup-to-olap-ingest-gt97s' due to CronWorkflow 'oltp-backup-to-olap-ingest' history limit" namespace=cron workflow=oltp-backup-to-olap-ingest
time="2024-06-04T18:06:33.232Z" level=info msg="Processing cron/oltp-backup-to-olap-ingest" cronWorkflow=cron/oltp-backup-to-olap-ingest
time="2024-06-04T18:06:33.233Z" level=info msg="CronWorkflow cron/oltp-backup-to-olap-ingest added" cronWorkflow=cron/oltp-backup-to-olap-ingest

Logs from in your workflow's wait container

time="2024-06-04T18:06:03.887Z" level=info msg="Starting Workflow Executor" version=v3.5.6
time="2024-06-04T18:06:03.891Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-06-04T18:06:03.891Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=cron podName=oltp-backup-to-olap-ingest-xnlls--411105871 templateName= version="&Version{Version:v3.5.6,BuildDate:2024-04-19T20:54:43Z,GitCommit:555030053825dd61689a086cb3c2da329419325a,GitTag:v3.5.6,GitTreeState:clean,GoVersion:go1.21.9,Compiler:gc,Platform:linux/amd64,}"
time="2024-06-04T18:06:03.912Z" level=info msg="Starting deadline monitor"
time="2024-06-04T18:06:21.921Z" level=info msg="Main container completed" error="<nil>"
time="2024-06-04T18:06:21.921Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-06-04T18:06:21.921Z" level=info msg="No output parameters"
time="2024-06-04T18:06:21.921Z" level=info msg="No output artifacts"
time="2024-06-04T18:06:21.922Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: oltp-backup-to-olap-ingest-xnlls/oltp-backup-to-olap-ingest-xnlls--411105871/main.log"
time="2024-06-04T18:06:21.931Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2024-06-04T18:06:21.995Z" level=info msg="Saving file to s3" bucket=REDACTED endpoint=s3.amazonaws.com key=oltp-backup-to-olap-ingest-xnlls/oltp-backup-to-olap-ingest-xnlls--411105871/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-06-04T18:06:22.080Z" level=info msg="Save artifact" artifactName=main-logs duration=158.513309ms error="<nil>" key=oltp-backup-to-olap-ingest-xnlls/oltp-backup-to-olap-ingest-xnlls--411105871/main.log
time="2024-06-04T18:06:22.080Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-06-04T18:06:22.081Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-06-04T18:06:22.100Z" level=info msg="Alloc=9313 TotalAlloc=17333 Sys=23141 NumGC=4 Goroutines=12"
agilgur5 commented 5 months ago

Follow-up from this Slack thread.

I tried using "{{workflow.labels.workflows.argoproj.io/cron-workflow}}"

received error MetricsError: unable to substitute parameters for metric 'exec_count': failed to resolve {{=workflow.labels['workflows.argoproj.io/cron-workflow']}}, unable to substitute parameters for metric 'exec_duration': failed to resolve {{=workflow.labels['workflows.argoproj.io/cron-workflow']}}

            value: "{{workflow.labels.workflows.argoproj.io/creator}}"

The description here seems to have copy+pasted a few different ones, so it's not exactly clear which variants you tried and which you didn't. All 3 of the above are different.

dot syntax works on plain Workflow (not CronWorkflow)

In the thread, @tico24 wrote that this works:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: wibblewobble-
spec:
  metrics:
    prometheus:
      - name: exec_count
        help: "Count of workflow execution by result status - workflow level"
        labels:
          - key: name
            value: "{{workflow.labels.workflows.argoproj.io/creator}}"
          - key: status
            value: "{{workflow.status}}"
        counter:
          value: "1"

but not when using as a workflowSpec of a CronWorkflow. Tim's hypothesis there was: "My assumption is that the metrics parameters are unpacked before the cron label is applied to the running workflow, so it can't find the label - hence the error. I would agree that this is an issue."

expr brackets syntax

I suggested using the expr style brackets, i.e. {{=workflow.labels['workflows.argoproj.io/cron-workflow']}}. Per the error it seems like that didn't work either, although not entirely clear due to copy+paste errors as per above

I'm also not sure if brackets are potentially affected by https://github.com/argoproj/argo-workflows/issues/11549

stephen-dahl commented 5 months ago

ya im all over the place, working on cleaning it up... sorry

stephen-dahl commented 5 months ago

the description has been updated with a new run ensuring consistency between all the parts

stephen-dahl commented 5 months ago

did some testing

agilgur5 commented 5 months ago

the middle two shouldn't work, so those are correct.

the last two I would expect to work, so are buggy.

Regarding the first two -- those worked on a CronWorkflow as well? Since Tim said it didn't work when he tried a Workflow generated by a CronWorkflow

stephen-dahl commented 5 months ago

all of my testing was done by setting the metrics in the controller config workflow defaults, creating a cronworkflow, and submitting it through the UI

crileroro commented 3 months ago

I have slightly similar issue to this one and https://github.com/argoproj/argo-workflows/issues/3989 when using a workflowTemplate and a custom label in it. image

Although the label monitoring is set in the workflowTemplate, it is not resolved. However, if I manually add this label in the UI as in the image below, it works. image

It seems that the labels defined in the manifests are not being considered.

Gitopolis commented 2 weeks ago

{{workflow.labels.workflows.argoproj.io/cron-workflow}}: worked

In my case this didn't work, Argo v3.6.0-rc2

Screenshot 2024-10-28 at 9 39 55 AM
tvandinther commented 1 week ago

I am also encountering this issue using the expr syntax.

The log from the workflow-controller:

time="2024-11-07T10:48:08.354Z" level=error msg="unable to substitute parameters for metric 'workflow_completed_count': failed to evaluate expression \" workflow.labels[\\\\\\\"example.com/environment-name\\\\\\\"] \"" namespace=workspace-workflow-reference workflow=workflow-reference-s9475 

The expression:

value: '{{= workflow.labels["example.com/environment-name"] }}'

I tested the expression on expr Playground on this data and it worked as expected:

workflow:
  labels:
    example.com/environment-name: "foo"
tvandinther commented 1 week ago

For those looking for immediate solutions, the workaround which works for me is to use workflow.labels.json along with fromJSON() from expr.

value: '{{= fromJSON(workflow.labels.json)["example.com/environment-name"] }}'

Another thing to note is that labels on the workflowTemplate referenced at spec.workflowTemplateRef of the running workflow are not transferred to the workflow resource. So for this to work you need to make sure you set the labels on the workflow as well by setting them via the spec.workflowMetadata.labels field on the workflowTemplate.