argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
14.92k stars 3.18k forks source link

use label variables in metrics #13141

Open stephen-dahl opened 3 months ago

stephen-dahl commented 3 months ago

Pre-requisites

What happened/what did you expect to happen?

https://cloud-native.slack.com/archives/C01QW9QSSSK/p1717447990226269

I tried using "{{=workflow.labels['workflows.argoproj.io/cron-workflow']}}" as a metric label inside a workflow created by a cron and received error MetricsError: unable to substitute parameters for metric 'exec_count': failed to evaluate expression "workflow.labels['workflows.argoproj.io/cron-workflow']"

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: foo
spec:
  ...
  workflowSpec:
    metrics:
      prometheus:
        - name: exec_count
          help: "Count of workflow execution by result status - workflow level"
          labels:
            - key: name
              value: "{{=workflow.labels['workflows.argoproj.io/cron-workflow']}}"
            - key: status
              value: "{{workflow.status}}"
          counter:
            value: "1"

Logs from the workflow controller

time="2024-06-04T18:05:19.505Z" level=info msg="Processing workflow" Phase= ResourceVersion=50667015 namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.512Z" level=info msg="Task-result reconciliation" namespace=cron numObjs=0 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.512Z" level=info msg="Updated phase  -> Running" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.512Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.513Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.513Z" level=info msg="DAG node oltp-backup-to-olap-ingest-xnlls initialized Running" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.513Z" level=info msg="All of node oltp-backup-to-olap-ingest-xnlls.backup dependencies [] completed" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.513Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.513Z" level=info msg="Pod node oltp-backup-to-olap-ingest-xnlls-411105871 initialized Pending" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.555Z" level=info msg="Created pod: oltp-backup-to-olap-ingest-xnlls.backup (oltp-backup-to-olap-ingest-xnlls--411105871)" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.555Z" level=info msg="All of node oltp-backup-to-olap-ingest-xnlls.delete-old-backups dependencies [] completed" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.555Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.555Z" level=info msg="Pod node oltp-backup-to-olap-ingest-xnlls-2719863104 initialized Pending" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.593Z" level=info msg="Created pod: oltp-backup-to-olap-ingest-xnlls.delete-old-backups (oltp-backup-to-olap-ingest-xnlls--2719863104)" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.593Z" level=info msg="TaskSet Reconciliation" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.593Z" level=info msg=reconcileAgentPod namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:19.593Z" level=info msg="Workflow to be dehydrated" Workflow Size=3567
time="2024-06-04T18:05:19.610Z" level=info msg="Workflow update successful" namespace=cron phase=Running resourceVersion=50667028 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:23.201Z" level=info msg="Processing cron/oltp-backup-to-olap-ingest" cronWorkflow=cron/oltp-backup-to-olap-ingest
time="2024-06-04T18:05:23.202Z" level=info msg="CronWorkflow cron/oltp-backup-to-olap-ingest added" cronWorkflow=cron/oltp-backup-to-olap-ingest
time="2024-06-04T18:05:29.556Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=50667028 namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:29.557Z" level=info msg="Task-result reconciliation" namespace=cron numObjs=0 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:29.557Z" level=info msg="node changed" namespace=cron new.message="Unschedulable: 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.." new.phase=Pending new.progress=0/1 nodeID=oltp-backup-to-olap-ingest-xnlls-2719863104 old.message= old.phase=Pending old.progress=0/1 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:29.557Z" level=info msg="node changed" namespace=cron new.message="Unschedulable: 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.." new.phase=Pending new.progress=0/1 nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 old.message= old.phase=Pending old.progress=0/1 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:29.557Z" level=info msg="TaskSet Reconciliation" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:29.557Z" level=info msg=reconcileAgentPod namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:29.557Z" level=info msg="Workflow to be dehydrated" Workflow Size=4189
time="2024-06-04T18:05:29.580Z" level=info msg="Workflow update successful" namespace=cron phase=Running resourceVersion=50667105 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:39.576Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=50667105 namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:39.577Z" level=info msg="Task-result reconciliation" namespace=cron numObjs=0 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:39.577Z" level=info msg="node unchanged" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-2719863104 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:39.577Z" level=info msg="node unchanged" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:39.578Z" level=info msg="TaskSet Reconciliation" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:05:39.578Z" level=info msg=reconcileAgentPod namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.289Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=50667105 namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.290Z" level=info msg="Task-result reconciliation" namespace=cron numObjs=2 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.290Z" level=info msg="task-result changed" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.290Z" level=info msg="task-result changed" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-2719863104 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.290Z" level=info msg="node changed" namespace=cron new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=oltp-backup-to-olap-ingest-xnlls-2719863104 old.message="Unschedulable: 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.." old.phase=Pending old.progress=0/1 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.290Z" level=info msg="node changed" namespace=cron new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 old.message="Unschedulable: 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.." old.phase=Pending old.progress=0/1 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.291Z" level=info msg="TaskSet Reconciliation" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.291Z" level=info msg=reconcileAgentPod namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:06.291Z" level=info msg="Workflow to be dehydrated" Workflow Size=4052
time="2024-06-04T18:06:06.320Z" level=info msg="Workflow update successful" namespace=cron phase=Running resourceVersion=50667577 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.762Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=50667577 namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.763Z" level=info msg="Task-result reconciliation" namespace=cron numObjs=2 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.763Z" level=info msg="task-result changed" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.763Z" level=info msg="task-result changed" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-2719863104 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.763Z" level=info msg="node changed" namespace=cron new.message= new.phase=Succeeded new.progress=0/1 nodeID=oltp-backup-to-olap-ingest-xnlls-2719863104 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.763Z" level=info msg="node changed" namespace=cron new.message= new.phase=Running new.progress=0/1 nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.764Z" level=info msg="TaskSet Reconciliation" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.764Z" level=info msg=reconcileAgentPod namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:22.764Z" level=info msg="Workflow to be dehydrated" Workflow Size=4356
time="2024-06-04T18:06:22.785Z" level=info msg="Workflow update successful" namespace=cron phase=Running resourceVersion=50667715 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.786Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=50667715 namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.786Z" level=info msg="Task-result reconciliation" namespace=cron numObjs=2 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.786Z" level=info msg="task-result changed" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.786Z" level=info msg="task-result changed" namespace=cron nodeID=oltp-backup-to-olap-ingest-xnlls-2719863104 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="Pod failed: Error (exit code 254)" displayName=backup namespace=cron pod=oltp-backup-to-olap-ingest-xnlls--411105871 templateName= workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="node changed" namespace=cron new.message="Error (exit code 254)" new.phase=Failed new.progress=0/1 nodeID=oltp-backup-to-olap-ingest-xnlls-411105871 old.message= old.phase=Running old.progress=0/1 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="Outbound nodes of oltp-backup-to-olap-ingest-xnlls set to [oltp-backup-to-olap-ingest-xnlls-411105871 oltp-backup-to-olap-ingest-xnlls-2719863104]" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="node oltp-backup-to-olap-ingest-xnlls phase Running -> Failed" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="node oltp-backup-to-olap-ingest-xnlls finished: 2024-06-04 18:06:32.78750981 +0000 UTC" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="TaskSet Reconciliation" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg=reconcileAgentPod namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="Updated phase Running -> Failed" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="Marking workflow completed" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=error msg="unable to substitute parameters for metric 'exec_count': failed to evaluate expression \"workflow.labels['workflows.argoproj.io/cron-workflow']\"" namespace=cron workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:32.787Z" level=info msg="Workflow to be dehydrated" Workflow Size=4724
time="2024-06-04T18:06:32.793Z" level=info msg="cleaning up pod" action=deletePod key=cron/oltp-backup-to-olap-ingest-xnlls-1340600742-agent/deletePod
time="2024-06-04T18:06:32.802Z" level=info msg="Queueing Failed workflow cron/oltp-backup-to-olap-ingest-xnlls for delete in 24h0m0s due to TTL"
time="2024-06-04T18:06:32.803Z" level=info msg="Workflow update successful" namespace=cron phase=Failed resourceVersion=50667795 workflow=oltp-backup-to-olap-ingest-xnlls
time="2024-06-04T18:06:33.222Z" level=info msg="Deleted Workflow 'oltp-backup-to-olap-ingest-gt97s' due to CronWorkflow 'oltp-backup-to-olap-ingest' history limit" namespace=cron workflow=oltp-backup-to-olap-ingest
time="2024-06-04T18:06:33.232Z" level=info msg="Processing cron/oltp-backup-to-olap-ingest" cronWorkflow=cron/oltp-backup-to-olap-ingest
time="2024-06-04T18:06:33.233Z" level=info msg="CronWorkflow cron/oltp-backup-to-olap-ingest added" cronWorkflow=cron/oltp-backup-to-olap-ingest

Logs from in your workflow's wait container

time="2024-06-04T18:06:03.887Z" level=info msg="Starting Workflow Executor" version=v3.5.6
time="2024-06-04T18:06:03.891Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-06-04T18:06:03.891Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=cron podName=oltp-backup-to-olap-ingest-xnlls--411105871 templateName= version="&Version{Version:v3.5.6,BuildDate:2024-04-19T20:54:43Z,GitCommit:555030053825dd61689a086cb3c2da329419325a,GitTag:v3.5.6,GitTreeState:clean,GoVersion:go1.21.9,Compiler:gc,Platform:linux/amd64,}"
time="2024-06-04T18:06:03.912Z" level=info msg="Starting deadline monitor"
time="2024-06-04T18:06:21.921Z" level=info msg="Main container completed" error="<nil>"
time="2024-06-04T18:06:21.921Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-06-04T18:06:21.921Z" level=info msg="No output parameters"
time="2024-06-04T18:06:21.921Z" level=info msg="No output artifacts"
time="2024-06-04T18:06:21.922Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: oltp-backup-to-olap-ingest-xnlls/oltp-backup-to-olap-ingest-xnlls--411105871/main.log"
time="2024-06-04T18:06:21.931Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2024-06-04T18:06:21.995Z" level=info msg="Saving file to s3" bucket=REDACTED endpoint=s3.amazonaws.com key=oltp-backup-to-olap-ingest-xnlls/oltp-backup-to-olap-ingest-xnlls--411105871/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-06-04T18:06:22.080Z" level=info msg="Save artifact" artifactName=main-logs duration=158.513309ms error="<nil>" key=oltp-backup-to-olap-ingest-xnlls/oltp-backup-to-olap-ingest-xnlls--411105871/main.log
time="2024-06-04T18:06:22.080Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-06-04T18:06:22.081Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-06-04T18:06:22.100Z" level=info msg="Alloc=9313 TotalAlloc=17333 Sys=23141 NumGC=4 Goroutines=12"
agilgur5 commented 3 months ago

Follow-up from this Slack thread.

I tried using "{{workflow.labels.workflows.argoproj.io/cron-workflow}}"

received error MetricsError: unable to substitute parameters for metric 'exec_count': failed to resolve {{=workflow.labels['workflows.argoproj.io/cron-workflow']}}, unable to substitute parameters for metric 'exec_duration': failed to resolve {{=workflow.labels['workflows.argoproj.io/cron-workflow']}}

            value: "{{workflow.labels.workflows.argoproj.io/creator}}"

The description here seems to have copy+pasted a few different ones, so it's not exactly clear which variants you tried and which you didn't. All 3 of the above are different.

dot syntax works on plain Workflow (not CronWorkflow)

In the thread, @tico24 wrote that this works:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: wibblewobble-
spec:
  metrics:
    prometheus:
      - name: exec_count
        help: "Count of workflow execution by result status - workflow level"
        labels:
          - key: name
            value: "{{workflow.labels.workflows.argoproj.io/creator}}"
          - key: status
            value: "{{workflow.status}}"
        counter:
          value: "1"

but not when using as a workflowSpec of a CronWorkflow. Tim's hypothesis there was: "My assumption is that the metrics parameters are unpacked before the cron label is applied to the running workflow, so it can't find the label - hence the error. I would agree that this is an issue."

expr brackets syntax

I suggested using the expr style brackets, i.e. {{=workflow.labels['workflows.argoproj.io/cron-workflow']}}. Per the error it seems like that didn't work either, although not entirely clear due to copy+paste errors as per above

I'm also not sure if brackets are potentially affected by https://github.com/argoproj/argo-workflows/issues/11549

stephen-dahl commented 3 months ago

ya im all over the place, working on cleaning it up... sorry

stephen-dahl commented 3 months ago

the description has been updated with a new run ensuring consistency between all the parts

stephen-dahl commented 3 months ago

did some testing

agilgur5 commented 3 months ago

the middle two shouldn't work, so those are correct.

the last two I would expect to work, so are buggy.

Regarding the first two -- those worked on a CronWorkflow as well? Since Tim said it didn't work when he tried a Workflow generated by a CronWorkflow

stephen-dahl commented 3 months ago

all of my testing was done by setting the metrics in the controller config workflow defaults, creating a cronworkflow, and submitting it through the UI

crileroro commented 2 months ago

I have slightly similar issue to this one and https://github.com/argoproj/argo-workflows/issues/3989 when using a workflowTemplate and a custom label in it. image

Although the label monitoring is set in the workflowTemplate, it is not resolved. However, if I manually add this label in the UI as in the image below, it works. image

It seems that the labels defined in the manifests are not being considered.