tooptoop4 opened this issue 1 month ago (status: Open)
@tooptoop4 is someone working on this, or can I investigate implementing it?
@napestershine you can work on it
There is a big problem with adding the workflow name to metrics: it is very high cardinality, essentially creating a separate data series for every workflow. All of these data series live in the memory of the workflow controller for the lifetime of the controller, and the receiving store will also need to keep a separate time series for each one.
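To make the cardinality point concrete, here is a minimal sketch. It uses prometheus/client_golang purely for illustration, not Argo's actual instrumentation, and the metric and label names are made up:

```go
package example

import "github.com/prometheus/client_golang/prometheus"

// Illustrative gauge with a workflow_name label. Every distinct workflow name
// creates its own child series, which lives in the controller's memory and
// becomes a separate time series in the metrics backend.
var workflowStatus = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "workflow_status", // hypothetical metric name
		Help: "Status of each workflow, labelled by name (illustrative only).",
	},
	[]string{"workflow_name", "phase"},
)

func recordWorkflow(name, phase string) {
	// 10,000 workflows => 10,000+ label combinations => 10,000+ series,
	// even long after the workflows themselves have finished.
	workflowStatus.WithLabelValues(name, phase).Set(1)
}
```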
I have already implemented some higher cardinality metrics (around namespaces and workflowTemplateRef names) to help with some of the issues you might be attempting to address, but blindly doing this will not be OK.
The issue description doesn't explain why these metrics are needed per workflow.
I am working on tracing support for workflows, which may allow some of the metrics you want to be extracted from the traces.
I might be new to this topic, so here is a simple use case: let's say I have a CronWorkflow and I want to check whether or not it was triggered on its schedule.
This proposal would give you the workflow name, from which you'd have to establish the CronWorkflow name.
https://argo-workflows.readthedocs.io/en/latest/metrics/#cronworkflows_triggered_total gives you this with much less cardinality.
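For that use case, a query against that counter is usually enough. The sketch below uses the Prometheus Go API client and assumes an argo_workflows_ prefix on the exposed metric, a name label, and a placeholder CronWorkflow called my-cron; check your own metrics exposition for the exact metric and label names:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Hypothetical Prometheus endpoint; adjust to your environment.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		panic(err)
	}
	v1api := promv1.NewAPI(client)

	// Did the CronWorkflow fire within the last hour (its assumed schedule window)?
	query := `increase(argo_workflows_cronworkflows_triggered_total{name="my-cron"}[1h]) > 0`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	result, warnings, err := v1api.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// An empty result means no trigger was recorded in the window.
	fmt.Println(result)
}
```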
Surely they could be purged from memory once they have been in a Succeeded/Error/Failed state for more than 10 minutes?
@Joibel This feature is available in 3.6.x, which has not been officially released yet. When can we expect that release?
The official answer is, as always, "when it's done".
Currently there is an rc3 release out; we need to make an rc4 and then wait 2 weeks. I'd predict the first half of November now, but there are no promises.
Please test rc3 and let us know how that works for you.
> Surely they could be purged from memory once they have been in a Succeeded/Error/Failed state for more than 10 minutes?
You would have to hack the OpenTelemetry code to do this, as it isn't considered the correct way to implement metrics. We already do this for custom metrics. That only solves half of the problem, though: you're still paying heavily for metrics storage when cardinality is high.
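For reference, here is a rough sketch of what such purging looks like with prometheus/client_golang, which does expose series deletion; the OpenTelemetry SDK the new metrics are built on has no equivalent public API, which is the hacking referred to above. Metric and label names are again illustrative:

```go
package example

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Same illustrative gauge as in the earlier sketch: one child series per
// workflow name and phase.
var workflowStatus = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "workflow_status", // hypothetical metric name
		Help: "Status of each workflow, labelled by name (illustrative only).",
	},
	[]string{"workflow_name", "phase"},
)

// purgeFinished drops every series belonging to a finished workflow once it
// has been completed for longer than the grace period, so the series stop
// consuming controller memory. Note this only helps the controller; the
// metrics backend still stores every series it has ever scraped.
func purgeFinished(workflowName string, finishedAt time.Time, grace time.Duration) {
	if time.Since(finishedAt) > grace {
		workflowStatus.DeletePartialMatch(prometheus.Labels{"workflow_name": workflowName})
	}
}
```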
kube_pod_status_phase is already there with even higher cardinality
In Prometheus there are pod-level metrics like kube_pod_status_phase and kube_pod_start_time.
We need similar metrics at the workflow level.