argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
14.87k stars 3.17k forks

Real-time gauge metrics would be useful #12243

Open Joibel opened 10 months ago

Joibel commented 10 months ago

Summary

It isn't possible to emit a real-time gauge metric counting how many workflows are currently active. There are probably other useful real-time values too; `workflow.duration` is the only one available right now.
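For context, a sketch of the one real-time gauge that is possible today, based on the documented `metrics` spec (the metric name and label here are illustrative):

```yaml
# Workflow-level real-time gauge on workflow.duration, the only
# real-time value currently supported.
metrics:
  prometheus:
    - name: exec_duration_gauge
      help: "Duration of the workflow"
      labels:
        - key: workflowName
          value: "{{workflow.name}}"
      gauge:
        realtime: true            # emitted continuously while the workflow runs
        value: "{{workflow.duration}}"
```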

Use Cases

If you'd like a Prometheus metric per workflow category (ignoring how we define categories), you must place a dummy task at the start of a workflow run to emit that counter. It would be far saner to just emit it as a workflow-level counter metric.
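The dummy-task workaround described above might look roughly like the following sketch (the `category` label, template names, and image are all hypothetical), with a no-op first step whose only job is to emit the counter:

```yaml
templates:
  - name: main
    steps:
      - - name: emit-metric       # dummy step, exists only to emit the counter
          template: count-started
      - - name: run
          template: actual-work
  - name: count-started
    container:
      image: alpine:3.19
      command: ["true"]           # no-op container
    metrics:
      prometheus:
        - name: workflows_started
          help: "Workflows started, by category"
          labels:
            - key: category
              value: batch-ingest # hypothetical category label
          counter:
            value: "1"
  - name: actual-work
    container:
      image: alpine:3.19
      command: ["echo", "doing the real work"]
```

With workflow-level metrics, the `count-started` template and the extra step would disappear; the counter would sit directly under the workflow spec.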


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

caelan-io commented 9 months ago

@agilgur5 - Anton, I'm curious how you solved for this in your prior workflow platform engineering life. 😄

agilgur5 commented 9 months ago

We didn't necessarily have a problem with the metrics available. Most of our "golden metrics" were actually from our multi-cluster control plane API, which used the Go Gin exporter for its metrics. It would fail over between our regional Argo Workflows instances and was the central user-facing API, so it was the bigger concern monitoring-wise.

That and observability just doesn't often get prioritized 😕 I had assigned ~3 engineers at different times, and that is what got us to have the exporter for metrics and OTel instrumentation for tracing in our API. The third project was to get self-service metrics and dashboards for data scientists (which would include Argo's metrics), and that only got started after my time. I wrote some New Relic queries in the interim (and Splunk queries were built into the platform early on) so that a model and particular deployment version (corresponding to a well-labeled WorkflowTemplate) could be looked up, but data scientists were on their own from there 🙃

caelan-io commented 9 months ago

Makes sense! Yes, if users have built their own control plane layer, it makes gathering workflow metrics easier. This is what we do at Pipekit as well with our control plane.

We were thinking some features could be added upstream to make more workflow metrics easier for end users to access when they don't have a control plane set up.

tvandinther commented 6 months ago

In the organisation where I am implementing a data processing platform, the people responsible for building the processing flows are different from the people monitoring the execution of those flows. Typically there is a live control centre whose staff care primarily about the existence of the output data, but who would benefit from dashboards that report intermediate problems so that they can triage and contact the right people.

For this reason, having a live view into what is happening is very valuable, and we don't want to force the use of yet another GUI (in this case the Argo Workflows UI) for this purpose. Rather, the architects have chosen to consume the metrics centrally as the way forward.

Hopefully this perspective helps. In short, I am in favour of this as it would make implementation of live observability easier.