tooptoop4 opened this issue 1 month ago (status: Open)
@tooptoop4 is someone working on this, or can I investigate implementing it?
@napestershine you can work on it
There is a big problem with adding the workflow name to metrics: it is very high cardinality, essentially creating a separate data series for every workflow. All of these data series live in the memory of the workflow controller for the lifetime of the controller, and the receiving store will also need to keep a separate time series for each one.
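To make the cardinality point concrete, here is a minimal sketch. It uses prometheus/client_golang purely for illustration, not Argo's actual instrumentation, and the metric and label names are made up:

```go
package example

import "github.com/prometheus/client_golang/prometheus"

// Illustrative gauge with a workflow_name label. Every distinct workflow name
// creates its own child series, which lives in the controller's memory and
// becomes a separate time series in the metrics backend.
var workflowStatus = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "workflow_status", // hypothetical metric name
		Help: "Status of each workflow, labelled by name (illustrative only).",
	},
	[]string{"workflow_name", "phase"},
)

func recordWorkflow(name, phase string) {
	// 10,000 workflows => 10,000+ label combinations => 10,000+ series,
	// even long after the workflows themselves have finished.
	workflowStatus.WithLabelValues(name, phase).Set(1)
}
```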
I have already implemented some higher cardinality metrics (around namespaces and workflowTemplateRef names) to help with some of the issues you might be attempting to address, but blindly doing this will not be OK.
The issue description doesn't explain why these metrics are needed per workflow.
I am working on tracing support for workflows, which may allow some of the metrics you want to be extracted from the traces.
I might be new to this topic, so here is a simple use case: let's say I have a CronWorkflow and I want to check whether or not it was triggered on its schedule.
This proposal would give you the workflow name, from which you'd have to establish the CronWorkflow name.
https://argo-workflows.readthedocs.io/en/latest/metrics/#cronworkflows_triggered_total gives you this with much less cardinality.
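For that use case, a query against that counter is usually enough. The sketch below uses the Prometheus Go API client and assumes an argo_workflows_ prefix on the exposed metric, a name label, and a placeholder CronWorkflow called my-cron; check your own metrics exposition for the exact metric and label names:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Hypothetical Prometheus endpoint; adjust to your environment.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		panic(err)
	}
	v1api := promv1.NewAPI(client)

	// Did the CronWorkflow fire within the last hour (its assumed schedule window)?
	query := `increase(argo_workflows_cronworkflows_triggered_total{name="my-cron"}[1h]) > 0`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	result, warnings, err := v1api.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// An empty result means no trigger was recorded in the window.
	fmt.Println(result)
}
```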
Surely they could be purged from memory once they have been in a Succeeded/Error/Failed state for more than 10 minutes?
@Joibel This feature is available in 3.6.x, which has not been officially released yet. When can we expect that release?
The official answer is, as always, "when it's done".
Currently there is an rc3 release out; we need to make an rc4 and then wait 2 weeks. I'd predict the first half of November now, but there are no promises.
Please test rc3 and let us know how that works for you.
> Surely they could be purged from memory once they have been in a Succeeded/Error/Failed state for more than 10 minutes?
You would have to hack the OpenTelemetry code to do this, as it isn't considered the correct way to implement metrics. We already do this for custom metrics. That only solves half of the problem, though: you're still paying heavily for metrics storage when cardinality is high.
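For reference, here is a rough sketch of what such purging looks like with prometheus/client_golang, which does expose series deletion; the OpenTelemetry SDK the new metrics are built on has no equivalent public API, which is the hacking referred to above. Metric and label names are again illustrative:

```go
package example

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Same illustrative gauge as in the earlier sketch: one child series per
// workflow name and phase.
var workflowStatus = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "workflow_status", // hypothetical metric name
		Help: "Status of each workflow, labelled by name (illustrative only).",
	},
	[]string{"workflow_name", "phase"},
)

// purgeFinished drops every series belonging to a finished workflow once it
// has been completed for longer than the grace period, so the series stop
// consuming controller memory. Note this only helps the controller; the
// metrics backend still stores every series it has ever scraped.
func purgeFinished(workflowName string, finishedAt time.Time, grace time.Duration) {
	if time.Since(finishedAt) > grace {
		workflowStatus.DeletePartialMatch(prometheus.Labels{"workflow_name": workflowName})
	}
}
```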
kube_pod_status_phase is already there with even higher cardinality
In Prometheus there are pod-level metrics like kube_pod_status_phase and kube_pod_start_time.
We need similar metrics at the workflow level.