ebr opened this issue 2 years ago
Hmm... this is going to be tricky. I think the best answer right now is that you shouldn't use `{{status}}` on real-time duration metrics. The reason is mostly implementation: we currently don't record timings for specific state transitions (e.g. `Pending` -> `Running`). The only thing we record is when a workflow finishes (`Status.FinishedAt`); we compare it to when it started (`Status.StartedAt`) and use that information to compute the duration. If `Status.FinishedAt` does not exist, we simply compute the difference between now and `Status.StartedAt`.
Even when we fix the bug where the `Pending` value keeps increasing indefinitely, we currently have no way to tell you how much time a node spent in the `Pending` state (since we don't record it anywhere).
If you are looking for a way to see how long your workflows are pending before they are executed, there are a number of default metrics (`argo_workflows_queue_*`) which could act as a proxy for this.
The real reason I started this experiment was to see if I could identify any workflow nodes that are currently in the `Suspended` state, so that the user is given the opportunity to resolve the root cause of some error condition before resuming the workflow. I didn't find a way to do this, because I don't think the `Suspended` state is recorded anywhere. (I can open a separate issue about this, unless I'm missing something.)
That aside, I think fixing the `Pending` value increasing indefinitely would still be helpful, because then the query `delta(step_duration_gauge{status="Pending"}[1m])` will return `0` once the `Pending` series stops increasing in value. That way we can at least infer that the node has transitioned out of `Pending`. And extending that with a `max by (workflow_name) (...)` aggregation of that query will tell us whether a workflow has any `Pending` nodes (see the rule sketch below).
To be clear, this is different from measuring the number of Pending pods (which we do separately), because pending workflow nodes might not even have created a Pod yet. (Correct me if I'm wrong, please.)
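For concreteness, a minimal sketch of how that query could be wired into a Prometheus alerting rule once the bug is fixed. It assumes the gauge is named `step_duration_gauge` and carries a `workflow_name` label, which is how we have configured it; neither name is an Argo default.

```yaml
# Sketch only: step_duration_gauge and workflow_name come from our own
# workflow metrics config, not from Argo defaults.
groups:
  - name: argo-workflow-pending
    rules:
      - alert: WorkflowHasPendingNodes
        # delta(...) > 0 means the Pending gauge is still increasing, i.e. at
        # least one node of that workflow has not left the Pending state yet.
        expr: max by (workflow_name) (delta(step_duration_gauge{status="Pending"}[1m])) > 0
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Workflow {{ $labels.workflow_name }} still has Pending nodes"
```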
It would be nice to also have the `Running` state emitted, for parity with the states that can be filtered on with `argo get --status=...`, but that's more of a feature request. I'm going to open one for this, as well as for my original issue of `Suspended` statuses not being recorded, because ultimately I need to use the above PromQL queries with `Suspended` statuses above all.
Hope this makes sense?
> I didn't find a way to do this, because I don't think the Suspended state is recorded anywhere.
`Suspended` is inferred when either the workflow or any node in the workflow has phase `Running` and is a `NodeTypeSuspend`. You could identify any such nodes by looking at the full workflow YAML status.
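For illustration, a suspend node that is waiting to be resumed looks roughly like this in the status (node IDs, names and timestamps below are made up; the signals to look for are `type: Suspend` together with `phase: Running`):

```yaml
# Excerpt of `argo get my-workflow -o yaml` (illustrative values)
status:
  phase: Running
  nodes:
    my-workflow-1234567890:
      id: my-workflow-1234567890
      name: my-workflow[1].approve
      displayName: approve
      templateName: approve
      type: Suspend     # NodeTypeSuspend
      phase: Running    # waiting to be resumed
      startedAt: "2022-03-01T12:00:00Z"
```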
> That aside, I think fixing the Pending increasing indefinitely would still be helpful.
In this case the value of `step_duration_gauge{status="Pending"}` can only be `0` after it transitions out (since we don't record timings).
> That way we can at least infer that the node has transitioned out of Pending.
This might be a misuse of metrics. Telemetry is not meant to answer questions about the state of an individual workflow.
> It would be nice to also have the Running state emitted
Sorry, not sure what you mean by emitted here?
Thank you for pointing out `NodeTypeSuspend` - this is helpful.
> You could identify any such nodes by looking at the full workflow YAML status
That is true, but I am aiming to make this a better experience for users without having to parse the status YAML. This is especially important for offloaded workflows (which are the majority of our workflows), since you can't simply `argo get` them without going through the server.
> This might be a misuse of metrics.
I'm not sure I agree - we are trying to answer whether the workflow has any `Pending` nodes right now. This seems to align with the definitions in the linked article?
> nice to also have the Running state emitted
Sorry, I mean that currently the value of the `status` workflow/template variable can be one of `Succeeded`, `Pending`, `Failed` (possibly also `Error`?). I think it would be helpful to also have it set to `Running` when the node has phase `Running`.
> That is true, but I am aiming to make this a better experience for users without having to parse the status YAML. This is especially important for offloaded workflows (which are the majority of our workflows), since you can't simply `argo get` them without going through the server.
This is a valid point. A new server/CLI command would probably be needed for this.
> we are trying to answer whether the workflow has any Pending nodes right now. This seems to align with the definitions in the linked article?
I think what I'm trying to say is that it sounds like you're interested in a particular instance of a workflow, not, say, an aggregated number of Pending statuses across all workflows or from a constant number of workflows (e.g. cron workflows). The distinction here, while seemingly simple, is that in the former case you attach a label with the unique workflow name to a metric, and as you run more and more workflows, more and more time series are created. This problem is called cardinality explosion, and we do our best to prevent our users from stepping into it:
> CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
Source. Also see Cardinality is Key.
I think a CLI/server command to get a list of workflows with suspended nodes is probably what you're looking for here. This is, however, a poll model. If polling is not acceptable here, then maybe this can be combined with a metric that returns the total number of suspended nodes in your cluster (without labels, hence avoiding the cardinality problem), and when that number is >0 you can query the CLI/server command. Reading metrics is also a poll model, however.
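To sketch what that could look like on the consumer side: note that `argo_workflows_suspended_nodes_count` below is a hypothetical, label-free metric that does not exist today; the name is made up for illustration only.

```yaml
# Hypothetical: argo_workflows_suspended_nodes_count is not an existing Argo
# metric; it stands in for the label-free counter proposed above.
groups:
  - name: argo-suspended-nodes
    rules:
      - alert: SuspendedNodesPresent
        expr: argo_workflows_suspended_nodes_count > 0
        for: 1m
        annotations:
          summary: "At least one workflow node is suspended; query the server/CLI for details"
```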
> I think it would be helpful to also have it set to Running when the node has phase Running.
I'll look into this
@ebr
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Summary
We added Prometheus realtime gauges to templates with a value of `{{duration}}`, labeled with `{{status}}`. We were surprised to see that the value of the `Pending` step duration keeps increasing even after the step has transitioned into the Running and Succeeded/Failed states, and beyond workflow completion. The only way to stop this metric from being emitted is to delete the workflow.

Also, the `{{duration}}` for the `Running` state seems to not be emitted at all.

We would expect the `Pending` series to flatten once the step exits the Pending state, and the `Running` series to be present.

A separate enhancement request might be to stop emitting realtime metrics for completed workflows, to both reduce the number of scraped series and help report on the current state via instant queries, rather than having to calculate `delta()` on gauges to determine whether the value is still changing.

Argo Workflows v3.2.3
Diagnostics
The workflow is a slightly modified version of the metrics example:
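(The full manifest isn't reproduced in this excerpt; the sketch below shows the kind of template-level realtime gauge described in the summary. Metric, label, template and image names are illustrative assumptions, not the original workflow.)

```yaml
# Sketch only - not the original manifest.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: realtime-metrics-example-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: sleep
            template: sleep
    - name: sleep
      metrics:
        prometheus:
          - name: step_duration_gauge
            help: "Realtime duration of this step"
            labels:
              - key: status
                value: "{{status}}"
              # NB: a per-workflow label like this is exactly what the
              # cardinality caution above warns about
              - key: workflow_name
                value: "{{workflow.name}}"
            gauge:
              realtime: true
              value: "{{duration}}"
      container:
        image: alpine:3.15
        command: [sh, -c]
        args: ["sleep 30"]
```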
As a result of this workflow we can see that:

- `status="Succeeded"` and `status="Failed"` series only start being emitted once the steps enter their respective state, and remain at that value.
- `status="Pending"` step series continue increasing the value of the gauge, even though the steps have long left the `Pending` state. The expected behaviour is that once a step is running, the `Pending` series' value should flatten.
- There is no `Running` series at all.
- Once the workflows are deleted, these metrics stop being emitted, as expected.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.