I do not intend to implement tracing as part of this proposal.
For reference, we do have a separate feature request for that: #12077
The metrics which are currently wrongly named are being deliberately broken
I am all for following standards and existing conventions. So long as we're explicit in the release notes, let's break those to fix them. As their current names are confusing / incorrect, I agree that there isn't much purpose in leaving them as-is for backward-compat. The faster we break broken things (to fix them), the better, as users have had less time to rely on or get used to them.
There will also not be a separate Prometheus scrape endpoint for metrics and telemetry; this doesn't make much sense anyway. The config for telemetry will be removed.
I was wondering why there were two different endpoints; were these just the same data but differently formatted? This would simplify a number of things for operators and also for the Helm chart, b/c it is pretty confusing rn IMO
Potentially related (I haven't root caused it yet): #12206 had remaining issues with metrics as well. We've fixed the root cause of the issue, but the metrics are still confusing. If we can roll those fixes into this, that'd be great, but they might require separate fixes.
I was wondering why there were two different endpoints; were these just the same data but differently formatted? This would simplify a number of things for operators and also for the Helm chart, b/c it is pretty confusing rn IMO
Chatted about this during today's Contributor Meeting.
They are actually both Prometheus-style right now, but the /telemetry endpoint is for Golang metrics (e.g. number of goroutines, etc.) and the /metrics endpoint is for Argo's own metrics (listed above).
I did find that part of the codebase in metrics/server.go. I think I was reading through metrics/metrics.go in the past, which does not distinguish between them.
These changes look good. They will simplify my approach to metrics and reduce the reliance on custom template metrics.
There might be some gaps in what's possible, though. To clarify, the main information I want to build a dashboard on is:
How many failures of a particular workflow (by template name) have occurred (and implicitly when)
How long was the workflow and each step in each of the various states (by virtue of the time they entered and exited those states)
How much resources did the workflow and steps use? This one is likely best answered with some kind of join with container metrics so having a label to join on would be important. Then by extension I can also answer the question, what proportion of container resource requests was used so that I can optimise it.
I think some of this is possible with the proposal (please correct me where I'm wrong) and that it's just a matter of query writing. For example, regarding workflow durations at the template/step level as well as the workflow level: is it possible to craft a query for this using the right labels on argo_workflows_total_count, perhaps by taking the time delta between certain phase changes? Or would this make more sense as its own metric?
In any case, these changes as described in the proposal will certainly be an improvement.
Responding to @tvandinther:
Metric time resolution will depend upon collection frequency:
How many failures of a particular workflow (by template name) have occurred (and implicitly when)
This is collectable (for templates run via workflowTemplateRef) from argo_workflows_workflowtemplate_triggered_total via the phase=Failed label.
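For example, a hedged Prometheus recording-rule sketch over that counter; the phase label is mentioned above, while the template-name label (called name here) is an assumption about the final label set:

groups:
  - name: argo-workflowtemplate-failures
    rules:
      # Failures per workflow template over the last hour; the "name" label is assumed.
      - record: argo:workflowtemplate_failures:increase1h
        expr: sum by (name) (increase(argo_workflows_workflowtemplate_triggered_total{phase="Failed"}[1h]))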
How long was the workflow and each step in each of the various states (by virtue of the time they entered and exited those states)
Attempting to collect step/task level data isn't possible here. See below on tracing.
How much resources did the workflow and steps use? This one is likely best answered with some kind of join with container metrics so having a label to join on would be important. Then by extension I can also answer the question, what proportion of container resource requests was used so that I can optimise it.
This won't change. You can already label the workflow (or do it via a CronWorkflow or event triggering), and then you'll have something to join to your pod. We're not collecting step-level information, to avoid cardinality explosions.
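As a sketch of that labelling approach, assuming the spec.podMetadata field (which copies the labels onto every pod the workflow creates); the WorkflowTemplate name dummy comes from the example later in this thread, and the label key/value are illustrative:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: etl-
  labels:
    pipeline: etl-nightly        # label on the Workflow object itself
spec:
  workflowTemplateRef:
    name: dummy
  podMetadata:
    labels:
      pipeline: etl-nightly      # propagated to the workflow's pods, for joining with container metrics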
argo_workflows_total_count won't answer the questions you're after at a workflow- or step-granular level, because it's really only a counter replacement for argo_workflows_count, which is actually a gauge.
My intention is that once this is in I'll add opentelemetry tracing for workflows. Workflow runs will be a trace, and the individual parts of the workflow will be spans within it.
Using span-to-metrics conversion you can then generate your own metrics from this information. This should be able to capture the detail you're after. I'll do some example span-to-metrics conversions for some of these things, because they're questions I'd like to be able to answer too.
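As a rough sketch of what a span-to-metrics setup could look like, assuming the controller exports OTLP traces to a collector; the spanmetrics connector and prometheus exporter come from opentelemetry-collector-contrib, and the dimension names are illustrative guesses at span attributes, not something defined by this proposal:

receivers:
  otlp:
    protocols:
      grpc: {}
connectors:
  spanmetrics:
    dimensions:
      # Hypothetical span attributes; the real attribute names would come from the tracing work.
      - name: workflow.name
      - name: template.name
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]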
I have a few use cases I'd like to see supported:
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: dummy
spec:
  serviceAccountName: pod-sa
  templates:
    - name: extract
      inputs: {}
      outputs: {}
      dag:
        tasks:
          - name: etl-source1
            arguments:
              parameters: []
            templateRef:
              name: work
              template: work
          - name: etl-source2
            arguments:
              parameters: []
            templateRef:
              name: work
              template: work
  entrypoint: extract
  outputs: {}
  metrics:
    prometheus:
      - name: job_last_success_timestamp
        labels:
          - key: job
            value: dummy
          - key: job_namespace
            value: '{{workflow.namespace}}'
          - key: job_UID
            value: '{{workflow.uid}}'
        help: Unix timestamp of last successful job completion
        when: '{{status}} == Succeeded'
        gauge:
          value: '{{=workflow.creationTimestamp.s}}'
      - name: job_last_failure_timestamp
        labels:
          - key: job
            value: dummy
          - key: job_namespace
            value: '{{workflow.namespace}}'
          - key: job_UID
            value: '{{workflow.uid}}'
        help: Unix timestamp of last failed job
        when: '{{status}} == Error'
        gauge:
          value: '{{=workflow.creationTimestamp.s}}'
Here I can see the timestamp of the last failed or succeeded run, but I would also like to see which steps in the DAG failed. My goal: if some steps of the DAG failed, I would still like to know when the last successful run of each etl-source was.
Custom metrics
It would be cool to get information from the individual steps somehow. My suggestion is to allow parsing of JSON objects. Let's say I have a container that outputs to stdout:
{
  "entitiesProcessed": 2000,
  "entitiesFailed": 0,
  "entitiesSkipped": 0
}
Now I'd like to use these like this:
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: dummy
spec:
  serviceAccountName: pod-sa
  templates:
    - name: extract
      inputs: {}
      outputs: {}
      dag:
        tasks:
          - name: etl-source1
            arguments:
              parameters: []
            templateRef:
              name: work
              template: work
  metrics:
    prometheus:
      - name: entities_processed
        labels:
          - key: job
            value: dummy
          - key: job_namespace
            value: '{{workflow.namespace}}'
          - key: job_UID
            value: '{{workflow.uid}}'
        help: Number of entities processed by etl-source1
        when: '{{status}} == Succeeded'
        gauge:
          value: '{{=workflow.DAG.etl-source1.output.entitiesProcessed}}'
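Something close to this may already be achievable by combining output parameters with template-level metrics, along the lines of the output-parameter example in the Argo metrics docs. A hedged sketch, where the image, command and file path are illustrative and the container writes the count to a file rather than stdout:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: work
spec:
  templates:
    - name: work
      container:
        image: example/etl:latest            # illustrative image
        command: [sh, -c]
        # The job writes its processed-entity count to a file instead of stdout.
        args: ["run-etl && echo -n 2000 > /tmp/entities_processed"]
      outputs:
        parameters:
          - name: entities-processed
            valueFrom:
              path: /tmp/entities_processed
      metrics:
        prometheus:
          - name: entities_processed
            help: Number of entities processed by this template invocation
            when: '{{status}} == Succeeded'
            gauge:
              value: '{{outputs.parameters.entities-processed}}'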
Adding another vote of support for this proposal - we mostly run Argo Workflows via CronWorkflow resources that invoke WorkflowTemplate resources. The proposed new metrics argo_workflows_cronworkflows_triggered_total, argo_workflows_workflowtemplate_triggered_total, and argo_workflows_workflowtemplate_runtime will be extremely valuable for us.
If we have the originName argument sorted out (separate proposal)
I couldn't find this originName proposal, but this also sounds like it's something we'd be very interested in. (I did find https://github.com/argoproj/argo-workflows/pull/10745 ).
Hello @Joibel,
We would also be very interested in this feature.
When reading the proposal, I still wonder which aggregation temporality will be used for opentelemetry metrics (sums or histograms): cumulative or delta?
Some backends (e.g. Prometheus) only support cumulative temporality, while others (e.g. Dynatrace) only support delta metrics. It is possible to configure an opentelemetry collector between the Argo workflow controller and the backend to convert cumulative metrics into delta metrics, but this has some limits (e.g. when the controller is restarted or when it is scaled up).
Which temporality did you have in mind for this proposal? Could you consider supporting both temporalities and allowing users to select the one they need via controller configuration?
@fstaudt - I hadn't considered that anyone would want anything other than the default in the otel SDK, which is cumulative.
The intended usage is with a collector between the controller and the data sink.
I may implement it as part of the initial commit, but I am wary of feature creep, so may wait to do it in a later PR. It should be relatively easy to implement either way.
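For that collector-in-the-middle setup, a minimal sketch of converting the controller's cumulative OTLP metrics to delta before export; the cumulativetodelta processor and otlphttp exporter come from opentelemetry-collector-contrib, the endpoint is illustrative, and exact processor options vary by collector version:

receivers:
  otlp:
    protocols:
      grpc: {}
processors:
  # Converts cumulative metrics to delta temporality for backends that require it.
  cumulativetodelta: {}
exporters:
  otlphttp:
    endpoint: https://otlp.example.com
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [cumulativetodelta]
      exporters: [otlphttp]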
This is merged as originally proposed.
Proposal for improving workflows metrics
This proposal is an overhaul of workflow metrics to bring them up to date, correct them, and make sure they are useful. It is intended to be a breaking change requiring users to update their use of metrics.
Opentelemetry
I'd like to move to using opentelemetry (otel) as the libraries for collecting metrics. They are mature and the design of the interfaces encourages good practices.
We'd still be able to expose the resulting metrics as a Prometheus-compatible /metrics endpoint, scrapable by any Prometheus-compatible scraper. It would also allow us to emit the same metrics using the OpenTelemetry protocol, with the benefits that provides. This would also allow us to emit tracing using the same protocol and libraries. I do not intend to implement tracing as part of this proposal.
Each transmission protocol would be separately configured. Data collection would be identical in the code no matter which protocols are in use.
Changes to the metrics
Some words have specific meanings in the world of observability. The OpenTelemetry metrics documentation explains these words; make particular note of counter and gauge, which are currently used incorrectly within workflows. Counters MUST only ever go up.
A counter is something that ever-increases (unless the controller is restarted, in which case it resets to 0, which the querying tool will understand and handle correctly). You can use "increase" functions to monitor changes over your choice of time period. Gauges can be constructed from counters given both a count of incoming and a count of outgoing events.
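As an illustration of that last point, a gauge derived from a pair of counters via a recording rule; the metric names here are purely hypothetical and not part of the proposal:

groups:
  - name: derived-gauges
    rules:
      # "Currently running" = everything that entered Running minus everything that left it.
      - record: example:workflows_running:gauge
        expr: sum(example_workflows_entered_running_total) - sum(example_workflows_exited_running_total)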
This is being done with a view to running multiple active controllers in future (horizontal controller scaling; see the Horizontally Scalable Controller proposal). The existing metrics are hard to use in this context.
Existing metrics
argo_workflows_count: actually a gauge; to become argo_workflows_gauge, with a new argo_workflows_total_count (covering the phases pending, running, errored, succeeded, failed, removed) as a true counter
argo_workflows_error_count
argo_workflows_k8s_request_total
argo_workflows_operation_duration_seconds
argo_workflows_pods_count: to become argo_workflows_pods_gauge, with argo_workflows_pods_total_count added as a true counter in the same way as argo_workflows_total_count
argo_workflows_queue_adds_count
argo_workflows_queue_depth_count: to become argo_workflows_queue_depth_gauge
argo_workflows_queue_latency
argo_workflows_workers_busy
argo_workflows_workflow_condition
argo_workflows_workflows_processed_count
log_messages: to become argo_workflows_log_messages
More detail on argo_workflows_total_count:
This would allow users to answer questions such as:
For a given time period, how many workflows were executed? How many passed? How many failed? How many errored?
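For example, hedged recording rules over the proposed argo_workflows_total_count; the phase label name and its value casing are assumptions about the final implementation:

groups:
  - name: argo-workflow-outcomes
    rules:
      # Workflows that completed successfully in the last 24 hours.
      - record: argo:workflows_succeeded:increase1d
        expr: increase(argo_workflows_total_count{phase="Succeeded"}[1d])
      # Workflows that failed in the last 24 hours.
      - record: argo:workflows_failed:increase1d
        expr: increase(argo_workflows_total_count{phase="Failed"}[1d])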
New metrics
I know some of the items above are new, but they are related to existing metrics so they are documented alongside them.
argo_workflows_controller_build_info
argo_workflows_cronworkflows_triggered_total
argo_workflows_workflowtemplate_triggered_total
argo_workflows_clusterworkflowtemplate_triggered_total
argo_workflows_workflowtemplate_runtime
argo_workflows_workflowtemplate_used_total as a new metric (maybe)
I'd like to count usage (which would not include status) of each template, wherever it is used. I'm not sure that there aren't hurdles to doing this correctly, so I'll put this down as a maybe.
originName enhancements
If we have the originName argument sorted out (separate proposal) we could also have similar metrics as the workflowtemplate ones for originName (total by status and runtime histogram).
Dashboard
The Grafana dashboard will be updated to use these new metrics.
Couldn't we migrate to otel separately from changing metrics?
We could, but I think whilst we're in there it's better to fix both sides of this at the same time as there will be a fair amount of churn when doing either thing.
This is a breaking change
This change will allow you to continue to collect metrics using prometheus scraping and mostly things are similar.
The metrics which are currently wrongly named are being deliberately broken and will no longer be emitted under their old names. They will still be available under new names, so with minimal effort you can just change any graphs to use the new name. This isn't the recommended course of action, though; using the new metrics should give a better picture of the state of your cluster and the workflows it contains.
Leaving them with their old names means extra configuration flags and code complication for quite minimal benefit.
There will also not be a separate Prometheus scrape endpoint for metrics and telemetry; this doesn't make much sense anyway. The config for telemetry will be removed.