argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

proposal: metrics overhaul including opentelemetry #12589

Closed Joibel closed 2 months ago

Joibel commented 9 months ago

Proposal for improving workflows metrics

This proposal is an overhaul of workflow metrics to bring them up to date, correct them, and make sure they are useful. It is intended to be a breaking change requiring users to update their use of metrics.

OpenTelemetry

I'd like to move to the OpenTelemetry (otel) libraries for collecting metrics. They are mature and the design of their interfaces encourages good practices.

We'd still be able to expose the resulting metrics on a Prometheus-compatible /metrics endpoint, consumable by any Prometheus-compatible scraper.

It would also allow us to emit the same metrics over the OpenTelemetry protocol, with the benefits this provides, and to emit tracing using the same protocol and libraries. I do not intend to implement tracing as part of this proposal.

Each transmission protocol would be separately configured. Data collection would be identical in the code no matter which protocols are in use.
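
To make that concrete, here is a minimal sketch (not the actual implementation) of a single otel-go MeterProvider feeding both a Prometheus-scrapable reader and an OTLP exporter. The meter name and metric name are just placeholders:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
	api "go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Reader 1: expose metrics for Prometheus-style scraping.
	promReader, err := otelprom.New()
	if err != nil {
		log.Fatal(err)
	}

	// Reader 2: push the same metrics over OTLP (endpoint from env/config).
	otlpExporter, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// One provider, two readers: the instrumentation code below is identical
	// whichever transmission protocols are enabled.
	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(promReader),
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(otlpExporter)),
	)
	defer provider.Shutdown(ctx)

	meter := provider.Meter("argo-workflows")
	counter, err := meter.Int64Counter("argo_workflows_total_count",
		api.WithDescription("Workflows entering each phase"))
	if err != nil {
		log.Fatal(err)
	}
	counter.Add(ctx, 1, api.WithAttributes(attribute.String("phase", "Running")))
}
```

The instrument is created and incremented once; which readers are attached is purely a configuration concern.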

Changes to the metrics

Some words have specific meanings in the world of observability. The OpenTelemetry metrics documentation explains these terms - make particular note of counter and gauge, which are currently used incorrectly within workflows. Counters MUST only ever go up.

A counter is something that only ever increases (unless the controller restarts, in which case it resets to 0; querying tools understand this and do the correct thing). You can monitor changes with "increase"-style functions over your choice of time period. Gauges can be constructed from counters given a count of both incoming and outgoing events.
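
As a sketch of what that looks like in practice (illustrative only: argo_workflows_queue_removes_count is a made-up name here, and the PromQL in the comments is just an example of backend-side derivation):

```go
package queue

import (
	"context"

	"go.opentelemetry.io/otel/metric"
)

// Metrics records work queue activity as two monotonic counters.
// The "removes" instrument name is invented for this sketch.
type Metrics struct {
	adds    metric.Int64Counter
	removes metric.Int64Counter
}

func New(meter metric.Meter) (*Metrics, error) {
	adds, err := meter.Int64Counter("argo_workflows_queue_adds_count",
		metric.WithDescription("Items added to the work queue"))
	if err != nil {
		return nil, err
	}
	removes, err := meter.Int64Counter("argo_workflows_queue_removes_count",
		metric.WithDescription("Items taken off the work queue"))
	if err != nil {
		return nil, err
	}
	return &Metrics{adds: adds, removes: removes}, nil
}

// Both counters only ever go up; the backend derives everything else,
// for example in PromQL:
//   current depth (a gauge): sum(adds counter) - sum(removes counter)
//   activity over a window:  increase(adds counter[5m])
// Counter resets on controller restart are handled by the query engine.
func (m *Metrics) Enqueue(ctx context.Context) { m.adds.Add(ctx, 1) }
func (m *Metrics) Dequeue(ctx context.Context) { m.removes.Add(ctx, 1) }
```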

This is being done with a future of running multiple active controllers in mind (horizontal controller scaling, see Horizontally Scalable Controller). The existing metrics are hard to use in that context.

Existing metrics

| Existing name | Proposed name | User reported problems |
| --- | --- | --- |
| argo_pod_missing | argo_workflows_pod_missing_count | Match the prefix and change this to a counter of incidents. As a gauge you cannot easily tell what happened on workflow controller restart. |
| argo_workflows_count | argo_workflows_gauge | This emits the number of workflow resources in the cluster. If the user makes use of garbage collection and/or archiving, the workflows will no longer be counted (and the 'count' will go down). |
| new | argo_workflows_total_count | Count workflows as they enter each phase (pending, running, errored, succeeded, failed, removed) as a true counter. |
| argo_workflows_error_count | no change | |
| argo_workflows_k8s_request_total | no change | |
| argo_workflows_operation_duration_seconds | no change | |
| argo_workflows_pods_count | argo_workflows_pods_gauge | It's a gauge. |
| new | argo_workflows_pods_total_count | Same arguments as for argo_workflows_total_count. |
| argo_workflows_queue_adds_count | no change | |
| argo_workflows_queue_depth_count | argo_workflows_queue_depth_gauge | It is a gauge. |
| argo_workflows_queue_latency | no change | |
| argo_workflows_workers_busy | no change | |
| argo_workflows_workflow_condition | no change | |
| argo_workflows_workflows_processed_count | remove | This has never functioned and is always 0. |
| log_messages | argo_workflows_log_messages | Document its existence. |

argo_workflows_total_count details

This would allow users to answer questions such as:

For a given time period, how many workflows were executed? How many passed? How many failed? How many errored?

New metrics

I know some of the items above are new, but they are related to existing metrics, so they are documented alongside them.

| Name | Other changes and notes |
| --- | --- |
| argo_workflows_controller_build_info | Would have labels for the workflow-controller version, OS, architecture, Go version, and git SHA (a sketch follows below). |
| argo_workflows_cronworkflows_triggered_total | Would have labels of name, namespace and status, and be incremented for each of the phases the workflow goes through. |
| argo_workflows_workflowtemplate_triggered_total | Would have labels of name, namespace and status, and be incremented for each of the phases the workflow goes through. Would only count top-level workflowTemplateRef style running. |
| argo_workflows_clusterworkflowtemplate_triggered_total | As previous, but for ClusterWorkflowTemplates (and obviously no namespace label). |
| argo_workflows_workflowtemplate_runtime | Would have labels of name and namespace and count the same as the previous. Histogram of time between trigger and completion. |
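
For argo_workflows_controller_build_info, the usual pattern is a gauge that is always 1, with the build details carried as labels. A hedged sketch with the otel-go SDK (the label keys and the helper itself are illustrative, not the final implementation):

```go
package buildinfo

import (
	"context"
	"runtime"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// Register exposes build details as labels on a gauge that is always 1,
// following the usual *_build_info convention. The version and gitSHA values
// would come from build-time ldflags; the label keys here are placeholders.
func Register(meter metric.Meter, version, gitSHA string) error {
	_, err := meter.Int64ObservableGauge("argo_workflows_controller_build_info",
		metric.WithDescription("Build information of the workflow-controller"),
		metric.WithInt64Callback(func(_ context.Context, o metric.Int64Observer) error {
			o.Observe(1, metric.WithAttributes(
				attribute.String("version", version),
				attribute.String("git_sha", gitSHA),
				attribute.String("go_version", runtime.Version()),
				attribute.String("os", runtime.GOOS),
				attribute.String("arch", runtime.GOARCH),
			))
			return nil
		}),
	)
	return err
}
```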

argo_workflows_workflowtemplate_used_total as a new metric

I'd like to count usage (which would not include status) of each template, wherever it is used. I'm not sure there aren't hurdles to doing this correctly, so I'll put this down as a maybe.

originName enhancements

If we have the originName argument sorted out (separate proposal) we could also have metrics similar to the workflowtemplate ones for originName (total by status and runtime histogram).

Dashboard

The Grafana dashboard will be updated to use these new metrics.

Couldn't we migrate to otel separately from changing metrics?

We could, but whilst we're in there I think it's better to fix both sides of this at the same time, as there will be a fair amount of churn doing either.

This is a breaking change

This change will allow you to continue collecting metrics via Prometheus scraping, and most things remain similar.

The metrics which are currently wrongly named are being deliberately broken and will no longer be emitted under their old names. They will still be available under new names, so with minimal updating effort you can just point any graphs at the new name. That isn't the recommended course of action, though: using the new metrics should give a better picture of the state of your cluster and the workflows it contains.

Leaving them with their old names would mean extra configuration flags and code complication for quite minimal benefit.

There will also no longer be a separate Prometheus scrape endpoint for metrics and telemetry; the split doesn't make sense anyway. The config for telemetry will be removed.

agilgur5 commented 9 months ago

I do not intend to implement tracing as part of this proposal.

For reference, we do have a separate feature request for that: #12077

The metric which are currently wrongly named are being deliberately broken

I am all for following standards and existing conventions. So long as we're explicit in the release notes, let's break those to fix them. As their current names are confusing / incorrect, I agree that there isn't much purpose in leaving them as-is for backward-compat. The faster we break broken things (to fix them), the better, as users have had less time to rely on or get used to them.

There will also not be a separate prometheus scrape point for metrics and telemetry. This doesn't make any sense anyway. The config for telemetry will be removed.

I was wondering why there were two different endpoints; were these just the same data but differently formatted? This would simplify a number of things for operators and also for the Helm chart, b/c it is pretty confusing rn IMO

agilgur5 commented 9 months ago

Potentially related (I haven't root caused it yet): #12206 had remaining issues with metrics as well. We've fixed the root cause of the issue, but the metrics are still confusing. If we can roll those fixes into this, that'd be great, but they might require separate fixes.

agilgur5 commented 9 months ago

I was wondering why there were two different endpoints; were these just the same data but differently formatted? This would simplify a number of things for operators and also for the Helm chart, b/c it is pretty confusing rn IMO

Chatted about this during today's Contributor Meeting. They are actually both Prometheus-style right now, but the /telemetry endpoint is for Golang metrics (e.g. number of goroutines etc) and the /metrics endpoint is Argo's own metrics (listed above). I did find that part of the codebase in metrics/server.go. I think I was reading through metrics/metrics.go in the past, which does not distinguish between them.
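
For illustration, merging the two onto one endpoint could look roughly like this (a sketch assuming the otel Prometheus exporter and the client_golang Go/process collectors share a single registry; this is not the actual metrics/server.go code):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	reg := prometheus.NewRegistry()

	// Golang runtime and process metrics (roughly what /telemetry serves today).
	reg.MustRegister(
		collectors.NewGoCollector(),
		collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}),
	)

	// Argo's own metrics, bridged from otel into the same registry
	// (roughly what /metrics serves today).
	exporter, err := otelprom.New(otelprom.WithRegisterer(reg))
	if err != nil {
		log.Fatal(err)
	}
	_ = sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter))

	// A single scrape endpoint for both sets of metrics.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```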

tvandinther commented 7 months ago

These changes look good. They will simplify my approach to metrics and reduce the reliance on custom template metrics.

Although there might be some holes in the possibilities. To clarify, the main information I want to build a dashboard on is:

- How many failures of a particular workflow (by template name) have occurred (and implicitly when).
- How long the workflow and each step spent in each of the various states (by virtue of the time they entered and exited those states).
- How much resources the workflow and steps used. This one is likely best answered with some kind of join with container metrics, so having a label to join on would be important. Then by extension I can also answer what proportion of container resource requests was used, so that I can optimise it.

I think some of this is possible with the proposal; please correct me where I am wrong and where it is just a matter of query writing. For example, regarding workflow durations at the template/step level as well as the workflow level: is this something a query could be crafted for using the right labels on argo_workflows_total_count, perhaps by taking the time delta between certain phase changes? Or does this make sense to be given its own metric?

In any case, these changes as described in the proposal will certainly be an improvement.

Joibel commented 7 months ago

Responding to @tvandinther:

Metric time resolution will depend upon collection frequency.

How many failures of a particular workflow (by template name) have occurred (and implicitly when)

This is collectable (for templates run via workflowTemplateRef) from argo_workflows_workflowtemplate_triggered_total via the phase=Failed label.

How long was the workflow and each step in each of the various states (by virtue of the time they entered and exited those states)

Attempting to collect step/task level data isn't possible here. See below on tracing.

How much resources did the workflow and steps use? This one is likely best answered with some kind of join with container metrics so having a label to join on would be important. Then by extension I can also answer the question, what proportion of container resource requests was used so that I can optimise it.

This won't change. You can already add labels to the workflow (or do it via the cronworkflow or event triggering) and then you'll have something to join to your pod metrics. We're not collecting step-level information, to avoid cardinality explosions.

argo_workflows_total_count won't answer the questions you're after at a per-workflow or per-step level, because it's really only a counter replacement for argo_workflows_count (which is actually a gauge).

Tracing

My intention is that once this is in I'll add opentelemetry tracing for workflows. Workflow runs will be a trace, and the individual parts of the workflow will be spans within it.

Using span-to-metrics conversion you can then generate your own metrics from this information. This should be able to collect the detail you're after. I'll do some example span-to-metrics conversions for some of these things, because they're questions I'd like to be able to answer too.
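
Not part of this proposal, but to sketch the intended shape (the helper, span names, and attribute keys below are made up; this is not the planned implementation):

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// RunWorkflow is purely illustrative: one root span per workflow run and one
// child span per node, so a span-to-metrics pipeline can derive durations.
func RunWorkflow(ctx context.Context, wfName string, nodes []string, run func(context.Context, string)) {
	tracer := otel.Tracer("argo-workflows")

	ctx, wfSpan := tracer.Start(ctx, "workflow "+wfName,
		trace.WithAttributes(attribute.String("workflow.name", wfName)))
	defer wfSpan.End()

	for _, node := range nodes {
		nodeCtx, nodeSpan := tracer.Start(ctx, "node "+node,
			trace.WithAttributes(attribute.String("node.name", node)))
		run(nodeCtx, node)
		nodeSpan.End()
	}
}
```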

bok11 commented 6 months ago

I have a few use cases I'd like to see supported:

1. Seeing the status of the individual DAG flows. Say I have a workflow defined as:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: dummy
spec:
  serviceAccountName: pod-sa
  entrypoint: extract
  templates:
    - name: extract
      inputs: {}
      outputs: {}
      dag:
        tasks:
          - name: etl-source1
            arguments:
              parameters: []
            templateRef:
              name: work
              template: work
          - name: etl-source2
            arguments:
              parameters: []
            templateRef:
              name: work
              template: work
  metrics:
    prometheus:
      - name: job_last_success_timestamp
        labels:
          - key: job
            value: dummy
          - key: job_namespace
            value: '{{workflow.namespace}}'
          - key: job_UID
            value: '{{workflow.uid}}'
        help: Unix timestamp of last successful job completion
        when: '{{status}} == Succeeded'
        gauge:
          value: '{{=workflow.creationTimestamp.s}}'
      - name: job_last_failure_timestamp
        labels:
          - key: job
            value: dummy
          - key: job_namespace
            value: '{{workflow.namespace}}'
          - key: job_UID
            value: '{{workflow.uid}}'
        help: Unix timestamp of last failed job
        when: '{{status}} == Error'
        gauge:
          value: '{{=workflow.creationTimestamp.s}}'
```

Here I can see the timestamp of the last failed or succeeded run, but I would also like to see which steps in the DAG failed. My goal: in the event some steps of the DAG failed, I would still like to know when the last successful run of each etl-source happened.

2. Custom metrics. It would be cool to get information from the individual steps somehow. My suggestion is to allow parsing of JSON objects. Let's say I have a container that outputs this to stdout:

```json
{
  "entitiesProcessed": 2000,
  "entitiesFailed": 0,
  "entitiesSkipped": 0
}
```

Now I'd like to use these like this:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: dummy
spec:
  serviceAccountName: pod-sa
  templates:
    - name: extract
      inputs: {}
      outputs: {}
      dag:
        tasks:
          - name: etl-source1
            arguments:
              parameters: []
            templateRef:
              name: work
              template: work
  metrics:
    prometheus:
      - name: entities_processed
        labels:
          - key: job
            value: dummy
          - key: job_namespace
            value: '{{workflow.namespace}}'
          - key: job_UID
            value: '{{workflow.uid}}'
        help: Number of entities processed
        when: '{{status}} == Succeeded'
        gauge:
          value: '{{=workflow.DAG.etl-source1.output.entitiesProcessed}}'
```

menzenski commented 5 months ago

Adding another vote of support for this proposal - we mostly run Argo Workflows via CronWorkflow resources that invoke WorkflowTemplate resources. The proposed new metrics argo_workflows_cronworkflows_triggered_total, argo_workflows_workflowtemplate_triggered_total, and argo_workflows_workflowtemplate_runtime will be extremely valuable for us.

If we have the originName argument sorted out (separate proposal)

I couldn't find this originName proposal, but this also sounds like it's something we'd be very interested in. (I did find https://github.com/argoproj/argo-workflows/pull/10745 ).

fstaudt commented 5 months ago

Hello @Joibel,

We would also be very interested in this feature.

When reading the proposal, I still wonder which aggregation temporality will be used for the OpenTelemetry metrics (sums or histograms): cumulative or delta?

Some backends (e.g. Prometheus) only support cumulative temporality, while others (e.g. Dynatrace) only support delta metrics. It is possible to configure an OpenTelemetry collector between the Argo workflow controller and the backend to convert cumulative metrics into delta metrics, but this has limits (e.g. when the controller is restarted or when it is scaled up).

Which temporality did you have in mind for this proposal? Could you consider supporting both temporalities and allowing users to select the one they need through controller configuration?

Joibel commented 5 months ago

@fstaudt - I hadn't considered anyone would want anything other than the default in the otel SDK, which is cumulative.

The intended usage is with a collector between the controller and the data sink.

I may implement it as part of the initial commit, but I am wary of feature creep, so may wait to do it in a later PR. It should be relatively easy to implement either way.
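
For reference, if delta temporality is supported later, wiring it in the otel-go SDK could look roughly like this (a sketch only; the configuration surface Argo would expose is not decided):

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

func main() {
	ctx := context.Background()

	// Ask the OTLP exporter to report delta temporality instead of the
	// cumulative default, for backends such as Dynatrace.
	exporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithTemporalitySelector(
			func(sdkmetric.InstrumentKind) metricdata.Temporality {
				return metricdata.DeltaTemporality
			},
		),
	)
	if err != nil {
		log.Fatal(err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter)),
	)
	defer provider.Shutdown(ctx)
}
```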

Joibel commented 2 months ago

This is merged as originally proposed.