argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Metric duration gauge keeps incrementing for 'Pending' status, until the workflow is deleted #7335

Open ebr opened 2 years ago

ebr commented 2 years ago

Summary

We added Prometheus realtime gauges to templates with a value of {{duration}}, labeled with {{status}}. We were surprised to see that the value of the Pending step duration keeps increasing even after the step has transitioned into the Running and Succeeded/Failed states, and beyond workflow completion. The only way to stop this metric from being emitted is to delete the workflow.

Also, the {{duration}} for the Running state does not seem to be emitted at all.

We would expect the Pending series to flatten once the step exits the Pending state, and for the Running series to be present.

A separate enhancement request might be to stop emitting realtime metrics for completed workflows, to both reduce the number of scraped series, and help report on the current state via instant queries, rather than having to calculate delta() on gauges to determine whether the value is still changing.
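
For example, the instant-query workaround referred to above looks something like this (a sketch; step_duration_gauge is the gauge defined in the workflow below, and the 1m window is arbitrary):

# returns 0 once the gauge has stopped changing over the last minute,
# i.e. the step has presumably left the Pending state
delta(step_duration_gauge{status="Pending"}[1m])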

Argo Workflows v3.2.3

Diagnostics

The workflow is a slightly modified version of the metrics example:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  creationTimestamp: "2021-12-02T23:12:31Z"
  generateName: metrics-
  generation: 5
  labels:
    workflows.argoproj.io/archive-strategy: always
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/phase: Failed
    workflows.argoproj.io/workflow-archiving-status: Archived
  managedFields:
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName: {}
    manager: argo
    operation: Update
    time: "2021-12-02T23:12:31Z"
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:workflows.argoproj.io/archive-strategy: {}
          f:workflows.argoproj.io/completed: {}
          f:workflows.argoproj.io/phase: {}
          f:workflows.argoproj.io/workflow-archiving-status: {}
      f:spec: {}
      f:status: {}
    manager: workflow-controller
    operation: Update
    time: "2021-12-02T23:13:52Z"
  name: metrics-lcvgb
  namespace: sandbox
  resourceVersion: "317267688"
  uid: a2525a26-69b0-486b-9756-475f07c56b8d
spec:
  entrypoint: main
  securityContext: {}
  serviceAccountName: argo-workflows
  templates:
  - inputs: {}
    metadata: {}
    name: main
    outputs: {}
    steps:
    - - arguments: {}
        name: random-int
        template: random-int
    - - arguments: {}
        name: flakey
        template: flakey
  - container:
      args:
      - RAND_INT=$((1 + RANDOM % 10)); echo $RAND_INT; echo $RAND_INT > /tmp/rand_int.txt;
        sleep 60
      command:
      - sh
      - -c
      image: alpine:latest
      name: ""
      resources: {}
    inputs: {}
    metadata: {}
    metrics:
      prometheus:
      - counter:
          value: "1"
        help: Count of step execution by result
        labels:
        - key: step_template
          value: random-int
        - key: workflow_namespace
          value: '{{workflow.namespace}}'
        - key: workflow_name
          value: '{{workflow.name}}'
        - key: status
          value: '{{status}}'
        name: step_result_count
      - gauge:
          realtime: true
          value: '{{duration}}'
        help: Duration of template execution
        labels:
        - key: step_template
          value: random-int
        - key: workflow_namespace
          value: '{{workflow.namespace}}'
        - key: workflow_name
          value: '{{workflow.name}}'
        - key: status
          value: '{{status}}'
        name: step_duration_gauge
    name: random-int
    outputs:
      parameters:
      - globalName: rand-int-value
        name: rand-int-value
        valueFrom:
          path: /tmp/rand_int.txt
  - container:
      args:
      - import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)
      command:
      - python
      - -c
      image: python:alpine3.6
      name: ""
      resources: {}
    inputs: {}
    metadata: {}
    metrics:
      prometheus:
      - counter:
          value: "1"
        help: Count of step execution by result
        labels:
        - key: step_template
          value: flakey
        - key: workflow_namespace
          value: '{{workflow.namespace}}'
        - key: workflow_name
          value: '{{workflow.name}}'
        - key: status
          value: '{{status}}'
        name: step_result_count
      - gauge:
          realtime: true
          value: '{{duration}}'
        help: Duration of template execution
        labels:
        - key: step_template
          value: flakey
        - key: workflow_namespace
          value: '{{workflow.namespace}}'
        - key: workflow_name
          value: '{{workflow.name}}'
        - key: status
          value: '{{status}}'
        name: step_duration_gauge
    name: flakey
    outputs: {}
  ttlStrategy:
    secondsAfterCompletion: 84600
status:
  conditions:
  - status: "False"
    type: PodRunning
  - status: "True"
    type: Completed
  finishedAt: "2021-12-02T23:13:52Z"
  message: child 'metrics-lcvgb-1593082173' failed
  nodes:
    metrics-lcvgb:
      children:
      - metrics-lcvgb-3470239446
      displayName: metrics-lcvgb
      finishedAt: "2021-12-02T23:13:52Z"
      id: metrics-lcvgb
      message: child 'metrics-lcvgb-1593082173' failed
      name: metrics-lcvgb
      phase: Failed
      progress: 2/2
      resourcesDuration:
        cpu: 122
        memory: 184
      startedAt: "2021-12-02T23:12:31Z"
      templateName: main
      templateScope: local/metrics-lcvgb
      type: Retry
    metrics-lcvgb-567759933:
      boundaryID: metrics-lcvgb-3470239446
      children:
      - metrics-lcvgb-1593082173
      displayName: '[1]'
      finishedAt: "2021-12-02T23:13:52Z"
      id: metrics-lcvgb-567759933
      message: child 'metrics-lcvgb-1593082173' failed
      name: metrics-lcvgb(0)[1]
      phase: Failed
      progress: 1/1
      resourcesDuration:
        cpu: 1
        memory: 2
      startedAt: "2021-12-02T23:13:42Z"
      templateScope: local/metrics-lcvgb
      type: StepGroup
    metrics-lcvgb-1414972696:
      boundaryID: metrics-lcvgb-3470239446
      children:
      - metrics-lcvgb-567759933
      displayName: random-int(0)
      finishedAt: "2021-12-02T23:13:32Z"
      hostNodeName: ip-192-168-37-144.us-east-2.compute.internal
      id: metrics-lcvgb-1414972696
      name: metrics-lcvgb(0)[0].random-int(0)
      outputs:
        exitCode: "0"
        parameters:
        - globalName: rand-int-value
          name: rand-int-value
          value: "10"
          valueFrom:
            path: /tmp/rand_int.txt
      phase: Succeeded
      progress: 1/1
      resourcesDuration:
        cpu: 121
        memory: 182
      startedAt: "2021-12-02T23:12:31Z"
      templateName: random-int
      templateScope: local/metrics-lcvgb
      type: Pod
    metrics-lcvgb-1593082173:
      boundaryID: metrics-lcvgb-3470239446
      children:
      - metrics-lcvgb-3459789644
      displayName: flakey
      finishedAt: "2021-12-02T23:13:52Z"
      id: metrics-lcvgb-1593082173
      message: Error (exit code 1)
      name: metrics-lcvgb(0)[1].flakey
      outputs:
        exitCode: "1"
      phase: Failed
      progress: 1/1
      resourcesDuration:
        cpu: 1
        memory: 2
      startedAt: "2021-12-02T23:13:42Z"
      templateName: flakey
      templateScope: local/metrics-lcvgb
      type: Retry
    metrics-lcvgb-3459789644:
      boundaryID: metrics-lcvgb-3470239446
      displayName: flakey(0)
      finishedAt: "2021-12-02T23:13:44Z"
      hostNodeName: ip-192-168-37-144.us-east-2.compute.internal
      id: metrics-lcvgb-3459789644
      message: Error (exit code 1)
      name: metrics-lcvgb(0)[1].flakey(0)
      outputs:
        exitCode: "1"
      phase: Failed
      progress: 1/1
      resourcesDuration:
        cpu: 1
        memory: 2
      startedAt: "2021-12-02T23:13:42Z"
      templateName: flakey
      templateScope: local/metrics-lcvgb
      type: Pod
    metrics-lcvgb-3470239446:
      children:
      - metrics-lcvgb-3721996232
      displayName: metrics-lcvgb(0)
      finishedAt: "2021-12-02T23:13:52Z"
      id: metrics-lcvgb-3470239446
      message: child 'metrics-lcvgb-1593082173' failed
      name: metrics-lcvgb(0)
      outboundNodes:
      - metrics-lcvgb-3459789644
      phase: Failed
      progress: 2/2
      resourcesDuration:
        cpu: 122
        memory: 184
      startedAt: "2021-12-02T23:12:31Z"
      templateName: main
      templateScope: local/metrics-lcvgb
      type: Steps
    metrics-lcvgb-3595365697:
      boundaryID: metrics-lcvgb-3470239446
      children:
      - metrics-lcvgb-1414972696
      displayName: random-int
      finishedAt: "2021-12-02T23:13:42Z"
      id: metrics-lcvgb-3595365697
      name: metrics-lcvgb(0)[0].random-int
      outputs:
        exitCode: "0"
        parameters:
        - globalName: rand-int-value
          name: rand-int-value
          value: "10"
          valueFrom:
            path: /tmp/rand_int.txt
      phase: Succeeded
      progress: 2/2
      resourcesDuration:
        cpu: 122
        memory: 184
      startedAt: "2021-12-02T23:12:31Z"
      templateName: random-int
      templateScope: local/metrics-lcvgb
      type: Retry
    metrics-lcvgb-3721996232:
      boundaryID: metrics-lcvgb-3470239446
      children:
      - metrics-lcvgb-3595365697
      displayName: '[0]'
      finishedAt: "2021-12-02T23:13:42Z"
      id: metrics-lcvgb-3721996232
      name: metrics-lcvgb(0)[0]
      phase: Succeeded
      progress: 2/2
      resourcesDuration:
        cpu: 122
        memory: 184
      startedAt: "2021-12-02T23:12:31Z"
      templateScope: local/metrics-lcvgb
      type: StepGroup
  outputs:
    parameters:
    - name: rand-int-value
      value: "10"
  phase: Failed
  progress: 2/2
  resourcesDuration:
    cpu: 122
    memory: 184
  startedAt: "2021-12-02T23:12:31Z"

As a result of this workflow we can see that:

  1. status="Succeeded" and status="Failed" series only start being emitted once the steps enter their respective state, and remain at that value.
  2. status="Pending" steps series continue increasing the value of the gauge, even though the steps have long left the Pending state. The expected behaviour is that once step is running, the Pending series' value should flatten.
  3. we are not seeing a value for the Running series at all.

Once the workflows are deleted, these metrics stop being emitted, as expected.

[Prometheus graph screenshots]

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

simster7 commented 2 years ago

Hmm... this is going to be tricky. I think the best answer right now is that you shouldn't use {{status}} on realtime duration metrics. The reason is mostly implementation: we currently don't record timings for specific state transitions (e.g. Pending -> Running). The only one we record is when a workflow finishes (Status.FinishedAt); we compare it to when it started (Status.StartedAt) and use that to compute the duration. If Status.FinishedAt does not exist, we simply compute the difference between now and Status.StartedAt.

Even once we fix the bug where the Pending value keeps increasing indefinitely, we currently have no way to give you information on how much time a step spent in the Pending state (since we don't record it anywhere).

If you are looking for a way to see how long your workflows are pending before they are executed, there are a number of default metrics (argo_workflows_queue_*) that could act as a proxy for this.
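
For example (a sketch, assuming the default controller metric names for this version and the workflow queue's queue_name label value; check the controller's /metrics endpoint for the exact names):

# depth of the controller's workflow queue; sustained non-zero values indicate
# workflows waiting to be reconciled (the queue_name value is an assumption)
argo_workflows_queue_depth_count{queue_name="workflow_queue"}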

ebr commented 2 years ago

The real reason I started this experiment was to see if I can identify any workflow nodes that are currently in Suspended state, so that the user is given the opportunity to resolve the root cause of some error condition before resuming the workflow. I didn't find a way to do this, because I don't think the Suspended state is recorded anywhere. (I can open a separate issue about this, unless I'm missing something)

That aside, I think fixing the Pending value increasing indefinitely would still be helpful, because then the query delta(step_duration_gauge{status="Pending"}[1m]) will return 0 once the Pending series stops increasing in value. That way we can at least infer that the node has transitioned out of Pending, and taking max(..) by (workflow_name) of that query will tell us whether a workflow has any Pending nodes.
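
Spelled out, the per-workflow version of that query would look something like this (a sketch, using the gauge and labels defined in the workflow above; the 1m window is arbitrary):

# non-zero means the workflow still has at least one node whose
# Pending duration gauge is increasing
max by (workflow_name) (delta(step_duration_gauge{status="Pending"}[1m]))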

To be clear, this is different from measuring the number of Pending pods (which we do separately), because pending workflow nodes might not have created a Pod yet. (Please correct me if I'm wrong.)

It would be nice to also have the Running state emitted, for parity with the states that can be filtered on with argo get --status=..., but that's more of a feature request. I'm going to open one for this as well as my original issue of Suspended statuses not being recorded, because ultimately I need to use the above PromQL queries with Suspended statuses above all.

Hope this makes sense?

simster7 commented 2 years ago

I didn't find a way to do this, because I don't think the Suspended state is recorded anywhere.

Suspended is inferred when either the workflow or any node in the workflow has phase Running and is of type NodeTypeSuspend. You could identify any such nodes by looking at the full workflow YAML status.
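
For reference, a suspended node in the workflow status looks roughly like this (a minimal sketch; the names and node ID are illustrative):

status:
  nodes:
    my-workflow-2557074476:
      displayName: approve      # illustrative step name
      phase: Running            # suspend nodes stay Running until resumed
      type: Suspend             # i.e. NodeTypeSuspend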

That aside, I think fixing the Pending increasing indefinitely would still be helpful.

In this case the value of step_duration_gauge{status="Pending"} can only be 0 after it transitions out (since we don't record timings).

So we can that way infer at least that the node has transitioned out of Pending.

This might be a misuse of metrics. Telemetry is not meant to answer the state of any individual workflow.

It would be nice to also have the Running state emitted

Sorry, not sure what you mean by emitted here?

ebr commented 2 years ago

Thank you for pointing out NodeTypeSuspend - this is helpful

You could identify any nodes by looking at the full workflow yaml status

That is true, but I am aiming to make this a better experience for a user without having to parse the status YAML. This is especially important for offloaded workflows (which are the majority of our workflows), since you can't simply argo get them without going through the server.

This might be a misuse of metrics.

I'm not sure I agree - we are trying to answer whether the workflow has any Pending nodes right now. This seems to align with the definitions in the linked article?

nice to also have the Running state emitted

Sorry, I mean that currently the value of the {{status}} workflow/template variable can be one of Succeeded, Pending, or Failed (possibly also Error?). I think it would be helpful to also have it set to Running when the node has phase Running.

simster7 commented 2 years ago

That is true, but I am aiming to make this a better experience for a user without having to parse the status YAML. This is especially important for offloaded workflows (which are the majority of our workflows), since you can't simply argo get them without going through the server.

This is a valid point. A new server/CLI command would probably be needed for this.

we are trying to answer whether the workflow has any Pending nodes right now. This seems to align with the definitions in the linked article?

I think what I'm trying to say is that it sounds like you're interested in a particular instance of a workflow, not, say, an aggregated number of Pending statuses across all workflows or from a constant number of workflows (e.g. cron workflows). The distinction here, while seemingly simple, is that in the former you attach a label with the unique workflow name to a metric and as you run more and more workflows, more and more metrics are created. This issue is called cardinality explosion and we do our best to prevent our users from stepping into it:

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

Source. Also see Cardinality is Key.

I think a CLI/server command to get a list of workflows with suspended nodes is probably what you're looking for here. This is, however, a poll model. If polling is not good here, then maybe this can be combined with a metric that returns the total number of suspended nodes in your cluster (without labels, hence avoiding the cardinality problem), and when that number is >0, you can query the CLI/server command. Reading metrics is also a poll model, however.
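
As a sketch of that combined approach, assuming a hypothetical label-free metric (argo_workflows_suspended_nodes_total does not exist today), the alerting expression could be as simple as:

# hypothetical metric: cluster-wide count of suspended nodes, no per-workflow labels
argo_workflows_suspended_nodes_total > 0

When that fires, the CLI/server command would then be used to find the specific workflows.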

I think it would be helpful to also have it take set to Running when the node has Phase Running.

I'll look into this, @ebr.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
