argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

inconsistent metric `argo_workflows_count` #13296

Open static-moonlight opened 3 days ago

static-moonlight commented 3 days ago

What happened/what did you expect to happen?

The metric argo_workflows_count, provided by the /metrics endpoint of the workflow controller, is not consistent with the current state of my Argo Workflows instance, that is, with what the user interface shows and what kubectl reports:

# HELP argo_workflows_count Number of Workflows currently accessible by the controller by status (refreshed every 15s)
# TYPE argo_workflows_count gauge
argo_workflows_count{status="Error"} 1
argo_workflows_count{status="Failed"} 47
argo_workflows_count{status="Pending"} 0
argo_workflows_count{status="Running"} 6
argo_workflows_count{status="Succeeded"} 1059

The output of kubectl get workflows -A is consistent with the UI. The metric argo_workflows_count, however, shows something completely different. I would expect the metric and the values in the user interface to be identical (aside from the 15-second refresh delay). Technically speaking, I would expect both to come from the same source.
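
For anyone who wants to reproduce the comparison, here is a minimal sketch in Python. It assumes the metrics text has been fetched from the controller's /metrics endpoint and that the workflow list was exported with kubectl get workflows -A -o json. The function names (parse_workflow_counts, count_phases, diff_counts) are made up for this illustration and are not part of Argo Workflows.

```python
import json
import re
from collections import Counter

# Matches lines such as: argo_workflows_count{status="Failed"} 47
_METRIC_LINE = re.compile(r'^argo_workflows_count\{status="([^"]+)"\}\s+(\S+)$')


def parse_workflow_counts(metrics_text):
    """Return {status: count} parsed from the Prometheus text output."""
    counts = {}
    for line in metrics_text.splitlines():
        match = _METRIC_LINE.match(line.strip())
        if match:
            # Gauge values are floats in the exposition format; counts are whole numbers.
            counts[match.group(1)] = int(float(match.group(2)))
    return counts


def count_phases(workflow_list_json):
    """Count workflows per status.phase in a `kubectl get workflows -A -o json` dump."""
    items = json.loads(workflow_list_json).get("items", [])
    phases = (item.get("status", {}).get("phase") or "Unknown" for item in items)
    return dict(Counter(phases))


def diff_counts(metric_counts, cluster_counts):
    """Return {status: metric - cluster} for every status where the two disagree."""
    statuses = sorted(set(metric_counts) | set(cluster_counts))
    return {
        status: metric_counts.get(status, 0) - cluster_counts.get(status, 0)
        for status in statuses
        if metric_counts.get(status, 0) != cluster_counts.get(status, 0)
    }
```

A positive value in the result of diff_counts means the metric reports more workflows in that status than the cluster currently holds.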

Relevant part of my config:

metricsConfig:
  enabled: true
  path: /metrics
  port: 9090
persistence:
  archive: false
  connectionPool:
    maxIdleConns: 100
    maxOpenConns: 0
    connMaxLifetime: 0s
  postgresql:
    host: database.argo
    port: 5432
    database: argo
    [...]
artifactRepository:
  s3:
    [...]
workflowDefaults:
  spec:
    ttlStrategy:
      [...]

Version

3.5.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

not workflow-related, affects the workflow controller metric `argo_workflows_count`

Logs from the workflow controller

I couldn't find any specific logs that would explain why the metric `argo_workflows_count` doesn't line up with everything else.

Logs from your workflow's wait container

not workflow-related, affects the workflow controller metric `argo_workflows_count`

agilgur5 commented 3 days ago

Follow-up to #13249

jswxstw commented 2 days ago

It seems that workflows deleted by GC are still counted in the metric argo_workflows_count.
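
To illustrate how that hypothesis would produce the observed drift, here is a toy model in Python. It is purely an assumption-laden sketch and not the controller's actual code: the PhaseGauge class and its forget_on_delete flag are hypothetical names invented for this example.

```python
from collections import Counter


class PhaseGauge:
    """Toy per-phase workflow counter (hypothetical, not Argo's implementation)."""

    def __init__(self, forget_on_delete):
        # When False, deleted workflows stay in the bookkeeping, which models the suspected bug.
        self.forget_on_delete = forget_on_delete
        self.phase_by_key = {}

    def upsert(self, key, phase):
        """Record the latest phase seen for a workflow."""
        self.phase_by_key[key] = phase

    def delete(self, key):
        """Handle a workflow deletion, e.g. one triggered by a TTL-based GC."""
        if self.forget_on_delete:
            self.phase_by_key.pop(key, None)

    def counts(self):
        """Return {phase: number of tracked workflows}."""
        return dict(Counter(self.phase_by_key.values()))


def simulate(forget_on_delete):
    """Three workflows succeed, then GC deletes two of them."""
    gauge = PhaseGauge(forget_on_delete)
    for key in ("wf-a", "wf-b", "wf-c"):
        gauge.upsert(key, "Succeeded")
    for key in ("wf-a", "wf-b"):
        gauge.delete(key)
    return gauge.counts()
```

With forget_on_delete=True the simulation ends with one Succeeded workflow, matching what kubectl would list. With forget_on_delete=False it still reports three, which is the kind of inflated count described in this issue.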