argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

inconsistent metric `argo_workflows_count` #13296

Open · static-moonlight opened 4 months ago

static-moonlight commented 4 months ago

What happened/what did you expect to happen?

The metric `argo_workflows_count`, provided by the `/metrics` endpoint of the workflow controller, is not consistent with the current state of my Argo Workflows instance, i.e. with what the user interface shows and what can be extracted using kubectl:

```
# HELP argo_workflows_count Number of Workflows currently accessible by the controller by status (refreshed every 15s)
# TYPE argo_workflows_count gauge
argo_workflows_count{status="Error"} 1
argo_workflows_count{status="Failed"} 47
argo_workflows_count{status="Pending"} 0
argo_workflows_count{status="Running"} 6
argo_workflows_count{status="Succeeded"} 1059
```

[screenshot: workflow counts shown in the Argo Workflows UI]

The output of `kubectl get workflows -A` is consistent with the UI. The metric `argo_workflows_count`, however, shows something completely different. I would expect the metrics and the values in the user interface to be identical (aside from the 15 s refresh delay); technically speaking, I would expect the values to come from the same origin.
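For anyone wanting to reproduce the comparison, a quick side-by-side check might look like this (a sketch; it assumes the controller Deployment is named `workflow-controller` in the `argo` namespace and exposes metrics on port 9090, per the config below):

```bash
# Expose the controller's metrics port locally.
kubectl -n argo port-forward deploy/workflow-controller 9090:9090 &

# Counts as reported by the controller metric:
curl -s localhost:9090/metrics | grep '^argo_workflows_count'

# Counts as derived from the Kubernetes API (the third column is STATUS):
kubectl get workflows -A --no-headers | awk '{print $3}' | sort | uniq -c
```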

The relevant part of my config:

```yaml
metricsConfig:
  enabled: true
  path: /metrics
  port: 9090
persistence:
  archive: false
  connectionPool:
    maxIdleConns: 100
    maxOpenConns: 0
    connMaxLifetime: 0s
  postgresql:
    host: database.argo
    port: 5432
    database: argo
    [...]
artifactRepository:
  s3:
    [...]
workflowDefaults:
  spec:
    ttlStrategy:
      [...]
```

Version

3.5.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

not workflow-related, affects the workflow controller metric `argo_workflows_count`

Logs from the workflow controller

I couldn't find any logs that would explain why the metric `argo_workflows_count` doesn't line up with everything else.

Logs from in your workflow's wait container

not workflow-related, affects the workflow controller metric `argo_workflows_count`

agilgur5 commented 4 months ago

Follow-up to #13249

jswxstw commented 4 months ago

It seems that workflows deleted due to GC are also counted in the metric `argo_workflows_count`.

I misunderstood the way `argo_workflows_count` was being calculated, my bad...

Joibel commented 4 months ago

This should be a live glimpse into the informer's view of the cluster. Do you have a specific set of steps to get this to be wrong?

static-moonlight commented 4 months ago

> This should be a live glimpse into the informer's view of the cluster. Do you have a specific set of steps to get this to be wrong?

I don't know what you mean by "specific set of steps". Can you provide a command to query for them? I haven't had any failed workflows within the last 3 days, yet the metric still reports 2192 failed workflows. If there is anything lying around, I don't know where.

Joibel commented 4 months ago

I meant a set of steps to reproduce this. If I take an empty cluster, what should I do to see the wrong metrics?

I cannot see the same thing locally with a quick test in k3d, nor in our production system. The numbers match in the UI, cluster and metrics. Version 3.5.8. The vast majority of our workflows are immediately garbage collected on completion, so we have a lot of GC happening.

static-moonlight commented 4 months ago

I don't really have a bulletproof procedure for that, except (roughly sketched below):

  1. Start Argo Workflows
  2. Let workflows run (I assume any workflow would do) ... some should fail though
  3. Check the metrics after a while
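A rough sketch of such a run on an empty cluster (assuming k3d and the published install manifest for this version):

```bash
# Throwaway cluster with a stock Argo Workflows 3.5.8 install.
k3d cluster create argo-test
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.5.8/install.yaml

# A workflow that always fails, to populate the Failed bucket.
cat <<'EOF' | kubectl -n argo create -f -
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: always-fails-
spec:
  entrypoint: fail
  templates:
    - name: fail
      container:
        image: alpine:3.19
        command: [sh, -c, "exit 1"]
EOF
```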

Joibel commented 4 months ago

As I said, I've done that, and our production CI instance is basically doing this, failing workflows included. My metrics fall back to baseline numbers as expected when nothing is happening.

[screenshot: metrics returning to baseline between runs]

If you restart the workflow controller, do the numbers go back to low numbers, or do they stay high?
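(For reference, one way to bounce the controller, assuming the standard Deployment name and namespace:)

```bash
kubectl -n argo rollout restart deployment/workflow-controller
```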

static-moonlight commented 4 months ago

I think they will reset when I restart the workflow controller ... I'm checking right now.

The last restart of my workflow controller was 2 weeks ago:

```
$ kubectl get pods -n argo
NAME                                   READY   STATUS    RESTARTS   AGE
argo-server-7569f7f459-k5q69           1/1     Running   0          13d
database-0                             1/1     Running   0          13d
workflow-controller-76ffbdcf8f-xhx2f   1/1     Running   0          12d
```

... since then, metrics just build up:

[screenshot: `argo_workflows_count` building up over two weeks]

Sometimes they go down a little bit, but they don't fall back to 0.
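(To watch the build-up in one terminal, a sampling loop like the following works; it assumes the metrics port is port-forwarded to localhost:9090 as sketched earlier:)

```bash
# Print the gauge once a minute; a healthy gauge should track the actual
# number of Workflow objects in the cluster rather than grow monotonically.
while true; do
  date
  curl -s localhost:9090/metrics | grep '^argo_workflows_count'
  sleep 60
done
```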

The current state of my (test) system:

```
$ kubectl get workflows -A
NAMESPACE      NAME                                                       STATUS      AGE    MESSAGE
[...]          [...]                                                      Running     1s
[...]          [...]
[...]          [...]                                                      Succeeded   60s
[...]          [...]                                                      Running     93s
[...]          [...]                                                      Running     18s
[...]          [...]                                                      Running     12s
[...]          [...]                                                      Running     84s
[...]          [...]                                                      Running     17s
[...]          [...]                                                      Running     92s
[...]          [...]                                                      Running     12s
[...]          [...]                                                      Running     80s
[...]          [...]                                                      Running     24s
[...]          [...]                                                      Running     14s
[...]          [...]                                                      Running     87s
[...]          [...]                                                      Running     102s
[...]          [...]                                                      Running     85s
[...]          [...]                                                      Running     21s
[...]          [...]                                                      Running     13s
[...]          [...]                                                      Running     97s
[...]          [...]                                                      Running     97s
[...]          [...]                                                      Running     13s
[...]          [...]                                                      Running     85s
[...]          [...]                                                      Running     21s
[...]          [...]                                                      Succeeded   44s
[...]          [...]                                                      Running     18s
[...]          [...]                                                      Running     28s
[...]          [...]                                                      Succeeded   50s
[...]          [...]                                                      Running     11s
[...]          [...]                                                      Failed      61m
[...]          [...]                                                      Failed      46m
[...]          [...]                                                      Failed      31m
```

Meaning: I have no idea why Argo thinks there are >2k failed workflows.

EDIT: after a restart of the workflow controller, the metric falls back to 0:

[screenshot: the metric dropping to 0 after the controller restart]

Wouldn't that imply that the workflow controller has some in-memory state which doesn't necessarily reflect reality?

Joibel commented 4 months ago

> Wouldn't that imply that the workflow controller has some in-memory state which doesn't necessarily reflect reality?

It would.

But the implementation is such that this should be impossible: it uses the very well-tested informer pattern and checks the number of items in the informer.

Can you provide a way I can reproduce this? Is it something to do with the workflow controller ConfigMap? I don't mean just throwing me the whole ConfigMap, but a minimal reproduction of "if I install it like this and run these, I get the bug".

I'm unable to help without this, as I cannot reproduce it myself, and I have tried code inspection.

static-moonlight commented 3 months ago

> Can you provide a way I can reproduce this? Is it something to do with the workflow controller ConfigMap? I don't mean just throwing me the whole ConfigMap, but a minimal reproduction of "if I install it like this and run these, I get the bug".

Honestly, I can't think of anything; there is nothing special about my Argo setup.

This is my config:

```yaml
metricsConfig:
  enabled: true
  path: /metrics
  port: 9090
persistence:
  archive: false
  connectionPool:
    maxIdleConns: 100
    maxOpenConns: 0
    connMaxLifetime: 0s
  postgresql:
    host: database.argo
    port: 5432
    database: argo
    tableName: argo_workflows
    userNameSecret:
      name: database
      key: USERNAME
    passwordSecret:
      name: database
      key: PASSWORD
artifactRepository:
  s3:
    bucket: argo-artifacts
    endpoint: s3.storage:9999
    insecure: true
    accessKeySecret:
      name: artifact-repository
      key: USERNAME
    secretKeySecret:
      name: artifact-repository
      key: PASSWORD
workflowDefaults:
  spec:
    ttlStrategy:
      secondsAfterCompletion: 84600 # keep completed workflows for 1 day
      secondsAfterSuccess: 84600 # keep successful workflows for 1 day
      secondsAfterFailure: 604800 # keep failed workflows for 1 week
```

The rest is pretty much based on default settings.

I would really like to give you more information, but I'm not sure what that could be or where to get it. Are there log entries I could search for? Would I have to adjust the log level to make them visible?

Maybe this is a long game ... that would be fine though ... let's say you put in some additional debug logging or something, and with the next version I'll check what it says in my deployment?
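(For what it's worth, debug logging can already be enabled on the current version; a sketch, assuming the stock Deployment where the container has an `args` list and `--loglevel` is not yet set:)

```bash
# Append --loglevel=debug to the workflow-controller container args.
kubectl -n argo patch deployment workflow-controller --type json \
  -p '[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--loglevel=debug"}]'
```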

static-moonlight commented 2 months ago

For now I've set up and deployed a custom metrics exporter which extracts the values directly from the Kubernetes API. It's a workaround, but at least my monitoring is now working; the original Argo metrics still give me false readings.
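The exporter itself is custom, but the idea is roughly the following (a minimal sketch in the style of a node_exporter textfile collector; the output path and metric name are hypothetical):

```bash
#!/usr/bin/env bash
# Derive per-status Workflow counts straight from the Kubernetes API and
# write them out in Prometheus text exposition format.
out=/var/lib/node_exporter/textfile/argo_workflows.prom   # hypothetical path

kubectl get workflows -A --no-headers \
  | awk '{print $3}' \
  | sort | uniq -c \
  | awk '{printf "argo_workflows_count_from_api{status=\"%s\"} %d\n", $2, $1}' \
  > "$out"
```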