static-moonlight opened this issue 4 months ago (status: Open)
Follow-up to #13249
It seems that workflows deleted by GC are still counted in the metric argo_workflows_count.
I misunderstood the way argo_workflows_count was being calculated, my bad...
This should be a live glimpse into the informer's view of the cluster. Do you have a specific set of steps to get this to be wrong?
> This should be a live glimpse into the informer's view of the cluster. Do you have a specific set of steps to get this to be wrong?
I'm not sure what you mean by a "specific set of steps". Can you provide a command to query for those "specific set of steps"? I didn't have any failed workflows within the last 3 days, and the metric still reports 2192 failed workflows. If there is anything lying around, I don't know where.
I meant a set of steps to reproduce this. If I take an empty cluster, what should I do to see the wrong metrics?
I cannot see the same thing locally with a quick test in k3d, nor in our production system. The numbers match in the UI, cluster and metrics. Version 3.5.8. The vast majority of our workflows are immediately garbage collected on completion, so we have a lot of GC happening.
I can't really give a bulletproof procedure for that, except:
As I said, I've done that, and our CI production instance is basically doing this, including failing. My metrics fall back to baseline numbers as expected when nothing is happening.
If you restart the workflow controller, do the numbers go back down to low numbers, or do they stay high?
I think they will reset when I restart the workflow controller ... I'm checking right now.
The last restart of my workflow controller was 2 weeks ago:
$ kubectl get pods -n argo
NAME READY STATUS RESTARTS AGE
argo-server-7569f7f459-k5q69 1/1 Running 0 13d
database-0 1/1 Running 0 13d
workflow-controller-76ffbdcf8f-xhx2f 1/1 Running 0 12d
... since then, the metrics just build up: sometimes they go down a little bit, but they don't fall back to 0.
The current state of my (test) system:
$ kubectl get workflows -A
NAMESPACE NAME STATUS AGE MESSAGE
[...] [...] Running 1s
[...] [...]
[...] [...] Succeeded 60s
[...] [...] Running 93s
[...] [...] Running 18s
[...] [...] Running 12s
[...] [...] Running 84s
[...] [...] Running 17s
[...] [...] Running 92s
[...] [...] Running 12s
[...] [...] Running 80s
[...] [...] Running 24s
[...] [...] Running 14s
[...] [...] Running 87s
[...] [...] Running 102s
[...] [...] Running 85s
[...] [...] Running 21s
[...] [...] Running 13s
[...] [...] Running 97s
[...] [...] Running 97s
[...] [...] Running 13s
[...] [...] Running 85s
[...] [...] Running 21s
[...] [...] Succeeded 44s
[...] [...] Running 18s
[...] [...] Running 28s
[...] [...] Succeeded 50s
[...] [...] Running 11s
[...] [...] Failed 61m
[...] [...] Failed 46m
[...] [...] Failed 31m
Meaning: I have no idea why Argo thinks there are >2k failed workflows.
EDIT: after a restart of the workflow controller, it falls back to 0.
Wouldn't that imply that the workflow controller has some in-memory state which doesn't necessarily reflect the reality?
> Wouldn't that imply that the workflow controller has some in-memory state which doesn't necessarily reflect the reality?
It would.
But the implementation of it is such that this should be impossible - it's using the very well tested informer pattern and checking the number of items in the informer.
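For reference, that counting works roughly like the sketch below: a minimal illustration using client-go's dynamic informer, not Argo's actual code. The point is that the gauge is derived from whatever is currently in the informer's local store, which the watch keeps in sync with the cluster; it is not a fresh API query on every scrape, so if a delete event were ever missed, the stale object would keep being counted until a full re-list (for example after a restart).

```go
// Minimal sketch (not Argo's actual code) of a count derived from an informer's
// local store: the number comes from the cache that the watch keeps in sync,
// not from querying the API server on every scrape.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	gvr := schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflows"}
	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 0)
	informer := factory.ForResource(gvr).Informer()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, informer.HasSynced)

	for range time.Tick(15 * time.Second) {
		counts := map[string]int{}
		// Count whatever is currently in the informer's store, grouped by phase.
		for _, obj := range informer.GetStore().List() {
			u := obj.(*unstructured.Unstructured)
			phase, _, _ := unstructured.NestedString(u.Object, "status", "phase")
			counts[phase]++
		}
		// A workflow deleted by GC should disappear from the store via a delete
		// event; if it didn't, it would keep being counted here.
		fmt.Println("workflows by phase:", counts)
	}
}
```

Running this against the cluster should print counts that track kubectl get workflows -A; comparing those numbers with the controller's argo_workflows_count is essentially the check being discussed in this thread.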
Can you provide a way I can reproduce this? Something to do with the workflow controller ConfigMap? I don't mean just throw me the whole ConfigMap, but a minimal reproduction of "If I install it like this and run these, I get the bug".
I'm unable to help without this, as I cannot reproduce it myself, and I have tried code inspection.
> Can you provide a way I can reproduce this? Something to do with the workflow controller ConfigMap? I don't mean just throw me the whole ConfigMap, but a minimal reproduction of "If I install it like this and run these, I get the bug".
Honestly, I can't think of anything. There is nothing special about my Argo setup:
FIRST_TIME_USER_MODAL=false
FEEDBACK_MODAL=false
NEW_VERSION_MODAL=false
ARGO_SECURE=false
http
This is my config:
metricsConfig:
  enabled: true
  path: /metrics
  port: 9090
persistence:
  archive: false
  connectionPool:
    maxIdleConns: 100
    maxOpenConns: 0
    connMaxLifetime: 0s
  postgresql:
    host: database.argo
    port: 5432
    database: argo
    tableName: argo_workflows
    userNameSecret:
      name: database
      key: USERNAME
    passwordSecret:
      name: database
      key: PASSWORD
artifactRepository:
  s3:
    bucket: argo-artifacts
    endpoint: s3.storage:9999
    insecure: true
    accessKeySecret:
      name: artifact-repository
      key: USERNAME
    secretKeySecret:
      name: artifact-repository
      key: PASSWORD
workflowDefaults:
  spec:
    ttlStrategy:
      secondsAfterCompletion: 84600 # keep completed workflows for 1 day
      secondsAfterSuccess: 84600 # keep successful workflows for 1 day
      secondsAfterFailure: 604800 # keep failed workflows for 1 week
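A small aside on the TTL values, purely arithmetic and not a change to the config: 84600 seconds is 23 hours 30 minutes, slightly less than the "1 day" in the comments (a full day would be 86400 seconds), while 604800 seconds is exactly one week.

```go
// Quick duration check for the TTL values above; the numbers are taken from the
// config, and 86400 is my assumption of what "1 day" was meant to be.
package main

import (
	"fmt"
	"time"
)

func main() {
	fmt.Println(84600 * time.Second)  // 23h30m0s
	fmt.Println(86400 * time.Second)  // 24h0m0s (a full calendar day)
	fmt.Println(604800 * time.Second) // 168h0m0s (exactly one week)
}
```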
The rest is pretty much based on default settings.
I'd really like to give you more information, but I'm not sure what that could be or where to get it. Are there some log entries I could search for? Would I have to adjust the log level to make them visible?
Maybe this is a long game ... that would be fine though ... let's say you add some additional debug logging or something, and with the next version I'll check what it says in my deployment?
For now I've set up and deployed a custom metrics exporter, which extracts the values directly from the Kubernetes API. It's a workaround, but at least my monitoring is now working. The original Argo metrics still give me false readings.
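For completeness, such an exporter can be as small as the sketch below. This is a hedged reconstruction, not the actual workaround deployed here: the metric name custom_argo_workflows_count, the port 9091 and the 30-second poll interval are assumptions. It lists workflows directly from the Kubernetes API on every tick and publishes one gauge per phase, so it cannot drift from what kubectl get workflows -A reports.

```go
// Hypothetical sketch of a "list workflows straight from the API" exporter.
// All names, ports and intervals here are illustrative assumptions.
package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

var workflowCount = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "custom_argo_workflows_count",
		Help: "Workflows by phase, listed directly from the Kubernetes API",
	},
	[]string{"status"},
)

func main() {
	prometheus.MustRegister(workflowCount)

	cfg, err := rest.InClusterConfig() // assumes the exporter runs in-cluster
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	gvr := schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflows"}

	go func() {
		for range time.Tick(30 * time.Second) {
			// Fresh LIST across all namespaces on every tick.
			list, err := client.Resource(gvr).Namespace(metav1.NamespaceAll).List(context.Background(), metav1.ListOptions{})
			if err != nil {
				log.Println("list workflows:", err)
				continue
			}
			counts := map[string]float64{}
			for _, item := range list.Items {
				phase, _, _ := unstructured.NestedString(item.Object, "status", "phase")
				counts[phase]++
			}
			workflowCount.Reset() // drop phases that no longer exist
			for phase, n := range counts {
				workflowCount.WithLabelValues(phase).Set(n)
			}
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9091", nil))
}
```

Keeping a status label (assuming that is the label argo_workflows_count uses, as in 3.5) makes it easy to plot both series side by side and see where they diverge.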
Pre-requisites
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened/what did you expect to happen?
The metric argo_workflows_count, provided by the /metrics endpoint of the workflow controller, is not consistent with the current state of my Argo Workflows instance, with what the user interface shows, or with what can be extracted using kubectl: the output of kubectl get workflows -A is consistent with the UI, but the metric argo_workflows_count shows something completely different. I would expect the metrics and the values in the user interface to be identical (aside from a 15 sec delay); technically speaking, that the values come from the same origin.
Relevant part of my config:
Version
3.5.8
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from in your workflow's wait container