Closed: morimoto-cybozu closed this issue 3 years ago.
Hello, thanks for the detailed report! We need to investigate it; it looks like a bug.
@morimoto-cybozu, could you build vmagent from the commit eddba29664eb63214ca529bfa5d24426b91d70cd and verify whether it fixes the issue? See the build instructions for vmagent.
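For reference, a minimal build sketch, assuming the repository's standard Makefile target (the linked build instructions are authoritative):

$ git clone https://github.com/VictoriaMetrics/VictoriaMetrics.git
$ cd VictoriaMetrics
$ git checkout eddba29664eb63214ca529bfa5d24426b91d70cd
$ # build the vmagent binary (requires Go); the binary is expected under bin/vmagent
$ make vmagent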
The previous commit introduced a deadlock when discovering Kubernetes targets for role: endpoints or role: endpointslices. The commit 2ec7d8b38441548dba6bc4d4c0690604d3412395 should fix the deadlock.
The bugfix has been included in vmagent v1.59.0. @morimoto-cybozu, could you verify whether it works as expected now?
@valyala Sorry for the late reply. I executed the deployment test five times and got two successes and three failures. I'm confounded by these results. The patch you sent was an eye-opener for me; I believe we are getting closer to the core of the matter.
The version info:
$ kubectl -n monitoring exec $POD -c vmagent -- /vmagent --version
vmagent-20210503-140944-v1.59.0
The log lines extracted from the successful environment:
:
2021-05-03T18:23:51.605Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:394 started endpoints watcher for "https://10.68.0.1:443/api/v1/namespaces/monitoring/endpoints"
2021-05-03T18:23:51.624Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:500 reloaded 13 objects from "https://10.68.0.1:443/api/v1/namespaces/monitoring/endpoints"
2021-05-03T18:23:51.625Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:394 started pod watcher for "https://10.68.0.1:443/api/v1/namespaces/monitoring/pods"
2021-05-03T18:23:51.835Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:500 reloaded 29 objects from "https://10.68.0.1:443/api/v1/namespaces/monitoring/pods"
2021-05-03T18:23:51.835Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:394 started service watcher for "https://10.68.0.1:443/api/v1/namespaces/monitoring/services"
2021-05-03T18:23:52.086Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:500 reloaded 13 objects from "https://10.68.0.1:443/api/v1/namespaces/monitoring/services"
:
2021-05-03T18:23:57.171Z info VictoriaMetrics/lib/promscrape/config.go:69 started service discovery routines in 6.691 seconds
2021-05-03T18:23:57.238Z info VictoriaMetrics/lib/promscrape/scraper.go:356 kubernetes_sd_configs: added targets: 129, removed targets: 0; total targets: 129
:
2021-05-03T18:37:57.178Z info VictoriaMetrics/lib/promscrape/scraper.go:356 kubernetes_sd_configs: added targets: 2, removed targets: 0; total targets: 177
:
From the failed environment:
:
2021-05-03T18:20:05.016Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:394 started endpoints watcher for "https://10.68.0.1:443/api/v1/namespaces/monitoring/endpoints"
2021-05-03T18:20:05.220Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:500 reloaded 10 objects from "https://10.68.0.1:443/api/v1/namespaces/monitoring/endpoints"
2021-05-03T18:20:05.220Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:394 started pod watcher for "https://10.68.0.1:443/api/v1/namespaces/monitoring/pods"
2021-05-03T18:20:06.085Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:500 reloaded 26 objects from "https://10.68.0.1:443/api/v1/namespaces/monitoring/pods"
2021-05-03T18:20:06.085Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:394 started service watcher for "https://10.68.0.1:443/api/v1/namespaces/monitoring/services"
2021-05-03T18:20:06.646Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:500 reloaded 10 objects from "https://10.68.0.1:443/api/v1/namespaces/monitoring/services"
:
2021-05-03T18:20:20.422Z info VictoriaMetrics/lib/promscrape/config.go:69 started service discovery routines in 16.628 seconds
2021-05-03T18:20:20.491Z info VictoriaMetrics/lib/promscrape/scraper.go:356 kubernetes_sd_configs: added targets: 129, removed targets: 0; total targets: 129
:
2021-05-03T18:34:20.425Z info VictoriaMetrics/lib/promscrape/scraper.go:356 kubernetes_sd_configs: added targets: 8, removed targets: 0; total targets: 172
:
@morimoto-cybozu, thanks for the update. Could you try vmagent from the commit 904bbffc7f79fc54a7ed10162f3da549c05a36b3?
@morimoto-cybozu I've prepared a docker image: victoriametrics/vmagent:heads-master-0-gd9e3872b1, built from the master branch. For me it seems to work correctly.
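A rough sketch for swapping that image into the running deployment; the deployment and container names are taken from the outputs below and may differ in your setup, and an operator-managed deployment may revert a manual image change:

$ kubectl -n monitoring set image deployment/vmagent-vmagent-largeset vmagent=victoriametrics/vmagent:heads-master-0-gd9e3872b1
$ kubectl -n monitoring rollout status deployment/vmagent-vmagent-largeset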
@f41gh7 @valyala Many thanks! I've tested with 904bbff and got 1 failure.
(vmagent was running for about 1 hour)
$ kubectl -n monitoring get pod
...
vmagent-vmagent-largeset-59d5fc4f58-85gbg 2/2 Running 0 62m
...
(vmagent did not recognize the target)
$ kubectl -n monitoring exec $(kubectl -n monitoring get pod -l app.kubernetes.io/instance=vmagent-largeset -o json | jq -r '.items[0].metadata.name') -c vmagent -- curl -s http://localhost:8429/api/v1/targets | jq . | grep 'job.*vmalertmanager-smallset'
(empty)
(the container image was built with 904bbff)
$ kubectl -n monitoring exec $(kubectl -n monitoring get pod -l app.kubernetes.io/instance=vmagent-largeset -o json | jq -r '.items[0].metadata.name') -c vmagent -- /vmagent --version
vmagent-20210506-085731-tags-v1.59.0-9-g904bbffc
(vmagent recognized the target after reboot)
$ kubectl -n monitoring rollout restart deployment vmagent-vmagent-largeset
$ kubectl -n monitoring exec $(kubectl -n monitoring get pod -l app.kubernetes.io/instance=vmagent-largeset -o json | jq -r '.items[0].metadata.name') -c vmagent -- curl -s http://localhost:8429/api/v1/targets | jq . | grep 'job.*vmalertmanager-smallset'
"job": "monitoring/vmalertmanager-smallset/0",
"job": "vmalertmanager-smallset",
"job": "monitoring/vmalertmanager-smallset/0",
"job": "monitoring/vmalertmanager-smallset/0",
"job": "monitoring/vmalertmanager-smallset/0",
May I ask you to dump the content of /api/v1/targets from vmagent for the failed job and share it?
For now, I think that vmagent gets updates for endpoints before it gets updates for the corresponding pods. This leads to improper labels, which prevents discovery of the vmalertmanager targets.
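For example, something along these lines should capture the full response into a file (reusing the label selector from the earlier commands in this thread):

$ POD=$(kubectl -n monitoring get pod -l app.kubernetes.io/instance=vmagent-largeset -o json | jq -r '.items[0].metadata.name')
$ kubectl -n monitoring exec $POD -c vmagent -- curl -s http://localhost:8429/api/v1/targets > targets.json
$ # quick sanity check that the dump is valid JSON
$ jq . targets.json | head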
@f41gh7 I uploaded the targets response and related files to Google Drive: https://drive.google.com/file/d/1lt1qH6h-TXjXEdG9mEWa8QiCViBNAMPy/view?usp=sharing
The data is not confidential; I hesitate to paste it into this issue only because it is too long. If this is not an appropriate way to share my problem, please let me know.
@morimoto-cybozu, could you check whether vmagent built from the commit cfd6aa28e1dc8bbd68a7677686be7daa6008ab33 fixes the issue? This commit refreshes all the endpoints and endpointslices scrape targets once every 5 seconds, so any changes to the dependent pod and service objects should propagate to the endpoints and endpointslices scrape targets within 5 seconds.
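One way to observe whether the periodic refresh eventually picks up the missing target is to poll the targets endpoint in a loop; this is only a sketch, reusing the pod lookup from the commands above:

$ POD=$(kubectl -n monitoring get pod -l app.kubernetes.io/instance=vmagent-largeset -o json | jq -r '.items[0].metadata.name')
$ # print the number of active targets every 5 seconds
$ while true; do kubectl -n monitoring exec $POD -c vmagent -- curl -s http://localhost:8429/api/v1/targets | jq '.data.activeTargets | length'; sleep 5; done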
@valyala I tried with cfd6aa2 six times. The results were three successes and three failures.
There was a slight difference in the successful cases.
Before cfd6aa2, the output of curl .../targets | jq . | grep ... started at about 40 lines and diminished to 5 lines. With cfd6aa2, the same curl & grep constantly returned about 50 lines. In both cases, most of the lines were reported under droppedTargets.
So refreshing all scrape targets does seem to have an effect. I don't have a concrete idea for an improvement, though.
@morimoto-cybozu, thanks for the update! Could you share both the succeeded and the failed /targets output for vmagent built from cfd6aa2? This will help identify the root cause of the incorrect discovery of scrape targets.
@valyala Surely. I uploaded the targets responses and logs of (1) the successful instance, (2) the failed instance, and (3) the restarted instance in (2)'s environment. I hope this helps. https://drive.google.com/file/d/1j1LB7xFMdJncatpZcAfHZoHlT1G6mqcY/view?usp=sharing
@morimoto-cybozu, thanks for the provided information!
I investigated which jobs are missing in the targets.failed output compared to the targets.succeeded output with the following commands:
cat targets.failed | jq '.data.activeTargets[].labels.job' | sort > targets.failed.jobs
cat targets.succeeded | jq '.data.activeTargets[].labels.job' | sort > targets.succeeded.jobs
diff -u targets.succeeded.jobs targets.failed.jobs
--- targets.succeeded.jobs 2021-05-14 13:55:28.516428137 +0300
+++ targets.failed.jobs 2021-05-14 13:55:17.704519665 +0300
@@ -167,21 +167,3 @@
"vmagent-largeset"
"vmagent-largeset"
"vmagent-smallset"
-"vmalert-largeset"
-"vmalert-largeset"
-"vmalert-largeset"
-"vmalertmanager-largeset"
-"vmalertmanager-largeset"
-"vmalertmanager-largeset"
-"vmalertmanager-smallset"
-"vmalert-smallset"
-"vminsert-largeset"
-"vminsert-largeset"
-"vminsert-largeset"
-"vmselect-largeset"
-"vmselect-largeset"
-"vmselect-largeset"
-"vmsingle-smallset"
-"vmstorage-largeset"
-"vmstorage-largeset"
-"vmstorage-largeset"
It looks like the targets.failed output is missing only VictoriaMetrics-related targets. Probably these targets weren't registered in Kubernetes for some reason; this doesn't look like an issue in service discovery on the vmagent side.
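A quick way to check that hypothesis on the cluster side (a sketch; the object name in the second command is illustrative and may differ from the actual Endpoints name):

$ kubectl -n monitoring get endpoints | grep vmalert
$ # if the object exists, inspect its subsets/addresses
$ kubectl -n monitoring get endpoints vmalertmanager-smallset -o yaml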
Additional findings for targets dropped due to relabeling rules:
cat targets.succeeded | jq '.data.droppedTargets[].discoveredLabels.job' | sort | uniq -c > jobs.dropped.succeeded
cat targets.failed | jq '.data.droppedTargets[].discoveredLabels.job' | sort | uniq -c > jobs.dropped.failed
diff -u jobs.dropped.succeeded jobs.dropped.failed
--- jobs.dropped.succeeded 2021-05-14 16:24:30.115907198 +0300
+++ jobs.dropped.failed 2021-05-14 16:24:37.563813337 +0300
@@ -4,18 +4,10 @@
57 "monitoring/contour/0"
44 "monitoring/node/0"
44 "monitoring/node/1"
- 31 "monitoring/pushgateway/0"
+ 35 "monitoring/pushgateway/0"
41 "monitoring/rook/0"
51 "monitoring/sabakan/0"
51 "monitoring/vault/0"
- 31 "monitoring/victoriametrics-operator/0"
+ 36 "monitoring/victoriametrics-operator/0"
51 "monitoring/vmagent-largeset/0"
53 "monitoring/vmagent-smallset/0"
- 51 "monitoring/vmalert-largeset/0"
- 51 "monitoring/vmalertmanager-largeset/0"
- 53 "monitoring/vmalertmanager-smallset/0"
- 53 "monitoring/vmalert-smallset/0"
- 51 "monitoring/vminsert-largeset/0"
- 51 "monitoring/vmselect-largeset/0"
- 53 "monitoring/vmsingle-smallset/0"
- 51 "monitoring/vmstorage-largeset/0"
It looks like some jobs defined in the scrape configs didn't receive any updates during the failed service discovery, while other jobs such as monitoring/victoriametrics-operator/0 or monitoring/pushgateway/0 received only partial updates. This case needs additional investigation.
@morimoto-cybozu, could you run Prometheus with the same scrape configs and verify whether it always discovers the same set of scrape targets as vmagent does in the targets.succeeded case?
Could you also try running vmagent from the commit 2eed410466d335598415429a57567ab6a7ac6d1d and verify whether it fixes the issue with service discovery?
Could you also share graphs for the following metrics collected from vmagent (see these docs about vmagent metrics) during successful and failed service discovery; a command for pulling the raw values directly follows the list below:
vm_promscrape_discovery_kubernetes_url_watchers
vm_promscrape_discovery_kubernetes_subscribers
vm_promscrape_discovery_kubernetes_objects
vm_promscrape_discovery_kubernetes_objects_added_total
vm_promscrape_discovery_kubernetes_objects_removed_total
vm_promscrape_discovery_kubernetes_objects_updated_total
vm_promscrape_discovery_kubernetes_stale_resource_versions_total
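If collecting graphs is inconvenient, the raw values can also be pulled directly from the vmagent /metrics endpoint with something like (reusing the pod lookup from earlier in this thread):

$ POD=$(kubectl -n monitoring get pod -l app.kubernetes.io/instance=vmagent-largeset -o json | jq -r '.items[0].metadata.name')
$ kubectl -n monitoring exec $POD -c vmagent -- curl -s http://localhost:8429/metrics | grep vm_promscrape_discovery_kubernetes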
@valyala I'm trying to run the experiment again; I'm sorry to say that it will take a few days.
One thing I want to point out is that in the failed environment the scrape targets were detected successfully only after restarting the vmagent Pods. This would mean that the targets had already been registered correctly in the K8s API server in that environment.
Thanks for the additional details!
It looks like vmagent didn't re-read the -promscrape.config file on the first SIGHUP signal, according to the logs from https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1240#issuecomment-840977886 . This could result in an incomplete list of scrape jobs if the initial config file was missing some VictoriaMetrics-related jobs. This should be fixed in the commit c54bb73867ae0024b5acb0bcaa611a129b012abc . Could you verify whether vmagent built from this commit correctly discovers scrape targets on the first run?
Also, in case the solution above doesn't help: I've made a build of vmagent with extra discovery logging; it should help identify the source of the issue with the missing target labels. May I ask you to run it and provide the log file?
Docker image: victoriametrics/vmagent:heads-vmagent-logging-0-g6b9080c03, built from the branch https://github.com/VictoriaMetrics/VictoriaMetrics/tree/vmagent-discovery-logging
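Assuming the image is swapped into the existing deployment in the same way as before, the discovery log could be captured with something like this (the deployment name is the one from the earlier outputs and may differ in your setup):

$ kubectl -n monitoring set image deployment/vmagent-vmagent-largeset vmagent=victoriametrics/vmagent:heads-vmagent-logging-0-g6b9080c03
$ kubectl -n monitoring rollout status deployment/vmagent-vmagent-largeset
$ kubectl -n monitoring logs -l app.kubernetes.io/instance=vmagent-largeset -c vmagent --tail=-1 > vmagent-discovery.log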
FYI, the bugfix for the missing SIGHUP handling has been included in vmagent v1.60.0.
@f41gh7 @valyala I executed my CI test 8 times with v1.60.0 and got 8 successes and 0 failures. Great! I really appreciate your consideration and support!
I got the following two types of logs. In both cases the desired targets were detected.
Type 1: 'found changes in "/etc/vmagent/config_out/vmagent.env.yaml"'
...
2021-05-26T14:51:43.732Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:378 started node watcher for "https://10.68.0.1:443/api/v1/nodes"
2021-05-26T14:51:43.785Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:467 reloaded 7 objects from "https://10.68.0.1:443/api/v1/nodes"; updated=0, removed=0, added=7, resourceVersion="10672"
2021-05-26T14:51:43.786Z info VictoriaMetrics/lib/promscrape/config.go:69 started service discovery routines in 3.115 seconds
2021-05-26T14:51:43.868Z info VictoriaMetrics/lib/promscrape/scraper.go:359 kubernetes_sd_configs: added targets: 131, removed targets: 0; total targets: 131
2021-05-26T14:51:45.046Z info VictoriaMetrics/lib/promscrape/scraper.go:126 SIGHUP received; reloading Prometheus configs from "/etc/vmagent/config_out/vmagent.env.yaml"
2021-05-26T14:51:45.046Z info VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:148 SIGHUP received; reloading relabel configs pointed by -remoteWrite.relabelConfig and -remoteWrite.urlRelabelConfig
2021-05-26T14:51:45.046Z info VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:155 Successfully reloaded relabel configs
2021-05-26T14:51:45.072Z info VictoriaMetrics/lib/promscrape/scraper.go:133 nothing changed in "/etc/vmagent/config_out/vmagent.env.yaml"
2021-05-26T14:51:46.052Z info VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:148 SIGHUP received; reloading relabel configs pointed by -remoteWrite.relabelConfig and -remoteWrite.urlRelabelConfig
2021-05-26T14:51:46.052Z info VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:155 Successfully reloaded relabel configs
2021-05-26T14:51:46.052Z info VictoriaMetrics/lib/promscrape/scraper.go:126 SIGHUP received; reloading Prometheus configs from "/etc/vmagent/config_out/vmagent.env.yaml"
2021-05-26T14:51:46.071Z info VictoriaMetrics/lib/promscrape/config.go:74 stopping service discovery routines...
2021-05-26T14:51:46.071Z info VictoriaMetrics/lib/promscrape/config.go:78 stopped service discovery routines in 0.000 seconds
2021-05-26T14:51:46.071Z info VictoriaMetrics/lib/promscrape/config.go:65 starting service discovery routines...
2021-05-26T14:51:46.076Z info VictoriaMetrics/lib/promscrape/config.go:69 started service discovery routines in 0.005 seconds
2021-05-26T14:51:46.077Z info VictoriaMetrics/lib/promscrape/scraper.go:162 found changes in "/etc/vmagent/config_out/vmagent.env.yaml"; applying these changes
2021-05-26T14:51:46.143Z info VictoriaMetrics/lib/promscrape/scraper.go:359 kubernetes_sd_configs: added targets: 1, removed targets: 0; total targets: 132
...
Type 2: 'nothing changed in "/etc/vmagent/config_out/vmagent.env.yaml"'
...
2021-05-26T14:52:56.960Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:378 started pod watcher for "https://10.68.0.1:443/api/v1/namespaces/external-dns/pods"
2021-05-26T14:52:57.060Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:467 reloaded 0 objects from "https://10.68.0.1:443/api/v1/namespaces/external-dns/pods"; updated=0, removed=0, added=0, resourceVersion="10438"
2021-05-26T14:52:57.061Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:378 started pod watcher for "https://10.68.0.1:443/api/v1/namespaces/logging/pods"
2021-05-26T14:52:57.163Z info VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:148 SIGHUP received; reloading relabel configs pointed by -remoteWrite.relabelConfig and -remoteWrite.urlRelabelConfig
2021-05-26T14:52:57.163Z info VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:155 Successfully reloaded relabel configs
2021-05-26T14:52:57.298Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:467 reloaded 36 objects from "https://10.68.0.1:443/api/v1/namespaces/logging/pods"; updated=0, removed=0, added=36, resourceVersion="10442"
2021-05-26T14:52:57.299Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:378 started pod watcher for "https://10.68.0.1:443/api/v1/namespaces/metallb-system/pods"
2021-05-26T14:52:57.393Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:467 reloaded 5 objects from "https://10.68.0.1:443/api/v1/namespaces/metallb-system/pods"; updated=0, removed=0, added=5, resourceVersion="10444"
...
2021-05-26T14:52:57.590Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:378 started node watcher for "https://10.68.0.1:443/api/v1/nodes"
2021-05-26T14:52:57.647Z info VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:467 reloaded 7 objects from "https://10.68.0.1:443/api/v1/nodes"; updated=0, removed=0, added=7, resourceVersion="10448"
2021-05-26T14:52:57.648Z info VictoriaMetrics/lib/promscrape/config.go:69 started service discovery routines in 5.893 seconds
2021-05-26T14:52:57.648Z info VictoriaMetrics/lib/promscrape/scraper.go:126 SIGHUP received; reloading Prometheus configs from "/etc/vmagent/config_out/vmagent.env.yaml"
2021-05-26T14:52:57.702Z info VictoriaMetrics/lib/promscrape/scraper.go:133 nothing changed in "/etc/vmagent/config_out/vmagent.env.yaml"
2021-05-26T14:52:57.755Z info VictoriaMetrics/lib/promscrape/scraper.go:359 kubernetes_sd_configs: added targets: 129, removed targets: 0; total targets: 129
...
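For reference, the two cases above can be told apart by grepping the vmagent log for the reload messages, e.g.:

$ POD=$(kubectl -n monitoring get pod -l app.kubernetes.io/instance=vmagent-largeset -o json | jq -r '.items[0].metadata.name')
$ kubectl -n monitoring logs $POD -c vmagent | grep -E 'SIGHUP received|found changes|nothing changed'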
Thanks for the good news! Closing the issue as fixed, then.
Describe the bug
VMAgent does not start scraping some target Endpoints even though an Endpoints object exists that matches the label selector in the VMAgent config file.
To Reproduce
I'm afraid I don't have simple, concrete reproduction steps; the issue does not seem highly reproducible. I have run the same configuration in a CI test workflow multiple times and have seen 3 successes and 4 failures. The K8s cluster is created from scratch for each CI run using virtual machines. The configuration is not simple: I'm using the VictoriaMetrics operator with two VM instances, one defined by VMSingle and the other by VMCluster. The two VM instances monitor each other, and the VMCluster instance also monitors many other metrics in the K8s cluster. Under this configuration, the VMCluster instance fails to start scraping the AlertManager of the VMSingle instance.
The configuration works well with VM 1.55.1 and fails with VM 1.58.0.
I had a similar issue with VM 1.57.2. Back then I was able to solve the problem by updating the target K8s objects, e.g. by adding an annotation to the Endpoints. I tried the same thing for VM 1.58.0 with no luck. As a last resort, restarting the VMAgent Pods solves the problem.
Expected behavior
VMAgent starts scraping the targets that are specified in the configuration file and exist in the K8s cluster, without requiring a restart.
Screenshots
The configuration file for the VMAgent in the VMCluster instance contains the following setting.
The VMAgent Pod is running, and the target Service/Endpoints exists; they were created at almost the same time.
And the VMAgent Pod does not recognize the target. The following grep returns a non-empty result in a successful environment.
Version
Used command-line flags
Explicit command-line flags in the Pod definition:
Flags in /metrics:
Additional context
In the end there are 16 Endpoints objects in the target namespace, while VMAgent reports that it found 14 objects at startup.
The VMAgent continued to fail scraping some other targets; this was because nothing was listening on those metrics endpoints. I suppose this is not the cause of the issue.
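A sketch for cross-checking that Endpoints count against the cluster and the vmagent log, assuming the monitoring namespace and label selector used elsewhere in this issue:

$ kubectl -n monitoring get endpoints --no-headers | wc -l
$ POD=$(kubectl -n monitoring get pod -l app.kubernetes.io/instance=vmagent-largeset -o json | jq -r '.items[0].metadata.name')
$ # list the "reloaded N objects" lines for the endpoints watcher
$ kubectl -n monitoring logs $POD -c vmagent | grep 'reloaded' | grep '/endpoints"'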