VictoriaMetrics / VictoriaMetrics

VictoriaMetrics: fast, cost-effective monitoring solution and time series database
https://victoriametrics.com/
Apache License 2.0

vmagent: failed to recognize scraping targets #1240

Closed: morimoto-cybozu closed this issue 3 years ago

morimoto-cybozu commented 3 years ago

Describe the bug

VMAgent does not start scraping some target Endpoints even though an Endpoints object exists that matches the label selectors in the VMAgent config file.

To Reproduce

I'm afraid I don't have simple, concrete reproduction steps. The issue does not seem to be highly reproducible: I have run the same configuration through a CI test workflow multiple times and seen 3 successes and 4 failures. The K8s cluster is created from scratch for each CI test using virtual machines. The configuration is not so simple. I'm using the VictoriaMetrics operator with two VM instances; one is defined by VMSingle and the other by VMCluster. The two VM instances monitor each other, and the VMCluster instance also monitors many other metrics in the K8s cluster. Under this configuration, the VMCluster instance fails to start scraping the Alertmanager of the VMSingle instance.

The configuration works well with VM 1.55.1 and fails with VM 1.58.0.

I had a similar issue with VM 1.57.2. There I could work around the problem by updating the target K8s objects, e.g. by adding an annotation to the Endpoints. I tried the same for VM 1.58.0 with no luck. As a last resort, restarting the VMAgent Pods resolves the problem.

Expected behavior

VMAgent starts scraping the targets that are specified in the configuration file and exist in the K8s cluster, without requiring a restart.

Screenshots

The configuration file for the VMAgent in the VMCluster instance contains the following setting (a short sketch of how its keep rules are evaluated follows the listing).

- job_name: monitoring/vmalertmanager-smallset/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app_kubernetes_io_component
    regex: monitoring
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app_kubernetes_io_instance
    regex: vmalertmanager-smallset
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app_kubernetes_io_name
    regex: vmalertmanager
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_managed_by
    regex: vm-operator
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http
  - source_labels:
    - __meta_kubernetes_endpoint_address_target_kind
    - __meta_kubernetes_endpoint_address_target_name
    separator: ;
    regex: Node;(.*)
    replacement: ${1}
    target_label: node
  - source_labels:
    - __meta_kubernetes_endpoint_address_target_kind
    - __meta_kubernetes_endpoint_address_target_name
    separator: ;
    regex: Pod;(.*)
    replacement: ${1}
    target_label: pod
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - target_label: endpoint
    replacement: http
  - target_label: job
    replacement: vmalertmanager-smallset
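
As context for the relabeling above: each action: keep step drops a discovered target unless the joined source label values match the anchored regex, so an endpoint survives only when all of the __meta_kubernetes_service_label_* labels are attached with the expected values. The following minimal Go sketch (illustration only, not VictoriaMetrics code; the label names and regexes are copied from the config above) shows that evaluation:

package main

import (
    "fmt"
    "regexp"
)

// keepRule mirrors a single `action: keep` relabel step: the target survives
// only if the source label value matches the anchored regex.
type keepRule struct {
    sourceLabel string
    regex       *regexp.Regexp
}

// kept reports whether a target with the given discovered labels passes all
// keep rules; a single non-matching rule drops the target.
func kept(labels map[string]string, rules []keepRule) bool {
    for _, r := range rules {
        if !r.regex.MatchString(labels[r.sourceLabel]) {
            return false
        }
    }
    return true
}

func main() {
    rules := []keepRule{
        {"__meta_kubernetes_service_label_app_kubernetes_io_component", regexp.MustCompile(`^(?:monitoring)$`)},
        {"__meta_kubernetes_service_label_app_kubernetes_io_instance", regexp.MustCompile(`^(?:vmalertmanager-smallset)$`)},
        {"__meta_kubernetes_service_label_app_kubernetes_io_name", regexp.MustCompile(`^(?:vmalertmanager)$`)},
        {"__meta_kubernetes_service_label_managed_by", regexp.MustCompile(`^(?:vm-operator)$`)},
        {"__meta_kubernetes_endpoint_port_name", regexp.MustCompile(`^(?:http)$`)},
    }

    // Target whose Service labels were joined onto the Endpoints metadata.
    full := map[string]string{
        "__meta_kubernetes_service_label_app_kubernetes_io_component": "monitoring",
        "__meta_kubernetes_service_label_app_kubernetes_io_instance":  "vmalertmanager-smallset",
        "__meta_kubernetes_service_label_app_kubernetes_io_name":      "vmalertmanager",
        "__meta_kubernetes_service_label_managed_by":                  "vm-operator",
        "__meta_kubernetes_endpoint_port_name":                        "http",
    }
    // Target discovered without the Service metadata: the service labels are
    // missing, so the keep rules drop it even though the Endpoints object exists.
    partial := map[string]string{
        "__meta_kubernetes_endpoint_port_name": "http",
    }

    fmt.Println(kept(full, rules))    // true
    fmt.Println(kept(partial, rules)) // false
}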

The VMAgent Pod is running and the target Service/Endpoints exist; they were created at almost the same time.

$ kubectl -n monitoring get pod vmagent-vmagent-largeset-768f7c64d5-26ckz; kubectl -n monitoring get service vmalertmanager-vmalertmanager-smallset --show-labels; kubectl -n monitoring get endpoints vmalertmanager-vmalertmanager-smallset --show-labels
NAME                                        READY   STATUS    RESTARTS   AGE
vmagent-vmagent-largeset-768f7c64d5-26ckz   2/2     Running   0          63m
NAME                                     TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE   LABELS
vmalertmanager-vmalertmanager-smallset   ClusterIP   None         <none>        9093/TCP,9094/TCP,9094/UDP   63m   app.kubernetes.io/component=monitoring,app.kubernetes.io/instance=vmalertmanager-smallset,app.kubernetes.io/name=vmalertmanager,managed-by=vm-operator
NAME                                     ENDPOINTS                                            AGE   LABELS
vmalertmanager-vmalertmanager-smallset   10.64.0.190:9094,10.64.0.190:9094,10.64.0.190:9093   63m   app.kubernetes.io/component=monitoring,app.kubernetes.io/instance=vmalertmanager-smallset,app.kubernetes.io/name=vmalertmanager,managed-by=vm-operator,service.kubernetes.io/headless=

Yet the VMAgent Pod does not recognize the target. The following grep returns a non-empty result in a successful environment.

$ kubectl -n monitoring exec vmagent-vmagent-largeset-768f7c64d5-26ckz -c vmagent -- curl -s http://localhost:8429/api/v1/targets | jq . | grep 'job.*vmalertmanager-smallset'
(empty)

Version

$ kubectl -n monitoring exec vmagent-vmagent-largeset-768f7c64d5-26ckz -c vmagent -- /vmagent --version
vmagent-20210409-110604-v1.58.0

Used command-line flags

Explicit command-line flags in the Pod definition:

    - -httpListenAddr=:8429
    - -promscrape.config=/etc/vmagent/config_out/vmagent.env.yaml
    - -promscrape.maxScrapeSize=33554432
    - -remoteWrite.maxDiskUsagePerURL=1073741824
    - -remoteWrite.tmpDataPath=/tmp/vmagent-remotewrite-data
    - -remoteWrite.url=http://vminsert-vmcluster-largeset.monitoring.svc:8480/insert/0/prometheus/api/v1/write

Flags in /metrics:

$ kubectl -n monitoring exec vmagent-vmagent-largeset-768f7c64d5-26ckz -c vmagent -- curl -s http://localhost:8429/metrics | grep '^flag'
flag{name="csvTrimTimestamp", value="1ms"} 1
flag{name="dryRun", value="false"} 1
flag{name="enableTCP6", value="false"} 1
flag{name="envflag.enable", value="false"} 1
flag{name="envflag.prefix", value=""} 1
flag{name="fs.disableMmap", value="false"} 1
flag{name="graphiteListenAddr", value=""} 1
flag{name="graphiteTrimTimestamp", value="1s"} 1
flag{name="http.connTimeout", value="2m0s"} 1
flag{name="http.disableResponseCompression", value="false"} 1
flag{name="http.idleConnTimeout", value="1m0s"} 1
flag{name="http.maxGracefulShutdownDuration", value="7s"} 1
flag{name="http.pathPrefix", value=""} 1
flag{name="http.shutdownDelay", value="0s"} 1
flag{name="httpAuth.password", value="secret"} 1
flag{name="httpAuth.username", value=""} 1
flag{name="httpListenAddr", value=":8429"} 1
flag{name="import.maxLineLen", value="104857600"} 1
flag{name="influx.databaseNames", value=""} 1
flag{name="influx.maxLineSize", value="262144"} 1
flag{name="influxListenAddr", value=""} 1
flag{name="influxMeasurementFieldSeparator", value="_"} 1
flag{name="influxSkipMeasurement", value="false"} 1
flag{name="influxSkipSingleField", value="false"} 1
flag{name="influxTrimTimestamp", value="1ms"} 1
flag{name="insert.maxQueueDuration", value="1m0s"} 1
flag{name="loggerDisableTimestamps", value="false"} 1
flag{name="loggerErrorsPerSecondLimit", value="0"} 1
flag{name="loggerFormat", value="default"} 1
flag{name="loggerLevel", value="INFO"} 1
flag{name="loggerOutput", value="stderr"} 1
flag{name="loggerTimezone", value="UTC"} 1
flag{name="loggerWarnsPerSecondLimit", value="0"} 1
flag{name="maxConcurrentInserts", value="40"} 1
flag{name="maxInsertRequestSize", value="33554432"} 1
flag{name="memory.allowedBytes", value="0"} 1
flag{name="memory.allowedPercent", value="60"} 1
flag{name="metricsAuthKey", value="secret"} 1
flag{name="opentsdbHTTPListenAddr", value=""} 1
flag{name="opentsdbListenAddr", value=""} 1
flag{name="opentsdbTrimTimestamp", value="1s"} 1
flag{name="opentsdbhttp.maxInsertRequestSize", value="33554432"} 1
flag{name="opentsdbhttpTrimTimestamp", value="1ms"} 1
flag{name="pprofAuthKey", value="secret"} 1
flag{name="promscrape.cluster.memberNum", value="0"} 1
flag{name="promscrape.cluster.membersCount", value="0"} 1
flag{name="promscrape.cluster.replicationFactor", value="1"} 1
flag{name="promscrape.config", value="/etc/vmagent/config_out/vmagent.env.yaml"} 1
flag{name="promscrape.config.dryRun", value="false"} 1
flag{name="promscrape.config.strictParse", value="false"} 1
flag{name="promscrape.configCheckInterval", value="0s"} 1
flag{name="promscrape.consul.waitTime", value="0s"} 1
flag{name="promscrape.consulSDCheckInterval", value="30s"} 1
flag{name="promscrape.disableCompression", value="false"} 1
flag{name="promscrape.disableKeepAlive", value="false"} 1
flag{name="promscrape.discovery.concurrency", value="100"} 1
flag{name="promscrape.discovery.concurrentWaitTime", value="1m0s"} 1
flag{name="promscrape.dnsSDCheckInterval", value="30s"} 1
flag{name="promscrape.dockerswarmSDCheckInterval", value="30s"} 1
flag{name="promscrape.dropOriginalLabels", value="false"} 1
flag{name="promscrape.ec2SDCheckInterval", value="1m0s"} 1
flag{name="promscrape.eurekaSDCheckInterval", value="30s"} 1
flag{name="promscrape.fileSDCheckInterval", value="30s"} 1
flag{name="promscrape.gceSDCheckInterval", value="1m0s"} 1
flag{name="promscrape.kubernetes.apiServerTimeout", value="30m0s"} 1
flag{name="promscrape.kubernetesSDCheckInterval", value="30s"} 1
flag{name="promscrape.maxDroppedTargets", value="1000"} 1
flag{name="promscrape.maxScrapeSize", value="33554432"} 1
flag{name="promscrape.openstackSDCheckInterval", value="30s"} 1
flag{name="promscrape.streamParse", value="false"} 1
flag{name="promscrape.suppressDuplicateScrapeTargetErrors", value="false"} 1
flag{name="promscrape.suppressScrapeErrors", value="false"} 1
flag{name="remoteWrite.basicAuth.password", value="secret"} 1
flag{name="remoteWrite.basicAuth.username", value=""} 1
flag{name="remoteWrite.bearerToken", value="secret"} 1
flag{name="remoteWrite.flushInterval", value="1s"} 1
flag{name="remoteWrite.label", value=""} 1
flag{name="remoteWrite.maxBlockSize", value="8388608"} 1
flag{name="remoteWrite.maxDiskUsagePerURL", value="1073741824"} 1
flag{name="remoteWrite.proxyURL", value=""} 1
flag{name="remoteWrite.queues", value="4"} 1
flag{name="remoteWrite.rateLimit", value=""} 1
flag{name="remoteWrite.relabelConfig", value=""} 1
flag{name="remoteWrite.roundDigits", value=""} 1
flag{name="remoteWrite.sendTimeout", value=""} 1
flag{name="remoteWrite.showURL", value="false"} 1
flag{name="remoteWrite.significantFigures", value=""} 1
flag{name="remoteWrite.tlsCAFile", value=""} 1
flag{name="remoteWrite.tlsCertFile", value=""} 1
flag{name="remoteWrite.tlsInsecureSkipVerify", value=""} 1
flag{name="remoteWrite.tlsKeyFile", value="secret"} 1
flag{name="remoteWrite.tlsServerName", value=""} 1
flag{name="remoteWrite.tmpDataPath", value="/tmp/vmagent-remotewrite-data"} 1
flag{name="remoteWrite.url", value="secret"} 1
flag{name="remoteWrite.urlRelabelConfig", value=""} 1
flag{name="sortLabels", value="false"} 1
flag{name="tls", value="false"} 1
flag{name="tlsCertFile", value=""} 1
flag{name="tlsKeyFile", value="secret"} 1
flag{name="version", value="false"} 1

Additional context

There are eventually 16 Endpoints objects in the target namespace, but the VMAgent reports finding only 14 objects at startup.

$ kubectl -n monitoring get endpoints --no-headers | wc -l
16
$ kubectl -n monitoring logs vmagent-vmagent-largeset-768f7c64d5-26ckz -c vmagent
(snip)
2021-04-22T10:00:20.675Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:419  started endpoints watcher for "https://10.68.0.1:443/api/v1/namespaces/monitoring/endpoints"
2021-04-22T10:00:20.954Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:522  reloaded 14 objects from "https://10.68.0.1:443/api/v1/namespaces/monitoring/endpoints"
(snip)

The VMAgent also kept failing to scrape some other targets. This was because nothing was listening on those metrics endpoints, so I suppose it is not the cause of this issue.

2021-04-22T10:45:07.447Z        error   VictoriaMetrics/lib/promscrape/scrapework.go:258        error when scraping "http://10.72.48.65:8080/metrics" from job "coil-egress" with labels {container="egress",endpoint="metrics",instance="10.72.48.65:8080",job="coil-egress",namespace="domestic-egress",pod="nat-57bbb5f748-k6bp4",prometheus="monitoring/vmagent-largeset"}: error when scraping "http://10.72.48.65:8080/metrics": dial tcp4 10.72.48.65:8080: i/o timeout; try -enableTCP6 command-line flag if you scrape ipv6 addresses
(and the same error messages...)
f41gh7 commented 3 years ago

Hello, thanks for the detailed report! We need to investigate it; it looks like a bug.

valyala commented 3 years ago

@morimoto-cybozu , could you build vmagent from the commit eddba29664eb63214ca529bfa5d24426b91d70cd and verify whether it fixes the issue? See build instructions for vmagent.

valyala commented 3 years ago

The previous commit introduced a deadlock when discovering Kubernetes targets for role: endpoints or role: endpointslices. The commit 2ec7d8b38441548dba6bc4d4c0690604d3412395 should fix the deadlock.
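
For illustration only (this is not the actual api_watcher code): the classic shape of such a deadlock is two goroutines taking the same two locks in opposite order, for example a watch-event handler and a periodic target refresher:

package main

import "sync"

// Generic illustration of a lock-ordering deadlock; the real code differs.
type watcher struct {
    cacheMu   sync.Mutex // guards discovered objects
    targetsMu sync.Mutex // guards generated scrape targets
}

// onWatchEvent locks the cache first, then the targets.
func (w *watcher) onWatchEvent() {
    w.cacheMu.Lock()
    defer w.cacheMu.Unlock()
    w.targetsMu.Lock() // blocks while refreshTargets holds targetsMu
    defer w.targetsMu.Unlock()
}

// refreshTargets locks the targets first, then the cache: the opposite order.
func (w *watcher) refreshTargets() {
    w.targetsMu.Lock()
    defer w.targetsMu.Unlock()
    w.cacheMu.Lock() // blocks while onWatchEvent holds cacheMu -> deadlock
    defer w.cacheMu.Unlock()
}

func main() {
    w := &watcher{}
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        for i := 0; i < 1000; i++ {
            w.onWatchEvent()
        }
    }()
    go func() {
        defer wg.Done()
        for i := 0; i < 1000; i++ {
            w.refreshTargets()
        }
    }()
    wg.Wait() // usually hangs: both goroutines end up waiting on each other's lock
}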

valyala commented 3 years ago

The bugfix has been included in vmagent v1.59.0. @morimoto-cybozu , could you verify whether it works as expected now?

morimoto-cybozu commented 3 years ago

@valyala Sorry for the late reply. I executed the deployment test five times and got two successes and three failures. I'm puzzled by these results. The patch you sent was an eye-opener for me; I believe we are getting closer to the core of the matter.

The version info:

$ kubectl -n monitoring exec $POD -c vmagent -- /vmagent --version
vmagent-20210503-140944-v1.59.0

The log lines extracted from the successful environment:

:
2021-05-03T18:23:51.605Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:394  started endpoints watcher for "https://10.68.0.1:443/api/v1/namespaces/monitoring/endpoints"
2021-05-03T18:23:51.624Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:500  reloaded 13 objects from "https://10.68.0.1:443/api/v1/namespaces/monitoring/endpoints"
2021-05-03T18:23:51.625Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:394  started pod watcher for "https://10.68.0.1:443/api/v1/namespaces/monitoring/pods"
2021-05-03T18:23:51.835Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:500  reloaded 29 objects from "https://10.68.0.1:443/api/v1/namespaces/monitoring/pods"
2021-05-03T18:23:51.835Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:394  started service watcher for "https://10.68.0.1:443/api/v1/namespaces/monitoring/services"
2021-05-03T18:23:52.086Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:500  reloaded 13 objects from "https://10.68.0.1:443/api/v1/namespaces/monitoring/services"
:
2021-05-03T18:23:57.171Z        info    VictoriaMetrics/lib/promscrape/config.go:69     started service discovery routines in 6.691 seconds
2021-05-03T18:23:57.238Z        info    VictoriaMetrics/lib/promscrape/scraper.go:356   kubernetes_sd_configs: added targets: 129, removed targets: 0; total targets: 129
:
2021-05-03T18:37:57.178Z        info    VictoriaMetrics/lib/promscrape/scraper.go:356   kubernetes_sd_configs: added targets: 2, removed targets: 0; total targets: 177
:

From the failed environment:

:
2021-05-03T18:20:05.016Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:394  started endpoints watcher for "https://10.68.0.1:443/api/v1/namespaces/monitoring/endpoints"
2021-05-03T18:20:05.220Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:500  reloaded 10 objects from "https://10.68.0.1:443/api/v1/namespaces/monitoring/endpoints"
2021-05-03T18:20:05.220Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:394  started pod watcher for "https://10.68.0.1:443/api/v1/namespaces/monitoring/pods"
2021-05-03T18:20:06.085Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:500  reloaded 26 objects from "https://10.68.0.1:443/api/v1/namespaces/monitoring/pods"
2021-05-03T18:20:06.085Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:394  started service watcher for "https://10.68.0.1:443/api/v1/namespaces/monitoring/services"
2021-05-03T18:20:06.646Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:500  reloaded 10 objects from "https://10.68.0.1:443/api/v1/namespaces/monitoring/services"
:
2021-05-03T18:20:20.422Z        info    VictoriaMetrics/lib/promscrape/config.go:69     started service discovery routines in 16.628 seconds
2021-05-03T18:20:20.491Z        info    VictoriaMetrics/lib/promscrape/scraper.go:356   kubernetes_sd_configs: added targets: 129, removed targets: 0; total targets: 129
:
2021-05-03T18:34:20.425Z        info    VictoriaMetrics/lib/promscrape/scraper.go:356   kubernetes_sd_configs: added targets: 8, removed targets: 0; total targets: 172
:
valyala commented 3 years ago

@morimoto-cybozu , thanks for the update. Could you try vmagent from the commit 904bbffc7f79fc54a7ed10162f3da549c05a36b3 ?

f41gh7 commented 3 years ago

@morimoto-cybozu I've prepared a docker image, victoriametrics/vmagent:heads-master-0-gd9e3872b1, built from the master branch. For me it seems to work correctly.

morimoto-cybozu commented 3 years ago

@f41gh7 @valyala Many thanks! I've tested with 904bbff and got 1 failure.

(vmagent was running for about 1 hour)
$ kubectl -n monitoring get pod
...
vmagent-vmagent-largeset-59d5fc4f58-85gbg      2/2     Running            0          62m
...

(vmagent did not recognize the target)
$ kubectl -n monitoring exec $(kubectl -n monitoring get pod -l app.kubernetes.io/instance=vmagent-largeset -o json | jq -r '.items[0].metadata.name') -c vmagent -- curl -s http://localhost:8429/api/v1/targets | jq . | grep 'job.*vmalertmanager-smallset'
(empty)

(the container image was built with 904bbff)
$ kubectl -n monitoring exec $(kubectl -n monitoring get pod -l app.kubernetes.io/instance=vmagent-largeset -o json | jq -r '.items[0].metadata.name') -c vmagent -- /vmagent --version
vmagent-20210506-085731-tags-v1.59.0-9-g904bbffc

(vmagent recognized the target after reboot)
$ kubectl -n monitoring rollout restart deployment vmagent-vmagent-largeset
$ kubectl -n monitoring exec $(kubectl -n monitoring get pod -l app.kubernetes.io/instance=vmagent-largeset -o json | jq -r '.items[0].metadata.name') -c vmagent -- curl -s http://localhost:8429/api/v1/targets | jq . | grep 'job.*vmalertmanager-smallset'
          "job": "monitoring/vmalertmanager-smallset/0",
          "job": "vmalertmanager-smallset",
          "job": "monitoring/vmalertmanager-smallset/0",
          "job": "monitoring/vmalertmanager-smallset/0",
          "job": "monitoring/vmalertmanager-smallset/0",
f41gh7 commented 3 years ago

May I ask you to dump the content of /api/v1/targets from vmagent for the failed job and share it?

For now, I think that vmagent gets updates for Endpoints before it gets updates for the corresponding Pods and Services. This leads to incomplete labels, which prevents discovery of the vmalertmanager targets.
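
If that is what happens, a simplified picture looks like this: the Endpoints event is processed before the owning Service is in the discovery cache, so the __meta_kubernetes_service_label_* labels cannot be attached, and the keep rules shown earlier drop the target. A small Go sketch of such a label join (hypothetical types and function names, not the vmagent implementation):

package main

import "fmt"

// cache is a hypothetical, simplified discovery cache keyed by Service name.
type cache struct {
    services map[string]map[string]string // service name -> service labels
}

// buildEndpointLabels joins Endpoints metadata with the labels of the owning
// Service. If the Service has not been cached yet (the suspected race), the
// __meta_kubernetes_service_label_* labels are simply absent.
func (c *cache) buildEndpointLabels(serviceName, portName string) map[string]string {
    labels := map[string]string{
        "__meta_kubernetes_endpoint_port_name": portName,
        "__meta_kubernetes_service_name":       serviceName,
    }
    if svcLabels, ok := c.services[serviceName]; ok {
        for k, v := range svcLabels {
            labels["__meta_kubernetes_service_label_"+sanitize(k)] = v
        }
    }
    return labels
}

// sanitize replaces characters that are invalid in label names with '_',
// e.g. "app.kubernetes.io/name" -> "app_kubernetes_io_name".
func sanitize(s string) string {
    out := []rune(s)
    for i, r := range out {
        if !(r == '_' || (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9')) {
            out[i] = '_'
        }
    }
    return string(out)
}

func main() {
    c := &cache{services: map[string]map[string]string{}}

    // Service not cached yet: only endpoint-level labels are produced, so the
    // keep rules on the service labels drop this target.
    fmt.Println(c.buildEndpointLabels("vmalertmanager-vmalertmanager-smallset", "http"))

    // After the Service update is processed, the same endpoint gets the full
    // label set and passes the keep rules.
    c.services["vmalertmanager-vmalertmanager-smallset"] = map[string]string{
        "app.kubernetes.io/name": "vmalertmanager",
    }
    fmt.Println(c.buildEndpointLabels("vmalertmanager-vmalertmanager-smallset", "http"))
}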

morimoto-cybozu commented 3 years ago

@f41gh7 I uploaded the targets response and related files to Google Drive: https://drive.google.com/file/d/1lt1qH6h-TXjXEdG9mEWa8QiCViBNAMPy/view?usp=sharing

The data is not confidential; I hesitate to paste it into this issue only because it is too long. If this is not an appropriate way to share my problem, please let me know.

valyala commented 3 years ago

@morimoto-cybozu , could you check whether vmagent built from the commit cfd6aa28e1dc8bbd68a7677686be7daa6008ab33 fixes the issue? This commit refreshes all the endpoints and endpointslices scrape targets once per 5 seconds, so any changes to the dependent Pod and Service objects should propagate to the endpoints and endpointslices scrape targets within 5 seconds.
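
A minimal sketch of that kind of periodic re-sync (assumed behavior inferred from the description above, not the code of the commit itself):

package main

import (
    "fmt"
    "time"
)

// refreshEvery rebuilds the endpoints/endpointslices scrape targets on a fixed
// interval, so labels from Pod/Service objects that arrived after the
// Endpoints event are picked up within one tick.
func refreshEvery(interval time.Duration, rebuildTargets func(), stop <-chan struct{}) {
    t := time.NewTicker(interval)
    defer t.Stop()
    for {
        select {
        case <-t.C:
            rebuildTargets()
        case <-stop:
            return
        }
    }
}

func main() {
    stop := make(chan struct{})
    go refreshEvery(5*time.Second, func() {
        fmt.Println("rebuilding endpoints/endpointslices scrape targets")
    }, stop)
    time.Sleep(12 * time.Second) // observe a couple of refresh cycles
    close(stop)
}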

morimoto-cybozu commented 3 years ago

@valyala I tried with cfd6aa2 six times. The results were 3 successes and 3 failures.

There was a slight difference in the successful cases. Before cfd6aa2, the result of curl .../targets | jq . | grep ... started at about 40 lines and diminished to 5 lines. With cfd6aa2, curl & grep consistently returned about 50 lines. In both cases, most of the lines appeared as droppedTargets. So refreshing all scrape targets does seem to have some effect, although I don't have a concrete idea for an improvement.

valyala commented 3 years ago

@morimoto-cybozu , thanks for the update! Could you share the /targets output for both a successful and a failed run of vmagent built from cfd6aa2? This will help identify the root cause of the incorrect discovery of scrape targets.

morimoto-cybozu commented 3 years ago

@valyala Sure. I uploaded the targets responses and logs of (1) the successful instance, (2) the failed instance, and (3) the restarted instance in (2)'s environment. I hope this helps. https://drive.google.com/file/d/1j1LB7xFMdJncatpZcAfHZoHlT1G6mqcY/view?usp=sharing

valyala commented 3 years ago

@morimoto-cybozu , thanks for the provided information!

I investigated which jobs are missing from the targets.failed output compared to the targets.succeeded output with the following commands:

cat targets.failed | jq '.data.activeTargets[].labels.job' | sort > targets.failed.jobs
cat targets.succeeded | jq '.data.activeTargets[].labels.job' | sort > targets.succeeded.jobs
diff -u targets.succeeded.jobs targets.failed.jobs 
--- targets.succeeded.jobs  2021-05-14 13:55:28.516428137 +0300
+++ targets.failed.jobs 2021-05-14 13:55:17.704519665 +0300
@@ -167,21 +167,3 @@
 "vmagent-largeset"
 "vmagent-largeset"
 "vmagent-smallset"
-"vmalert-largeset"
-"vmalert-largeset"
-"vmalert-largeset"
-"vmalertmanager-largeset"
-"vmalertmanager-largeset"
-"vmalertmanager-largeset"
-"vmalertmanager-smallset"
-"vmalert-smallset"
-"vminsert-largeset"
-"vminsert-largeset"
-"vminsert-largeset"
-"vmselect-largeset"
-"vmselect-largeset"
-"vmselect-largeset"
-"vmsingle-smallset"
-"vmstorage-largeset"
-"vmstorage-largeset"
-"vmstorage-largeset"

It looks like targets.failed is missing only VictoriaMetrics-related targets. Probably these targets weren't registered in Kubernetes for some reason. This doesn't look like an issue with service discovery on the vmagent side.

valyala commented 3 years ago

More findings about targets dropped due to relabeling rules:

cat targets.succeeded | jq '.data.droppedTargets[].discoveredLabels.job' | sort | uniq -c > jobs.dropped.succeeded
cat targets.failed | jq '.data.droppedTargets[].discoveredLabels.job' | sort | uniq -c > jobs.dropped.failed
diff -u jobs.dropped.succeeded jobs.dropped.failed 
--- jobs.dropped.succeeded  2021-05-14 16:24:30.115907198 +0300
+++ jobs.dropped.failed 2021-05-14 16:24:37.563813337 +0300
@@ -4,18 +4,10 @@
      57 "monitoring/contour/0"
      44 "monitoring/node/0"
      44 "monitoring/node/1"
-     31 "monitoring/pushgateway/0"
+     35 "monitoring/pushgateway/0"
      41 "monitoring/rook/0"
      51 "monitoring/sabakan/0"
      51 "monitoring/vault/0"
-     31 "monitoring/victoriametrics-operator/0"
+     36 "monitoring/victoriametrics-operator/0"
      51 "monitoring/vmagent-largeset/0"
      53 "monitoring/vmagent-smallset/0"
-     51 "monitoring/vmalert-largeset/0"
-     51 "monitoring/vmalertmanager-largeset/0"
-     53 "monitoring/vmalertmanager-smallset/0"
-     53 "monitoring/vmalert-smallset/0"
-     51 "monitoring/vminsert-largeset/0"
-     51 "monitoring/vmselect-largeset/0"
-     53 "monitoring/vmsingle-smallset/0"
-     51 "monitoring/vmstorage-largeset/0"

It looks like some jobs defined in the scrape configs didn't receive any updates during the failed service discovery, while other jobs such as monitoring/victoriametrics-operator/0 or monitoring/pushgateway/0 received only partial updates. This case needs additional investigation.

valyala commented 3 years ago

@morimoto-cybozu , could you run Prometheus with the same scrape configs and verify whether it always discovers the same set of scrape targets as vmagent in the targets.succeeded case?

valyala commented 3 years ago

Could you also try running vmagent from the commit 2eed410466d335598415429a57567ab6a7ac6d1d and verify whether it fixes the issue with service discovery?

Could you also share graphs for the following metrics collected from vmagent (see these docs about vmagent metrics) during successful and failed service discovery:

morimoto-cybozu commented 3 years ago

@valyala I'm trying to set up the experiment again, and I'm sorry to say that it will take a few days.

One thing I want to point out is that the scrape targets were detected successfully in the failed environment just by restarting the vmagent Pods. This means that the targets had already been registered correctly in the K8s API server in that environment.

valyala commented 3 years ago

Thanks for additional details!

It looks like vmagent didn't re-read the -promscrape.config file on the first SIGHUP signal, according to the logs from https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1240#issuecomment-840977886. This could result in an incomplete list of scrape jobs if the initial config file was missing some VictoriaMetrics-related jobs. This should be fixed in the commit c54bb73867ae0024b5acb0bcaa611a129b012abc. Could you verify whether vmagent built from this commit correctly discovers scrape targets on the first run?
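
For reference, a generic sketch of such a SIGHUP-driven reload loop (not the vmagent source; loadConfig is a hypothetical helper). The key point is that a missed or ignored reload leaves the initial, possibly incomplete config in effect until the process is restarted:

package main

import (
    "fmt"
    "os"
    "os/signal"
    "syscall"
)

// loadConfig stands in for re-reading the file passed via -promscrape.config.
func loadConfig(path string) (string, error) {
    data, err := os.ReadFile(path)
    return string(data), err
}

func main() {
    const path = "/etc/vmagent/config_out/vmagent.env.yaml"

    // Register the SIGHUP handler before anything can send a reload signal;
    // signals delivered before signal.Notify is called are lost.
    sighup := make(chan os.Signal, 1)
    signal.Notify(sighup, syscall.SIGHUP)

    cfg, err := loadConfig(path)
    if err != nil {
        fmt.Fprintln(os.Stderr, "initial config load failed:", err)
    }

    for range sighup {
        next, err := loadConfig(path)
        if err != nil {
            fmt.Fprintln(os.Stderr, "config reload failed; keeping previous config:", err)
            continue
        }
        if next == cfg {
            fmt.Println("nothing changed in", path)
            continue
        }
        cfg = next
        fmt.Println("found changes in", path, "; applying these changes")
    }
}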

f41gh7 commented 3 years ago

Also, in case the solution above doesn't help: I've made a build with logging of vmagent discovery, which should help identify the source of the issue with the missing target labels. May I ask you to run it and provide the log file?

Docker image: victoriametrics/vmagent:heads-vmagent-logging-0-g6b9080c03; it's built from the branch https://github.com/VictoriaMetrics/VictoriaMetrics/tree/vmagent-discovery-logging

valyala commented 3 years ago

FYI, the bugfix related to missing SIGHUP handling has been added in vmagent v1.60.0.

morimoto-cybozu commented 3 years ago

@f41gh7 @valyala I executed my CI test 8 times with v1.60.0 and got 8 successes and 0 failures. Great! I really appreciate your consideration and support!

I got the following two types of logs. In both cases, I found that the desired targets were detected.

Type 1: 'found changes in "/etc/vmagent/config_out/vmagent.env.yaml"'

...
2021-05-26T14:51:43.732Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:378  started node watcher for "https://10.68.0.1:443/api/v1/nodes"
2021-05-26T14:51:43.785Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:467  reloaded 7 objects from "https://10.68.0.1:443/api/v1/nodes"; updated=0, removed=0, added=7, resourceVersion="10672"
2021-05-26T14:51:43.786Z        info    VictoriaMetrics/lib/promscrape/config.go:69     started service discovery routines in 3.115 seconds
2021-05-26T14:51:43.868Z        info    VictoriaMetrics/lib/promscrape/scraper.go:359   kubernetes_sd_configs: added targets: 131, removed targets: 0; total targets: 131
2021-05-26T14:51:45.046Z        info    VictoriaMetrics/lib/promscrape/scraper.go:126   SIGHUP received; reloading Prometheus configs from "/etc/vmagent/config_out/vmagent.env.yaml"
2021-05-26T14:51:45.046Z        info    VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:148      SIGHUP received; reloading relabel configs pointed by -remoteWrite.relabelConfig and -remoteWrite.urlRelabelConfig
2021-05-26T14:51:45.046Z        info    VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:155      Successfully reloaded relabel configs
2021-05-26T14:51:45.072Z        info    VictoriaMetrics/lib/promscrape/scraper.go:133   nothing changed in "/etc/vmagent/config_out/vmagent.env.yaml"
2021-05-26T14:51:46.052Z        info    VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:148      SIGHUP received; reloading relabel configs pointed by -remoteWrite.relabelConfig and -remoteWrite.urlRelabelConfig
2021-05-26T14:51:46.052Z        info    VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:155      Successfully reloaded relabel configs
2021-05-26T14:51:46.052Z        info    VictoriaMetrics/lib/promscrape/scraper.go:126   SIGHUP received; reloading Prometheus configs from "/etc/vmagent/config_out/vmagent.env.yaml"
2021-05-26T14:51:46.071Z        info    VictoriaMetrics/lib/promscrape/config.go:74     stopping service discovery routines...
2021-05-26T14:51:46.071Z        info    VictoriaMetrics/lib/promscrape/config.go:78     stopped service discovery routines in 0.000 seconds
2021-05-26T14:51:46.071Z        info    VictoriaMetrics/lib/promscrape/config.go:65     starting service discovery routines...
2021-05-26T14:51:46.076Z        info    VictoriaMetrics/lib/promscrape/config.go:69     started service discovery routines in 0.005 seconds
2021-05-26T14:51:46.077Z        info    VictoriaMetrics/lib/promscrape/scraper.go:162   found changes in "/etc/vmagent/config_out/vmagent.env.yaml"; applying these changes
2021-05-26T14:51:46.143Z        info    VictoriaMetrics/lib/promscrape/scraper.go:359   kubernetes_sd_configs: added targets: 1, removed targets: 0; total targets: 132
...

Type 2: 'nothing changed in "/etc/vmagent/config_out/vmagent.env.yaml"'

...
2021-05-26T14:52:56.960Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:378  started pod watcher for "https://10.68.0.1:443/api/v1/namespaces/external-dns/pods"
2021-05-26T14:52:57.060Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:467  reloaded 0 objects from "https://10.68.0.1:443/api/v1/namespaces/external-dns/pods"; updated=0, removed=0, added=0, resourceVersion="10438"
2021-05-26T14:52:57.061Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:378  started pod watcher for "https://10.68.0.1:443/api/v1/namespaces/logging/pods"
2021-05-26T14:52:57.163Z        info    VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:148      SIGHUP received; reloading relabel configs pointed by -remoteWrite.relabelConfig and -remoteWrite.urlRelabelConfig
2021-05-26T14:52:57.163Z        info    VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:155      Successfully reloaded relabel configs
2021-05-26T14:52:57.298Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:467  reloaded 36 objects from "https://10.68.0.1:443/api/v1/namespaces/logging/pods"; updated=0, removed=0, added=36, resourceVersion="10442"
2021-05-26T14:52:57.299Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:378  started pod watcher for "https://10.68.0.1:443/api/v1/namespaces/metallb-system/pods"
2021-05-26T14:52:57.393Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:467  reloaded 5 objects from "https://10.68.0.1:443/api/v1/namespaces/metallb-system/pods"; updated=0, removed=0, added=5, resourceVersion="10444"
...
2021-05-26T14:52:57.590Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:378  started node watcher for "https://10.68.0.1:443/api/v1/nodes"
2021-05-26T14:52:57.647Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:467  reloaded 7 objects from "https://10.68.0.1:443/api/v1/nodes"; updated=0, removed=0, added=7, resourceVersion="10448"
2021-05-26T14:52:57.648Z        info    VictoriaMetrics/lib/promscrape/config.go:69     started service discovery routines in 5.893 seconds
2021-05-26T14:52:57.648Z        info    VictoriaMetrics/lib/promscrape/scraper.go:126   SIGHUP received; reloading Prometheus configs from "/etc/vmagent/config_out/vmagent.env.yaml"
2021-05-26T14:52:57.702Z        info    VictoriaMetrics/lib/promscrape/scraper.go:133   nothing changed in "/etc/vmagent/config_out/vmagent.env.yaml"
2021-05-26T14:52:57.755Z        info    VictoriaMetrics/lib/promscrape/scraper.go:359   kubernetes_sd_configs: added targets: 129, removed targets: 0; total targets: 129
...
valyala commented 3 years ago

Thanks for the good news! Closing the issue as fixed.