VictoriaMetrics / operator

Kubernetes operator for Victoria Metrics

Broken vmagent config in case of using seriesLimit in VMNodeScrape for kubelet #986

Closed. dglushenok closed this issue 1 day ago.

dglushenok commented 2 weeks ago

Hello.

I'm using victoria-metrics-k8s-stack version 0.23.2, which is bundled with operator version v0.45.0.

When seriesLimit is specified for kubelet in the values.yaml of victoria-metrics-k8s-stack like this:

kubelet:
  enabled: true

  # -- Enable scraping /metrics/cadvisor from kubelet's service
  cadvisor: true
  # -- Enable scraping /metrics/probes from kubelet's service
  probes: true
  # spec for VMNodeScrape crd
  # https://docs.victoriametrics.com/operator/api.html#vmnodescrapespec
  spec:
    scheme: "https"
    honorLabels: true
    interval: "30s"
    scrapeTimeout: "5s"
    tlsConfig:
      insecureSkipVerify: true
      caFile: "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
    bearerTokenFile: "/var/run/secrets/kubernetes.io/serviceaccount/token"
    # drop high cardinality label and useless metrics for cadvisor and kubelet
    metricRelabelConfigs:
      - action: labeldrop
        regex: (uid)
      - action: labeldrop
        regex: (id|name)
      - action: drop
        source_labels: [__name__]
        regex: (rest_client_request_duration_seconds_bucket|rest_client_request_duration_seconds_sum|rest_client_request_duration_seconds_count)
    relabelConfigs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - sourceLabels: [__metrics_path__]
        targetLabel: metrics_path
      - targetLabel: "job"
        replacement: "kubelet"
    # ignore timestamps of cadvisor's metrics by default
    # more info here https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4697#issuecomment-1656540535
    honorTimestamps: false
    seriesLimit: 180000

vmagent starts to crash with the following error:

2024-06-21T12:39:10.634Z        fatal   VictoriaMetrics/lib/promscrape/scraper.go:116   cannot read "/etc/vmagent/config_out/vmagent.env.yaml": cannot parse Prometheus config from "/etc/vmagent/config_out/vmagent.env.yaml": cannot unmarshal data: yaml: unmarshal errors:
  line 49861: field series_limit already set in type promscrape.ScrapeConfig
  line 49897: field series_limit already set in type promscrape.ScrapeConfig
  line 49934: field series_limit already set in type promscrape.ScrapeConfig; pass -promscrape.config.strictParse=false command-line flag for ignoring unknown fields in yaml config

/etc/vmagent/config_out/vmagent.env.yaml contains the following sections, each with series_limit defined twice:

- job_name: nodeScrape/vm/vm-victoria-metrics-k8s-stack-cadvisor/0
  honor_labels: true
  honor_timestamps: false
  kubernetes_sd_configs:
  - role: node
  scrape_interval: 30s
  scrape_timeout: 5s
  metrics_path: /metrics/cadvisor
  series_limit: 180000
  scheme: https
  tls_config:
    insecure_skip_verify: true
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels:
    - __meta_kubernetes_node_name
    target_label: node
  - target_label: job
    replacement: vm/vm-victoria-metrics-k8s-stack-cadvisor
  - regex: __meta_kubernetes_node_label_(.+)
    action: labelmap
  - source_labels:
    - __metrics_path__
    target_label: metrics_path
  - target_label: job
    replacement: kubelet
  series_limit: 180000
  metric_relabel_configs:
  - regex: (uid)
    action: labeldrop
  - regex: (id|name)
    action: labeldrop
  - source_labels:
    - __name__
    regex: (rest_client_request_duration_seconds_bucket|rest_client_request_duration_seconds_sum|rest_client_request_duration_seconds_count)
    action: drop
- job_name: nodeScrape/vm/vm-victoria-metrics-k8s-stack-kubelet/1
  honor_labels: true
  honor_timestamps: false
  kubernetes_sd_configs:
  - role: node
  scrape_interval: 30s
  scrape_timeout: 5s
  series_limit: 180000
  scheme: https
  tls_config:
    insecure_skip_verify: true
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels:
    - __meta_kubernetes_node_name
    target_label: node
  - target_label: job
    replacement: vm/vm-victoria-metrics-k8s-stack-kubelet
  - regex: __meta_kubernetes_node_label_(.+)
    action: labelmap
  - source_labels:
    - __metrics_path__
    target_label: metrics_path
  - target_label: job
    replacement: kubelet
  series_limit: 180000
  metric_relabel_configs:
  - regex: (uid)
    action: labeldrop
  - regex: (id|name)
    action: labeldrop
  - source_labels:
    - __name__
    regex: (rest_client_request_duration_seconds_bucket|rest_client_request_duration_seconds_sum|rest_client_request_duration_seconds_count)
    action: drop
- job_name: nodeScrape/vm/vm-victoria-metrics-k8s-stack-probes/2
  honor_labels: true
  honor_timestamps: false
  kubernetes_sd_configs:
  - role: node
  scrape_interval: 30s
  scrape_timeout: 5s
  metrics_path: /metrics/probes
  series_limit: 180000
  scheme: https
  tls_config:
    insecure_skip_verify: true
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels:
    - __meta_kubernetes_node_name
    target_label: node
  - target_label: job
    replacement: vm/vm-victoria-metrics-k8s-stack-probes
  - regex: __meta_kubernetes_node_label_(.+)
    action: labelmap
  - source_labels:
    - __metrics_path__
    target_label: metrics_path
  - target_label: job
    replacement: kubelet
  series_limit: 180000
  metric_relabel_configs:
  - regex: (uid)
    action: labeldrop
  - regex: (id|name)
    action: labeldrop
  - source_labels:
    - __name__
    regex: (rest_client_request_duration_seconds_bucket|rest_client_request_duration_seconds_sum|rest_client_request_duration_seconds_count)
    action: drop

This looks like a bug: the operator emits series_limit a second time in every generated node scrape job, which strict parsing rejects. A possible interim workaround is sketched below.
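Until the operator is fixed, one workaround (my assumption, not confirmed by the maintainers) is to simply drop seriesLimit from the kubelet VMNodeScrape spec so the generated config parses again. The quoted error also suggests -promscrape.config.strictParse=false, but it is unclear whether non-strict parsing tolerates the duplicated field, so removing the limit seems safer. Sketch of the values.yaml change, with the rest of the spec unchanged:

kubelet:
  enabled: true
  cadvisor: true
  probes: true
  spec:
    scheme: "https"
    honorLabels: true
    # ... interval, tlsConfig, relabel configs unchanged ...
    honorTimestamps: false
    # seriesLimit: 180000   # temporarily removed until the operator stops duplicating series_limit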

Haleygo commented 1 day ago

The fix is included in v0.46.1, closing as completed. Feel free to reopen if there are further questions.
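For anyone hitting this through the victoria-metrics-k8s-stack chart, a sketch of pinning the bundled operator to the fixed release via values.yaml (assuming the victoria-metrics-operator subchart exposes image.tag and that your chart version is compatible with operator v0.46.1; upgrading the whole chart to a release that bundles v0.46.1 or newer is the cleaner option):

# values.yaml for victoria-metrics-k8s-stack (sketch, unverified)
victoria-metrics-operator:
  image:
    tag: v0.46.1   # operator release that contains the seriesLimit duplication fix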