vmagent: error while pushing to thanos-receiver #3875

Open · hagen1778 opened this issue 1 year ago

hagen1778 commented 1 year ago

Describe the bug

While using vmagent to push metrics to thanos-receiver, the latter outputs the following error:

{"caller":"writer.go:167","component":"receive-writer","level":"warn","msg":"Error on series with out-of-order labels","numDropped":1325,"tenant":"default-tenant","ts":"2023-02-24T11:00:54.25131079Z"}

To Reproduce

"I tried to send metrics from VMAgent to Thanos Receive as I wanna do some performance benchmarking between usuing VMAgent and Prometheus. In Thanos Receive I get the following error messages, when trying to send metrics from VMAgent via remote write URL"

Additional information

The label sorting requirement is enforced by Thanos receiver here. Prometheus doesn't enforce it yet.

vmagent doesn't sort labels by default, but this behavior can be changed by passing the -sortLabels command-line flag:

 -sortLabels
     Whether to sort labels for incoming samples before writing them to all the configured remote storage systems. This may be needed for reducing memory usage at remote storage when the order of labels in incoming samples is random. For example, m{k1="v1",k2="v2"} may be sent as m{k2="v2",k1="v1"}. Enabled sorting for labels can slow down ingestion performance a bit.

The flag was introduced in 1.58:

FEATURE: vminsert and vmagent: add -sortLabels command-line flag for sorting metric labels before pushing them to vmstorage. This should reduce the size of MetricName -> internal_series_id cache (aka vm_cache_size_bytes{type="storage/tsid"}) when ingesting samples for the same time series with distinct order of labels. For example, foo{k1="v1",k2="v2"} and foo{k2="v2",k1="v1"} represent a single time series. Labels sorting is disabled by default, since the majority of established exporters preserve the order of labels for the exported metrics.
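
For reference, a minimal sketch of passing the flag to vmagent (the binary path, config path and remote write URL below are placeholders):

./vmagent-prod \
  -promscrape.config=/path/to/scrape-config.yml \
  -remoteWrite.url=http://thanos-receive:10908/api/v1/receive \
  -sortLabels
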
valyala commented 1 year ago

In addition to out-of-order labels, Thanos, Cortex, Mimir and Prometheus may reject samples from vmagent with out-of-order timestamps. This is because vmagent writes data to the configured remote storage via multiple concurrent connections. Samples for the same time series may be sent concurrently via multiple such connections, so a sample with a newer timestamp can be delivered before a sample with an older timestamp. This results in an out-of-order samples error at Prometheus, Thanos, Cortex and Mimir. It can be fixed by running vmagent with the -remoteWrite.queues=1 command-line flag, which instructs vmagent to use only a single connection to the configured -remoteWrite.url for sending the data. See vmagent troubleshooting for details.
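
For example, a sketch of adding that flag to a vmagent invocation (paths and URL are placeholders):

./vmagent-prod \
  -remoteWrite.url=http://thanos-receive:10908/api/v1/receive \
  -sortLabels \
  -remoteWrite.queues=1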

carlosrmendes commented 9 months ago

I'm running into the same issue, with the -sortLabels flag set... This issue has nothing to do with out-of-order samples @valyala, especially since I have set --tsdb.out-of-order.time-window on the Thanos receiver.

This issue is about out-of-order labels, which does not make sense with the -sortLabels flag set. I started to get the error Error on series with out-of-order labels on version v1.96.0; previously, on version v1.89.1, the error did not occur and vmagent could write all the metrics to Thanos receiver successfully.

carlosrmendes commented 9 months ago

I figured out that the --remoteWrite.label labels are now being added at the end of the metric labels, without being sorted together with the original labels. Why did this behavior change? Can it be reverted?
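
To illustrate the suspected behavior (hypothetical metric and label values):

# labels as sorted by -sortLabels:
node_cpu_seconds_total{cpu="0",mode="idle"}
# after -remoteWrite.label=device=test_1 is appended at the end:
node_cpu_seconds_total{cpu="0",mode="idle",device="test_1"}
# "device" sorts alphabetically before "mode", so the receiver sees out-of-order labels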

carlosrmendes commented 9 months ago

On version v1.93.0, with the -sortLabels and -remoteWrite.label flags set, it works as expected. On version v1.93.1 the out-of-order labels error starts occurring, so -sortLabels stopped working as intended: it no longer includes the remoteWrite labels in the sort.

hagen1778 commented 9 months ago

@Amper @valyala I think the issue is related to the following change https://github.com/VictoriaMetrics/VictoriaMetrics/commit/a27c2f37731986f4bf6738404bb6388b1f42ffde

Shall we sort labels once again if -remoteWrite.label was applied?

valyala commented 8 months ago

Shall we sort labels once again if -remoteWrite.label was applied?

It is better from a performance PoV to sort labels only once, just before sending them to the remote storage.

sourcehawk commented 5 months ago

I am experiencing this in v1.99 using docker-compose locally. What is the solution here?

vm-agent         | 2024-04-03T14:57:32.117Z error VictoriaMetrics/app/vmagent/remotewrite/client.go:444   sending a block with size 21432 bytes to "1:secret-url" was rejected (skipping the block): status code 409; response body: store locally for endpoint : add 702 series: out of order labels
thanos-receiver    | ts=2024-04-03T14:57:32.11231392Z caller=writer.go:238 level=info component=receive component=receive-writer tenant=default-tenant msg="Error on series with out-of-order labels" numDropped=702

vmagent + exporter configuration:

version: '3.5'

services:
  node_exporter:
    image: quay.io/prometheus/node-exporter:v1.7.0
    container_name: node-exporter
    command:
      - --path.rootfs=/host
      # Metrics endpoint at http://localhost:9100/metrics
      - --web.listen-address=:9100
      # Enable systemd collector to monitor autossh, postgres etc.
      - --collector.systemd
      - --collector.systemd.unit-include=^(autossh|postgresql)
      # Enable wifi collector to monitor simcard etc.
      - --collector.wifi
    security_opt:
      # Required to access systemd
      - apparmor:unconfined
    network_mode: host
    pid: host
    restart: unless-stopped
    volumes:
      - /:/host:ro,rslave
      - /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket

  vm_agent:
    image: victoriametrics/vmagent:v1.99.0
    container_name: vm-agent
    command:
      - -promscrape.config=/vmagent.config.yml
      # This should be http://<edge-server-ip>:10908/api/v1/receive
      - -remoteWrite.url=http://localhost:10908/api/v1/receive
      # Set persistent volume to store data
      - -remoteWrite.tmpDataPath=/vmagent/data
      # Set unique labels for device
      - -remoteWrite.label=device=test_1
      # Needed for thanos to accept the data
      - -sortLabels
    network_mode: host
    restart: unless-stopped
    volumes:
      - ./vmagent.config.yml:/vmagent.config.yml:ro
      - ./vmagent/data:/vmagent/data

Thanos receiver config:

version: '3.5'

services:
  thanos_receiver:
    image: quay.io/thanos/thanos:v0.34.1
    container_name: thanos-receiver
    command: >
      receive
      --tsdb.path="/thanos/data"
      --tsdb.retention=14d
      --label=stage='"production"'
      --label=cluster='"staging"'
      --label=region='"eu-west-1"'
      --label=receive_replica='"0"'
      --grpc-address="0.0.0.0:10907"
      --http-address="0.0.0.0:10909"
      --remote-write.address="0.0.0.0:10908"
      --objstore.config-file="/bucket.yml"
    network_mode: host
    pid: host
    restart: unless-stopped
    volumes:
      - ./thanos/data:/thanos/data:rw
      - ./bucket.yml:/bucket.yml:ro
sourcehawk commented 5 months ago

Never mind, the solution is to not use the -remoteWrite.label flag at all... It works fine with global.external_labels in the promscrape config.
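
For reference, a sketch of that workaround (the label value and scrape target are taken from the compose files above; the exact config layout is an assumption):

# vmagent.config.yml
global:
  external_labels:
    device: test_1
scrape_configs:
  - job_name: node_exporter
    static_configs:
      - targets: ['localhost:9100']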

scanfield-openai commented 3 days ago

Is this issue being worked on? I notice that https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5874 was just closed and not merged.

I am using the workaround in the promscrape config, but that's obviously brittle. Would you accept a PR here / do you know why https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5874 was closed?