Closed — laixintao closed this issue 2 months ago
> `vmagent_remotewrite_conn_bytes_written_total` seems even less than before
- The VM remote-write protocol with zstd compression was introduced in v1.88.0.
- Since v1.88.0, `vmagent` sends a handshake request to `vminsert` at the start-up phase if no protocol is specified via command-line flags.

> vmagents remote write speed decrease ...
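The handshake fallback described in the points above can be sketched as a toy model (this is illustrative only, not vmagent's actual implementation; the protocol names and function are invented):

```python
def negotiate(remote_supports_vm_protocol, forced=None):
    """Toy model of picking the remote-write wire protocol.
    If no protocol is forced via a flag, probe the remote with a handshake."""
    if forced is not None:
        return forced  # an explicit command-line flag skips the handshake
    # the handshake result decides between the VM protocol and plain Prometheus
    return "vm+zstd" if remote_supports_vm_protocol else "prometheus+snappy"

print(negotiate(True))                        # vm+zstd
print(negotiate(False))                       # prometheus+snappy
print(negotiate(True, "prometheus+snappy"))   # the flag wins
```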
I guess the issue might occur on the remote-write target(s). It would be helpful to have:
- Some screenshots of the remote-write target(s)' status. I have seen a similar case happen with `vmagent` when our `vmstorage` could not keep up with large amounts of data and slow inserts went up to ~80%.
```
# I recommend finding these troubleshooting queries on https://grafana.com/orgs/victoriametrics/dashboards
max(
  rate(vm_slow_row_inserts_total{job=~"$job_storage"}[$__rate_interval])
  / rate(vm_rows_added_to_storage_total{job=~"$job_storage"}[$__rate_interval])
)
```
- (Since it recovered quickly,) logs from `vmstorage` (and possibly `vminsert`) to see if something went wrong.
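The slow-inserts query above is just `rate(slow) / rate(total)`. A worked example with made-up counter values shows what a pathological ratio looks like:

```python
# Two scrapes of the two counters, 60 s apart (values are made up)
interval = 60
slow_t0, slow_t1 = 1_000, 4_000        # vm_slow_row_inserts_total
total_t0, total_t1 = 10_000, 14_000    # vm_rows_added_to_storage_total

slow_rate = (slow_t1 - slow_t0) / interval
total_rate = (total_t1 - total_t0) / interval
ratio = slow_rate / total_rate
print(f"{ratio:.0%}")  # 75% -- in the same ballpark as the ~80% case mentioned above
```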
Thank you so much for your information.
> vm remote-write protocol with zstd compression is introduced in v1.88.0.

That explains the bandwidth reduction. However, can I confirm that `-remoteWrite.rateLimit` limits the rate after compression?
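Whatever the answer to the question above, it helps to see why the ordering matters: if a byte-rate limiter sits after the compressor, it only observes wire bytes, which can be far fewer than the raw payload. A minimal sketch (the class name and setup are illustrative, not vmagent code):

```python
import zlib

class CountingSink:
    """Stands in for a byte-rate limiter on the network side:
    it only sees the bytes that actually go out on the wire."""
    def __init__(self):
        self.bytes_seen = 0
    def write(self, chunk):
        self.bytes_seen += len(chunk)

raw = b'metric{label="value"} 1\n' * 1000
sink = CountingSink()
sink.write(zlib.compress(raw))  # compress first, then the "limiter" counts output

print(len(raw), sink.bytes_seen)  # the limiter sees far fewer bytes than raw
```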
Also, it seems it's not slow inserts. The metrics this cluster collects are stable; as far as I know, a lookup only happens when a metric is inserted for the first time, so if there is no huge change, there should not be any slow queries or slow inserts.
Also, no logs were found from vminsert and vmstorage.
From the metrics of vmstorage, nothing seems wrong; only the source reduced its ingestion speed.
(I don't think it's a vmstorage issue, because only 3 out of 6 vmagents had the problem; if it were a vmstorage issue, all vmagents should have trouble sending data.)
Thanks again for your information!
> the `-remoteWrite.rateLimit` is for limiting the rate after compress, right?
> I don't think it's vmstorage's issue, because only 3 out of 6 vmagents got the issue; if it were vmstorage's issue, all vmagents should have trouble sending data.
This makes sense, and the slow inserts look absolutely fine.
I'm not able to locate the root cause for you right now. If both `vmstorage` and `vminsert` are fine, then some metrics from `vmagent` may help.
Since it's not reproducible, I recommend checking the `vmagent` dashboard to see if any issue happens with the scrape targets. For example, some scrape targets may be down, so some of your vmagents failed to retrieve the metrics (at different times).
Thanks for the info.
> Since it's not reproducible, I recommend checking the dashboard of vmagent to see if any issue happens for the scrape targets. For example, some scrape targets are down so some of your vmagent failed to retrieve the metrics (at different times).
Targets should be OK, as the scraped rows did not change, and vmagent's local pending data on disk was increasing, suggesting that the metrics were scraped but could not be sent to the remote.
It happened again today. I suspect it is a vmagent issue, because after restarting, vmagent behaves OK.
Some abnormal vmagent panels I have noticed:
- Push delay increased.
- The target's unique labels changed, but I suspect that's not true, as I inspected the target's /metrics path and nothing seemed to have changed at that time.

After the unique samples decreased, vmagent didn't recover: it kept pending data locally and the push delay remained high. After restarting, everything became normal.
@laixintao Thank you for the extra monitoring metrics. Did you see any req rate/traffic changes in the remote-write panels? e.g.
```
# the same promql as you mentioned in the issue
sum(rate(vmagent_remotewrite_conn_bytes_written_total{job=~"$job", instance=~"$instance"}[$__rate_interval])) by (job, pod) > 0

# check the response status code with this one
sum(rate(vmagent_remotewrite_requests_total{job=~"$job", instance=~"$instance", url=~"$url"}[$__rate_interval])) by (job, url, status_code, pod) > 0
```
> After the unique samples decrease, vmagent didn't recover,
I would like to share my thoughts here. The first direction I am considering is whether there are some limitations on your network, such as blocking all requests larger than a certain size (e.g., xx MiB). In such cases, vmagent might encounter failures in sending these (big) requests and continue buffering and retrying them. In this scenario, reducing the number of unique samples won't address the issue of retrying requests.
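The buffer-and-retry scenario described above can be simulated in a few lines (the size limit and request sizes are made up):

```python
from collections import deque

MAX_REQUEST_BYTES = 1_000_000  # hypothetical network-level size limit

def send(request_size):
    # oversized requests are always rejected by the (hypothetical) middlebox
    return request_size <= MAX_REQUEST_BYTES

queue = deque([500_000, 2_000_000, 300_000])  # pending request sizes in bytes
for _ in range(10):  # retry rounds
    for _ in range(len(queue)):
        size = queue.popleft()
        if not send(size):
            queue.append(size)  # keep buffering and retrying

print(list(queue))  # the oversized request is still stuck, no matter how many retries
```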
In this case, you should have some error logs from `vmagent`, as well as some abnormal metrics via the PromQLs above.
> after restarting, everything became normal.
May I confirm with you how `vmagent` is deployed (e.g. what flags are used, especially those related to persistence)? Is it a StatefulSet or a Deployment? Will it load the persistent queue after a restart? If it's deployed as a Deployment, it could lose the retry queue, so everything might appear to be back to normal.
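The point about losing the retry queue can be illustrated with a toy on-disk buffer (the file layout here is invented, not vmagent's actual persistent queue format):

```python
import os
import tempfile

def enqueue(path, data):
    """Append pending data to a queue file under the given directory."""
    with open(os.path.join(path, "queue.bin"), "ab") as f:
        f.write(data)

def load_on_restart(path):
    """Reload whatever survived the restart."""
    qfile = os.path.join(path, "queue.bin")
    if not os.path.exists(qfile):
        return b""  # ephemeral storage (e.g. a plain Deployment): queue is gone
    with open(qfile, "rb") as f:
        return f.read()

persistent = tempfile.mkdtemp()            # volume that survives the "restart"
enqueue(persistent, b"pending samples")
print(load_on_restart(persistent))         # b'pending samples'
print(load_on_restart(tempfile.mkdtemp())) # b'' -- fresh volume, retries lost
```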
For the second metric, same.
For networking, I think it's fine: they are in the same IDC, and the path is vmagent -> vminsert (on the same server as vmagent) -> vmstorage, with no proxy in the middle, so it's pretty simple. I have checked the logs; there are still no logs from vmagent, only some errors requesting HTTP SD, but that should not be a problem, as vmagent should keep using the targets from the last successful SD.
> Is it a StatefulSet or Deployment?

Sorry, I am not sure what this is; they are deployed on bare-metal servers.
> Will it load the persistent queue after a restart?

Yes, all cached data was loaded and sent to vmstorage; no data was lost.
I have upgraded those vmagents to v1.101 (latest) to see if they still have this issue.
Thanks for more info.
Sorry I could not help with this issue. In case I'm going in the wrong direction, it would be appreciated if we could get some input from the maintainers @f41gh7 :) thanks
> BUGFIX: downgrade Go builder from 1.22.0 to 1.21.7, since 1.22.0 contains the bug, which can lead to deadlocked HTTP
I think this was exactly the issue, since vmagent communicates with vminserts over HTTP. If connections get deadlocked one by one, you'd see a gradual ingestion delay. From the vminsert perspective, it should look like the number of active TCP connections decreases over time.
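A toy model of that failure mode (all numbers made up) shows why the slowdown is gradual rather than a cliff:

```python
# Connections deadlock one by one, so aggregate send throughput
# decays step by step instead of dropping all at once.
PER_CONN_RATE_MB = 10  # MB/s per healthy connection, arbitrary
active_conns = 6
throughput = []
for hour in range(4):
    throughput.append(active_conns * PER_CONN_RATE_MB)
    active_conns -= 1  # one more connection hits the deadlock
print(throughput)  # [60, 50, 40, 30]
```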
I recommend updating to the latest LTS https://docs.victoriametrics.com/changelog/#v19314 or to upstream versions.
Thanks for confirmation! I agree this is exactly the issue!
Version changes and remote_write_connections:
Thanks!
(btw, I think we need to add a warning to the changelog of 1.93.12 here https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.93.12 , cc @valyala )
Describe the bug
Yesterday, I received the alert "RemoteWriteConnectionIsSaturated", suggesting that the data vmagent scrapes is larger than its sending speed. So I changed `-remoteWrite.rateLimit=50000000` to `-remoteWrite.rateLimit=80000000` at point A of this picture, and also upgraded vmagent from 1.82.1 to 1.93.12. Problem solved. (But from the monitoring, `vmagent_remotewrite_conn_bytes_written_total` seems even lower than before. Is it because the VictoriaMetrics remote write protocol is enabled by default in the new version?) Then at point B, the issue occurred again. At point C, I updated the config to `-remoteWrite.rateLimit=100000000` and restarted vmagent; problem solved.
To Reproduce
It happened once this morning, so I cannot reproduce it.
Version
1.93.12
Logs
No errors from vmagent stdout.
Screenshots
No response
Used command-line flags
No response
Additional information
I have searched the release logs from 1.82.1 -> 1.93.12 and didn't see any obvious bugfix related to this, only in 1.93.13: