Closed — laixintao closed this issue 2 months ago
> `vmagent_remotewrite_conn_bytes_written_total` seems even less than before
- The VM remote-write protocol with zstd compression was introduced in v1.88.0.
- Since v1.88.0, `vmagent` sends a handshake request to `vminsert` at the start-up phase if no protocol is specified via command-line flags.

> vmagents remote write speed decrease ...
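The handshake fallback described in the points above can be sketched as a toy model (this is illustrative only, not vmagent's actual implementation; the protocol names and function are invented):

```python
def negotiate(remote_supports_vm_protocol, forced=None):
    """Toy model of picking the remote-write wire protocol.
    If no protocol is forced via a flag, probe the remote with a handshake."""
    if forced is not None:
        return forced  # an explicit command-line flag skips the handshake
    # the handshake result decides between the VM protocol and plain Prometheus
    return "vm+zstd" if remote_supports_vm_protocol else "prometheus+snappy"

print(negotiate(True))                        # vm+zstd
print(negotiate(False))                       # prometheus+snappy
print(negotiate(True, "prometheus+snappy"))   # the flag wins
```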
I guess the issue might occur on the remote-write target(s). It would be helpful to have:
- Some screenshots of the remote-write target(s)' status. I have seen a similar case happen with `vmagent` when our `vmstorage` could not keep up with large amounts of data and slow inserts went up to ~80%.
```
# I recommend finding these troubleshooting queries on https://grafana.com/orgs/victoriametrics/dashboards
max(
  rate(vm_slow_row_inserts_total{job=~"$job_storage"}[$__rate_interval])
  / rate(vm_rows_added_to_storage_total{job=~"$job_storage"}[$__rate_interval])
)
```
- (Since it recovered quickly,) logs from `vmstorage` (and possibly `vminsert`) to see if something went wrong.
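The slow-inserts query above is just `rate(slow) / rate(total)`. A worked example with made-up counter values shows what a pathological ratio looks like:

```python
# Two scrapes of the two counters, 60 s apart (values are made up)
interval = 60
slow_t0, slow_t1 = 1_000, 4_000        # vm_slow_row_inserts_total
total_t0, total_t1 = 10_000, 14_000    # vm_rows_added_to_storage_total

slow_rate = (slow_t1 - slow_t0) / interval
total_rate = (total_t1 - total_t0) / interval
ratio = slow_rate / total_rate
print(f"{ratio:.0%}")  # 75% -- in the same ballpark as the ~80% case mentioned above
```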
Thank you so much for your information.
> vm remote-write protocol with zstd compression is introduced in v1.88.0.

That explains the bandwidth reduction. However, can I confirm that `-remoteWrite.rateLimit` limits the rate after compression?
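Whatever the answer to the question above, it helps to see why the ordering matters: if a byte-rate limiter sits after the compressor, it only observes wire bytes, which can be far fewer than the raw payload. A minimal sketch (the class name and setup are illustrative, not vmagent code):

```python
import zlib

class CountingSink:
    """Stands in for a byte-rate limiter on the network side:
    it only sees the bytes that actually go out on the wire."""
    def __init__(self):
        self.bytes_seen = 0
    def write(self, chunk):
        self.bytes_seen += len(chunk)

raw = b'metric{label="value"} 1\n' * 1000
sink = CountingSink()
sink.write(zlib.compress(raw))  # compress first, then the "limiter" counts output

print(len(raw), sink.bytes_seen)  # the limiter sees far fewer bytes than raw
```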
Also, it seems it's not slow inserts. The metrics this cluster collects are stable; as far as I know, a lookup only happens when a metric is inserted for the first time, so if there is no huge change, there should not be any slow queries or slow inserts.
Also, no logs were found from vminsert and vmstorage.
From the metrics of vmstorage, nothing seems wrong; only the source reduced its ingestion speed.
(I don't think it's a vmstorage issue, because only 3 out of 6 vmagents had the problem; if it were a vmstorage issue, all vmagents should have trouble sending data.)
Thanks again for your information!
> the `-remoteWrite.rateLimit` is for limiting the rate after compress, right?
> I don't think it's vmstorage's issue, because only 3 out of 6 vmagents got the issue; if it were vmstorage's issue, all vmagents should have trouble sending data.
This makes sense, and the slow inserts look absolutely fine.
I'm not able to locate the root cause for you right now. If both `vmstorage` and `vminsert` are fine, then some metrics from `vmagent` may help.
Since it's not reproducible, I recommend checking the `vmagent` dashboard to see if any issue happens with the scrape targets. For example, some scrape targets may be down, so some of your vmagents failed to retrieve the metrics (at different times).
Thanks for the info.
> Since it's not reproducible, I recommend checking the dashboard of vmagent to see if any issue happens for the scrape targets. For example, some scrape targets are down so some of your vmagent failed to retrieve the metrics (at different times).
Targets should be OK, as the scraped rows did not change, and vmagent's local pending data on disk was increasing, suggesting that the metrics were scraped but could not be sent to the remote.
It happened again today. I suspect it is a vmagent issue, because after restarting, vmagent behaves OK.
Some abnormal vmagent panels I have noticed:
- Push delay increased.
- The target's unique labels changed, but I suspect that's not true, as I inspected the target's /metrics path and nothing seemed to have changed at that time.

After the unique samples decreased, vmagent didn't recover: it kept pending data locally and the push delay remained high. After restarting, everything became normal.
@laixintao Thank you for the extra monitoring metrics. Did you see any req rate/traffic changes in the remote-write panels? e.g.
```
# the same promql as you mentioned in the issue
sum(rate(vmagent_remotewrite_conn_bytes_written_total{job=~"$job", instance=~"$instance"}[$__rate_interval])) by (job, pod) > 0

# check the response status code with this one
sum(rate(vmagent_remotewrite_requests_total{job=~"$job", instance=~"$instance", url=~"$url"}[$__rate_interval])) by (job, url, status_code, pod) > 0
```
> After the unique samples decrease, vmagent didn't recover,
I would like to share my thoughts here. The first direction I am considering is whether there are some limitations on your network, such as blocking all requests larger than a certain size (e.g., xx MiB). In such cases, vmagent might encounter failures in sending these (big) requests and continue buffering and retrying them. In this scenario, reducing the number of unique samples won't address the issue of retrying requests.
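The buffer-and-retry scenario described above can be simulated in a few lines (the size limit and request sizes are made up):

```python
from collections import deque

MAX_REQUEST_BYTES = 1_000_000  # hypothetical network-level size limit

def send(request_size):
    # oversized requests are always rejected by the (hypothetical) middlebox
    return request_size <= MAX_REQUEST_BYTES

queue = deque([500_000, 2_000_000, 300_000])  # pending request sizes in bytes
for _ in range(10):  # retry rounds
    for _ in range(len(queue)):
        size = queue.popleft()
        if not send(size):
            queue.append(size)  # keep buffering and retrying

print(list(queue))  # the oversized request is still stuck, no matter how many retries
```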
In this case, you should have some error logs from `vmagent`, as well as some abnormal metrics via the PromQLs above.
> after restarting, everything became normal.
May I confirm with you how `vmagent` is deployed (e.g. what flags are used, especially those related to persistence)? Is it a StatefulSet or a Deployment? Will it load the persistent queue after a restart? If it's deployed as a Deployment, it could lose the retry queue, so everything might appear to be back to normal.
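The point about losing the retry queue can be illustrated with a toy on-disk buffer (the file layout here is invented, not vmagent's actual persistent queue format):

```python
import os
import tempfile

def enqueue(path, data):
    """Append pending data to a queue file under the given directory."""
    with open(os.path.join(path, "queue.bin"), "ab") as f:
        f.write(data)

def load_on_restart(path):
    """Reload whatever survived the restart."""
    qfile = os.path.join(path, "queue.bin")
    if not os.path.exists(qfile):
        return b""  # ephemeral storage (e.g. a plain Deployment): queue is gone
    with open(qfile, "rb") as f:
        return f.read()

persistent = tempfile.mkdtemp()            # volume that survives the "restart"
enqueue(persistent, b"pending samples")
print(load_on_restart(persistent))         # b'pending samples'
print(load_on_restart(tempfile.mkdtemp())) # b'' -- fresh volume, retries lost
```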
For the second metric, same.
For networking, I think it's fine: they are in the same IDC, and the path is vmagent -> vminsert (on the same server as vmagent) -> vmstorage, with no proxy in the middle, so it's pretty simple. I have checked the logs; there are still no logs from vmagent, only some errors requesting HTTP SD, but that should not be a problem, as vmagent should keep using the targets from the last successful SD.
> Is it a StatefulSet or Deployment?

Sorry, I am not sure what this is; they are deployed on bare-metal servers.
> Will it load the persistent queue after a restart?

Yes, all cached data was loaded and sent to vmstorage; no data was lost.
I have upgraded those vmagents to v1.101 (latest) to see if they still have this issue.
Thanks for more info.
Sorry I could not help with this issue. In case I'm going in the wrong direction, it would be appreciated if we could get some input from the maintainers @f41gh7 :) thanks
> BUGFIX: downgrade Go builder from 1.22.0 to 1.21.7, since 1.22.0 contains the bug, which can lead to deadlocked HTTP
I think this was exactly the issue, since vmagent communicates with vminserts over HTTP. If connections get deadlocked one by one, you'd see a gradual ingestion delay. From the vminsert perspective, it should look like the number of active TCP connections decreases over time.
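A toy model of that failure mode (all numbers made up) shows why the slowdown is gradual rather than a cliff:

```python
# Connections deadlock one by one, so aggregate send throughput
# decays step by step instead of dropping all at once.
PER_CONN_RATE_MB = 10  # MB/s per healthy connection, arbitrary
active_conns = 6
throughput = []
for hour in range(4):
    throughput.append(active_conns * PER_CONN_RATE_MB)
    active_conns -= 1  # one more connection hits the deadlock
print(throughput)  # [60, 50, 40, 30]
```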
I recommend updating to the latest LTS https://docs.victoriametrics.com/changelog/#v19314 or to upstream versions.
Thanks for confirmation! I agree this is exactly the issue!
Version changes and remote_write_connections:
Thanks!
(btw, I think we need to add a warning to the changelog of 1.93.12 here https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.93.12 , cc @valyala )
Describe the bug
Yesterday, I received the alert "RemoteWriteConnectionIsSaturated", suggesting that the data vmagent scrapes is larger than its sending speed. So I changed `-remoteWrite.rateLimit=50000000` to `-remoteWrite.rateLimit=80000000` at point A of this picture, and also upgraded vmagent from 1.82.1 to 1.93.12. Problem solved. (But from the monitoring, `vmagent_remotewrite_conn_bytes_written_total` seems even lower than before. Is it because the VictoriaMetrics remote write protocol is enabled by default in the new version?) Then at point B, the issue occurred again. At point C, I updated the config to `-remoteWrite.rateLimit=100000000` and restarted vmagent; problem solved.
To Reproduce
It happened once this morning, so I cannot reproduce it.
Version
1.93.12
Logs
No errors from vmagent stdout.
Screenshots
No response
Used command-line flags
No response
Additional information
I have searched the release logs from 1.82.1 -> 1.93.12 and didn't see any obvious bugfix related to this, only in 1.93.13: