VictoriaMetrics / VictoriaMetrics

VictoriaMetrics: fast, cost-effective monitoring solution and time series database
https://victoriametrics.com/
Apache License 2.0

vmstorage: spike in churn rate leads to increased IO and system slowdown #7229

Open adgarr opened 2 weeks ago

adgarr commented 2 weeks ago

Is your question request related to a specific component?

victoriametrics

Describe the question in detail

vmstorage-prod --version vmstorage-20240201-152249-tags-v1.93.11-cluster-0-g891a9f951 -search.maxUniqueTimeseries=4000000

A high volume of queries causes increased read I/O on the storage nodes, leading to a surge in slow inserts. How can this problem be solved?

(dashboard screenshots attached)

Troubleshooting docs

zekker6 commented 2 weeks ago

Hello @adgarr Based on the graphs you've provided, it looks like there is a spike in churn rate around 09:30. A spike in churn rate requires the TSDB to register a large set of new time series. Series registration requires performing an index scan in order to create a unique identifier for each metric. In this case, the IO rate is increased due to the high rate of index scans required during the churn rate spike.

There are a few options to mitigate this:

  • increase the amount of memory available to vmstorage nodes - this will allow using more memory for caching data blocks after they are fetched from disk, thus reducing the IO impact.
  • update to a version after the v1.97.0 release, as v1.97.0 included an improvement which speeds up series registration:

FEATURE: improve new time series registration speed on systems with high number of CPU cores. Thanks to @misutoth for the initial idea and implementation.
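For reference, a minimal sketch for confirming the churn rate from the TSDB's own metrics, assuming the vm_new_timeseries_created_total metric and a vmselect on its default port 8481 (the host name and tenant 0 are placeholders):

    # Rate of new time series registrations, summed across vmstorage nodes
    curl -s 'http://<vmselect-host>:8481/select/0/prometheus/api/v1/query' \
      --data-urlencode 'query=sum(rate(vm_new_timeseries_created_total[5m]))'

A sustained increase of this value around 09:30 would match the churn rate spike visible on the dashboards.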

adgarr commented 1 week ago


Hello @zekker6, I have 12 storage nodes, and each node has 256GB of memory.

The relevant configuration for the storage service is as follows:

    vmstorage-prod \
      -storageDataPath=/disk/ssd/victoria-metrics/data \
      -logNewSeries \
      -retentionPeriod=12 \
      -search.maxTagKeys=5000000 \
      -search.maxTagValues=5000000 \
      -search.maxUniqueTimeseries=4000000 \
      -memory.allowedBytes=110GB \
      -storage.cacheSizeIndexDBDataBlocks=25GB \
      -storage.cacheSizeIndexDBIndexBlocks=15GB

The memory usage is around 50%. (memory usage screenshot attached)

So which cache parameter should I increase on the vmstorage nodes: cacheSizeIndexDBDataBlocks or cacheSizeIndexDBIndexBlocks?

But the official recommendation is to keep memory usage around 50%, just like mine currently is. Should I adjust the memory size?

zekker6 commented 1 week ago

@adgarr Do you have any other software running on the same nodes as vmstorage? If vmstorage is the only application running there, then it would be safe to remove the memory limit set by -memory.allowedBytes=110GB and also remove the custom cache size configuration.

This will allow vmstorage to use more memory for caching, and it will still be safe, since vmstorage limits the amount of memory used for caches to 60% of overall memory by default.

By default, the cache size for -storage.cacheSizeIndexDBDataBlocks is calculated as 25% of the allowed memory (60% of overall node memory), which would be ~38.4GB with these limits removed. That should help handle spikes in churn rate.
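As a rough sanity check of that number, the arithmetic for a 256GB node looks like this:

    # allowed memory is 60% of overall node memory by default;
    # 25% of it goes to the indexdb data blocks cache
    awk 'BEGIN { node=256; allowed=node*0.60; cache=allowed*0.25;
                 printf "allowed=%.1fGB dataBlocksCache=%.1fGB\n", allowed, cache }'
    # -> allowed=153.6GB dataBlocksCache=38.4GB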

> But the official recommendation is to keep memory usage around 50%, just like mine currently is. Should I adjust the memory size?

For large vmstorage nodes it is safe to target higher memory utilization, as there is still a lot of headroom to accommodate sudden spikes in the ingestion rate. You can safely target 60-70% memory usage, but you need to have alerts in place to notify you in case memory usage grows further. Please see these alerting rules as a good starting point for cluster monitoring.
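A minimal sketch for keeping an eye on per-node memory utilization, assuming vmstorage metrics are scraped with a job="vmstorage" label (the label and host are placeholders) and 256GB nodes:

    # Fraction of the 256GB node memory used by each vmstorage process
    curl -s 'http://<vmselect-host>:8481/select/0/prometheus/api/v1/query' \
      --data-urlencode 'query=max(process_resident_memory_bytes{job="vmstorage"}) by (instance) / (256 * 1024^3)'

The official alerting rules referenced above remain the better starting point.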

adgarr commented 1 week ago

@zekker6 The storage nodes are not running any other applications. I will adjust the configuration a bit and update the version, then check the results. Thank you again!

adgarr commented 1 week ago

@zekker6 I updated the application from v1.93 to v1.103 in the testing environment around 3 PM and found that performance seems to have worsened. The number of pending datapoints has increased, and the merge speed has decreased, as shown in the image below.

(dashboard screenshot attached)

zekker6 commented 1 week ago

@adgarr These changes are expected after the upgrade. Starting from v1.97.0, new entries in IndexDB are written in the background, and these datapoints are counted as pending. This change also allows reducing the number of samples that need to be merged, since data can be buffered more effectively.

In general, in order to check performance it is better to look at CPU, memory and disk usage and at request processing latencies. Changes in metrics like pending datapoints may be related to internal changes and can be misleading in some cases. Also, it seems like you're using quite an old version of the monitoring dashboard. It would be better to get the latest version, as it covers more metrics and has the most up-to-date information about metrics and charts.

adgarr commented 1 week ago

@zekker6 Hello, I updated the production environment from v1.93 to v1.103 last night, removed the cache configuration from the storage nodes, and updated the monitoring dashboard. However, after the update there were frequent occurrences of a large number of slow inserts, as shown below:

(dashboard screenshots attached)

How can I avoid this situation?
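A minimal sketch for quantifying the share of slow inserts per node, assuming the vm_slow_row_inserts_total and vm_rows_inserted_total metrics exposed by vmstorage (host and tenant are placeholders):

    # Fraction of rows that hit the slow insert path on each vmstorage node
    curl -s 'http://<vmselect-host>:8481/select/0/prometheus/api/v1/query' \
      --data-urlencode 'query=sum(rate(vm_slow_row_inserts_total[5m])) by (instance) / sum(rate(vm_rows_inserted_total[5m])) by (instance)'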

zekker6 commented 1 week ago

@adgarr Based on the graphs, it seems like the slow insert spikes are caused by re-routing of time series between storage nodes. Could you check whether the storage nodes were restarted at that time? Also, could you share the list of command-line flags used for vminsert?

adgarr commented 1 week ago

@zekker6 The storage nodes have not been restarted, but data has been rerouted away from multiple storage nodes. I guess it might be due to the high CPU usage on these storage nodes, which led to data being rerouted to other nodes, but I'm not sure.

The following is the configuration of my storage nodes:

    vmstorage-prod \
      -storageDataPath=/disk/ssd/victoria-metrics/data \
      -logNewSeries \
      -retentionPeriod=12 \
      -search.maxTagKeys=5000000 \
      -search.maxTagValues=5000000 \
      -search.maxUniqueTimeseries=4000000 \
      -http.maxGracefulShutdownDuration=30s \
      -memory.allowedPercent=65

The following is the configuration of my insert nodes:

    vminsert-prod \
      -storageNode=xxx1:8400 \
      -storageNode=xxx2:8400 \
      -storageNode=xxx3:8400 \
      -storageNode=xxx4:8400 \
      -storageNode=xxx5:8400 \
      -storageNode=xxx6:8400 \
      -storageNode=xxx7:8400 \
      -storageNode=xxx8:8400 \
      -storageNode=xxx9:8400 \
      -storageNode=xxx10:8400 \
      -storageNode=xxx11:8400 \
      -storageNode=xxx12:8400 \
      -maxLabelValueLen=500 \
      -replicationFactor=2 \
      -disableRerouting=false \
      -insert.maxQueueDuration=10m0s \
      -maxConcurrentInserts=30000 \
      -maxInsertRequestSize=335544320 \
      -maxLabelsPerTimeseries=300

The storage service is functioning normally, but rerouting is triggered occasionally. The CPU usage of a specific storage node is very high while rerouting is active. At that time, the CPU iowait is normal, the disk read/write latency is very low, and the IOPS is also very low. What could be the reason for the high CPU usage?

(dashboard screenshots attached)

zekker6 commented 1 week ago

@adgarr Thank you for the details. Could you also share the panel named "Storage reachability" for the time when rerouting happens? It seems like vminsert treats the node as unavailable, and that triggers rerouting of series from this storage node to the other ones. Rerouting then triggers new series registration and a spike in slow inserts.
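If the panel is not handy, a minimal sketch for pulling the same information from vminsert's own metrics, assuming the vm_rpc_vmstorage_is_reachable metric and the default vminsert port 8480:

    # 1 means vminsert considers the vmstorage node reachable, 0 means it is treated as unavailable
    curl -s 'http://<vminsert-host>:8480/metrics' | grep vm_rpc_vmstorage_is_reachable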

It would be great if you could gather a CPU profile in order to understand what is happening on the storage node during these spikes. Please see the docs for details on how to gather a CPU profile: https://docs.victoriametrics.com/cluster-victoriametrics/#profiling
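A minimal sketch of gathering the profile, assuming the default vmstorage -httpListenAddr port 8482:

    # Collect a 30-second CPU profile from the overloaded vmstorage node while the spike is happening
    curl -s 'http://<vmstorage-host>:8482/debug/pprof/profile' > cpu.pprof

    # Optionally inspect it locally with the Go toolchain
    go tool pprof cpu.pprof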

adgarr commented 4 days ago

@zekker6

Today I encountered this type of issue: the CPU usage on this node spiked, and data was rerouted to other nodes. The storage service on this node is functioning normally.

cpu.tar.gz

(dashboard screenshots attached)

zekker6 commented 4 days ago

@adgarr It seems like this is expected behavior during churn rate spikes. vminsert detects that some vmstorage nodes are overloaded because they are accepting new datapoints at a slower rate, and re-routes data from these storage nodes to other available ones. This is useful in order to avoid slowing down the overall ingestion speed.

You can disable this behavior by setting -disableRerouting=true:

-disableRerouting Whether to disable re-routing when some of vmstorage nodes accept incoming data at slower speed compared to other storage nodes. Disabled re-routing limits the ingestion rate by the slowest vmstorage node. On the other side, disabled re-routing minimizes the number of active time series in the cluster during rolling restarts and during spikes in series churn rate. See also -disableReroutingOnUnavailable and -dropSamplesOnOverload (default true)

As a drawback, the ingestion speed will be limited by the speed of the slowest storage node.
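For illustration, a sketch of the change in the vminsert invocation shown earlier (only a few of the twelve -storageNode flags are repeated here; the remaining flags stay as they are):

    vminsert-prod \
      -storageNode=xxx1:8400 \
      -storageNode=xxx2:8400 \
      -storageNode=xxx12:8400 \
      -replicationFactor=2 \
      -disableRerouting=true

The -disableReroutingOnUnavailable and -dropSamplesOnOverload flags mentioned in the description above control related behavior and may be worth reviewing as well.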

adgarr commented 4 days ago

@zekker6 What causes a particular node to suddenly become overloaded? I prefer not to disable rerouting. Each storage node holds 8TB of data; could it be that querying this node requires retrieving a lot of data, leading to the overload? However, I noticed that the disk I/O on that storage node does not change significantly. If I add more storage nodes and reduce the amount of data stored on each node, would that be better? Would having more storage nodes significantly affect querying and writing data?