VictoriaMetrics / VictoriaMetrics

VictoriaMetrics: fast, cost-effective monitoring solution and time series database
https://victoriametrics.com/
Apache License 2.0

Only one vmstorage node CPU suddenly becomes high in the cluster #3425

Closed · petanne closed this issue 9 months ago

petanne commented 1 year ago

Describe the bug: Our cluster has 12 vmstorage nodes, each with 48 cores, 384 GB of memory, and plenty of free disk (only 3% used), plus 4 vmselect, 4 vminsert, 4 vmagent, and 1 vmalert. Only vmalert makes read requests. But one vmstorage node's CPU suddenly becomes high without any human operation; neither insert requests nor search requests increased. When we stop vmalert, it recovers. I have replaced the machine and double-checked, so it should not be a hardware problem.

To Reproduce: Starting vmalert again does nothing. But I have another way to reproduce it: add a new vmalert and start it, and one vmstorage node's CPU becomes high quickly, within about 1 minute. Then stop the problematic vmstorage, and another vmstorage's CPU quickly becomes high. When the second problematic vmstorage is stopped, yet another vmstorage reproduces the problem.

Expected behavior: vmstorage CPU usage stays normal.

Logs: no error or warning logs in vmstorage.

Screenshots: stats (screenshot attached)

vmstorage CPU (screenshot attached)

A test case: stop the problematic vmstorage, and it looks like at least one vmstorage's CPU must go high? (screenshot attached)

I think maybe something is wrong in vmstorage, because its insert and search requests are balanced and consistent across nodes.

rate(vm_vminsert_metrics_read_total[1m]) (screenshot attached)

rate(vm_vmselect_rpc_requests_total[1m]) (screenshot attached)

Other vmstorage metrics (screenshots attached)

Go runtime metrics of the problematic vmstorage (screenshots attached)

pprof and trace captured while CPU was high: pprof_and_trace.zip

Version

vmstorage-20221014-073231-tags-v1.82.1-cluster-0-gb942dd77c

Used command-line flags

/app/vmstorage/vmstorage-prod -memory.allowedPercent=60 -search.maxUniqueTimeseries=30000000 -dedup.minScrapeInterval=30s -retentionPeriod=2y -storage.minFreeDiskSpaceBytes=10GB -storageDataPath=/data/vmstorage-data
hagen1778 commented 1 year ago

Hi @petanne! Thanks for the detailed report!

Our cluster has 12 vmstorage nodes, each with 48 cores, 384 GB of memory, and plenty of free disk (only 3% used)

How many vmstorage nodes are on the same instance? Are there any other heavy processes running on the same instance? Could you please show the RSS memory usage on the problematic storage node and its memory limit?

max_over_time(process_resident_memory_bytes{}[10m])

vm_available_memory_bytes{}
petanne commented 1 year ago

Hi @hagen1778! Thanks for the reply! There is one vmstorage process on each physical machine, and there are no other heavy processes.

Red is the problematic time range. (screenshot attached)

Some metrics from node_exporter. (screenshots attached)

I have 10 Prometheus nodes, with different targets and alert rules on each Prometheus; they have been running for more than a year. In the VictoriaMetrics cluster, both the targets and the alert rules are the same as in Prometheus.

Could some particular metrics or query cause this problem?

hagen1778 commented 1 year ago

Could some particular metrics or query cause this problem?

It could. But I'd expect it to happen only when vmstorage experiences memory shortage. Usually, such an extensive and long CPU increase is a sign of page cache thrashing, which happens due to memory shortage or because some other process on the instance is doing write/read-heavy operations and polluting the cache. But according to your screenshots, vmstorage uses about 10% of the available memory, so this shouldn't be the case.

Do you have the following panels on your dashboard?

(screenshot attached)

They might have slightly different names, but they can still be used to show how heavy the queries were at that time. If you don't have these panels, please find them here.

petanne commented 1 year ago

Do you have the following panels on your dashboard?

Red is the problematic time range. (screenshot attached)

Another test case: I tried increasing -search.maxConcurrentRequests on vmselect, but the problem still occurs. (screenshot attached)

I found in pprof that the lib/storage.(*TSID).Less method is called more times than normal. What would cause this?

petanne commented 1 year ago

We tried setting -search.maxUniqueTimeseries on vmstorage to 100000, but the problem still occurs.

hagen1778 commented 1 year ago

I found in pprof that the lib/storage.(*TSID).Less method is called more times than normal. What would cause this?

More times, or does it just take more time? This method is supposed to be at the top of CPU usage for all vmstorage nodes, since it is responsible for filtering data parts during lookups. The more read pressure there is, the higher the impact on parts filtering.

Could you please run the following query for problematic storage?

max(
    vm_cache_size_bytes{job=~"$job_storage", instance=~"$instance"}
    /
    vm_cache_size_max_bytes{job=~"$job_storage", instance=~"$instance"}
) by(type)

It is supposed to show cache usage % by type for the vmstorage node.

petanne commented 1 year ago

More times, or does it just take more time?

In the problematic time range, lib/storage.(*TSID).Less is 30% flat on the problematic vmstorage, but about 5% on the other vmstorage nodes.

This is the cache usage % by type per vmstorage node; 13.vmstorage is the problematic node, while 1.vmstorage's CPU is normal. All other vmstorage nodes look like 1.vmstorage.

max(vm_cache_size_bytes{job=~"$job", instance=~"$instance"} / vm_cache_size_max_bytes{job=~"$job", instance=~"$instance"}) by (instance, type)

(screenshot attached)

But this query says vm_cache_size_bytes uses only a small amount of memory.

sum(vm_cache_size_bytes{job=~"$job", instance=~"$instance"}) by(type)
sum(vm_allowed_memory_bytes{job=~"$job", instance=~"$instance"})

(screenshot attached)

I found in the source code that the default value of -storage.cacheSizeIndexDBTagFilters is int(float64(memory.Allowed()) / 32). Could sum(vm_cache_size_bytes) reaching the value of vm_cache_size_max_bytes be the cause of the problem?
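For reference, with the flags listed above (-memory.allowedPercent=60 on a 384 GB machine), and assuming memory.Allowed() is simply total RAM multiplied by -memory.allowedPercent, that default works out to roughly:

    memory.Allowed()        ≈ 384 GB × 0.60 ≈ 230 GB
    tagFilters cache limit  ≈ 230 GB / 32   ≈ 7.2 GB per vmstorage node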

But the tagFilters cache size is the same on all 12 vmstorage nodes, so it doesn't seem to explain why only one node has the problem.

vm_cache_size_bytes{type="indexdb/tagFilters"}

(screenshot attached)

valyala commented 1 year ago

The commit 8e9822bc7f0f0591555be4faa76dd5af431e2000 optimizes the hotspot in the TSID.Less() function. @petanne , could you try building vmstorage from this commit according to these instructions and verify whether the TSID.Less() call disappears from the CPU profile collected from the updated vmstorage? It is safe to run vmstorage from this commit in a cluster with other v1.82.1 components.
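For reference, building from that commit would look roughly like this (the make target name is assumed from the project's usual Makefile layout, so follow the linked build instructions if it differs):

    git clone https://github.com/VictoriaMetrics/VictoriaMetrics.git
    cd VictoriaMetrics
    git checkout 8e9822bc7f0f0591555be4faa76dd5af431e2000
    make vmstorage-prod   # assumed target name; should produce bin/vmstorage-prod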

petanne commented 1 year ago

@valyala, thanks! It worked!

The TSID.Less() flat time in the CPU pprof is 3.87% now; it was about 5% before under normal conditions. (screenshot attached)

Looking forward to the release. This time the sudden problem occurred at 2022-12-01 00:05 UTC, when the partition for this month was created. Could you tell me why, and will it show up again next month?

VictoriaMetrics is great! Where can I learn about partitions and blocks? What happens during a search request in vmstorage?

petanne commented 1 year ago

@valyala @hagen1778 Bad news: the problem happened again, and other problems arose while we were doing stress tests and outage drills.

Version

vmstorage: 8e9822b
vmagent: v1.84.0-cluster
vminsert: v1.84.0-cluster
vmselect: v1.84.0-cluster
vmalert: v1.84.0-cluster

Timeline

Yesterday: set -search.disableCache=true on vmselect, and changed -evaluationInterval from 30s to 10s on vmalert. It worked fine until the next operation.

17:45 Rebooted the 13.vmstorage server. 13.vmstorage used 90% of the CPU from the moment the process started, until we stopped both vmalert processes about 20 minutes later. This is question 1.

18:30 Started the 2 vmalert processes. Everything was OK.

19:18 Rebooted the 1.vmstorage server. 13.vmstorage CPU usage increased to 90% immediately. This is question 2.

19:21 The 1.vmstorage process auto-started, but its CPU usage increased to 90% immediately, and 13.vmstorage CPU usage returned to normal. This is question 3.

19:18 ~ 19:26 vmagent and vminsert network traffic gradually dropped to zero, then gradually recovered; this one is more serious. This is question 4.

19:32 Stopped all vmalert processes. 1.vmstorage CPU usage returned to normal, and vmagent and vminsert network traffic doubled. It looks like ingestion was being blocked by 1.vmstorage. This is question 5.

19:42 Changed -evaluationInterval from 10s back to 30s, then started the 2 vmalert processes. 13.vmstorage CPU usage fluctuated frequently between 15% and 30%, while the other vmstorage nodes behaved normally until 22:35. This is question 6.

22:35 Nobody did anything, but 13.vmstorage CPU burst to 90%, as described at the beginning of this issue. This is question 7.

Dashboard

Red is 13.vmstorage. Green is 1.vmstorage. Blue is 2.vmstorage, a normal example for comparison. (screenshot attached)

Network traffic for one vmagent node. (screenshot attached)

Expectation

All VM components keep working smoothly when any single machine goes down.


I have read the official documentation and part of the source code, but still can't answer the above 7 questions. Maybe I'm about to give up on VictoriaMetrics.

hagen1778 commented 1 year ago

This is question 1.

Does pprof show the same issue?

Just for context: we discussed the issue with @valyala before he submitted a fix. The fix doesn't address the root cause; it is a workaround for an issue we suspect happens in your setup. If you take a look at the profile you provided initially, you'd find that the majority of the time is spent on a simple comparison operation:

         .          .    104:func (t *TSID) Less(b *TSID) bool {
         .          .    105:   // Do not compare MetricIDs here as fast path for determining identical TSIDs,
         .          .    106:   // since identical TSIDs aren't passed here in hot paths.
   393.32s    399.16s    107:   if t.AccountID != b.AccountID {
         .          .    108:           return t.AccountID < b.AccountID
         .          .    109:   }

While this operation is supposed to be very lightweight, it takes too much time in your case. The assumption is that b *TSID isn't in the CPU cache, and the process takes extra time to fetch it from memory. We don't know why this happens yet; it is a pretty rare situation across all the installations we have had a chance to investigate. So far, we think that adding more storage shards could help stabilize the situation. You can significantly reduce the allocated resources per node in favor of using more instances with fewer resources for the vmstorage layer.
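To make the point concrete, here is a minimal, self-contained Go sketch; the struct below is simplified for illustration and is not the actual lib/storage.TSID layout. The comparison itself is just a couple of integer checks, so nearly all of the time attributed to the first if line in the profile is really the cost of loading *b from main memory when it misses the CPU cache.

    package main

    import "fmt"

    // tsidLike is a simplified stand-in for a TSID-style identifier,
    // used only to illustrate the shape of the comparison discussed above.
    type tsidLike struct {
        AccountID uint32
        ProjectID uint32
        MetricID  uint64
    }

    // less compares the fields in order. The first access of *b is where a
    // cache miss gets charged in a CPU profile, because that is the point at
    // which the struct has to be fetched from memory.
    func (t *tsidLike) less(b *tsidLike) bool {
        if t.AccountID != b.AccountID {
            return t.AccountID < b.AccountID
        }
        if t.ProjectID != b.ProjectID {
            return t.ProjectID < b.ProjectID
        }
        return t.MetricID < b.MetricID
    }

    func main() {
        a := &tsidLike{MetricID: 1}
        b := &tsidLike{MetricID: 2}
        fmt.Println(a.less(b)) // prints: true
    }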

petanne commented 1 year ago

Does pprof show the same issue?

Sorry for not preserving the scene and saving a pprof. But I have another pprof showing the same problem for question 6: 13.cpu.pprof.12.6.zip (screenshot attached)

I will try to capture a pprof when the CPU hits 90% again.

Based on our capacity assessment and machine resources, we use 12 vmstorage servers, each with 48 cores, 384 GB of memory, and 15 TB of disk. Although the machine configuration is high, the data volume and usage rate are low for now. I'm sorry, but there are no spare machines to swap in at this time.

I am not using multi-tenancy; every AccountID is the default value of 0. Could this cause some unexpected behavior?

petanne commented 1 year ago

New test case

Starting only one vmalert with -evaluationInterval=10s puts 13.vmstorage CPU at 90%. storage.(*TSID).Less called from storage.(*partSearch).nextBHS is the biggest CPU consumer again.

13.error.cpu.pprof.12.9.zip

github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.(*partSearch).nextBHS
/root/VictoriaMetrics/lib/storage/tsid.go
  Total:     177.22s    178.12s (flat, cum) 14.76%
    107      169.88s    170.75s           if t.AccountID != b.AccountID {

Does nextBHS need optimization for Less?

func (ps *partSearch) nextBHS() bool {
    for len(ps.metaindex) > 0 {
        // Optimization: skip tsid values smaller than the minimum value
        // from ps.metaindex.
        for ps.BlockRef.bh.TSID.Less(&ps.metaindex[0].TSID) {
            if !ps.nextTSID() {
                return false
            }
        }

As a supplementary note: in the above case, the replicas of the series data stored on 13.vmstorage reside on 1.vmstorage.

I want to know why so many data blocks are retrieved on 13.vmstorage. I checked all the alert queries and found nothing suspicious.

hagen1778 commented 1 year ago

Based on our capacity assessment and machine resources, we use 12 vmstorage servers, each with 48 cores, 384 GB of memory, and 15 TB of disk.

What I'm saying is that you could replace a 48-core machine with 3 machines of 16 cores each, and if the assumption about the CPU cache is correct, the problem would be solved. But I understand that could be complicated when not running in the cloud.

Could this cause some unexpected behavior?

No.

I want to know why so many data blocks are retrieved on 13.vmstorage.

It is unclear. We haven't observed such situations before, even with much bigger load and less hardware. Could you please share an example of the alert queries you run?

I'd also wait for @valyala's response here, since he has a better understanding of this. He's in the US time zone and will hopefully be able to respond later.

petanne commented 1 year ago

Yes, we use physical machines, not the cloud. Sorry, there are not enough machines to replace 48 cores with 3 × 16 cores.

Could you please share an example of the alert queries you run?

Here are some typical alert queries. They have been running in our Prometheus for many years.

up{port="9100"} == 0
sum(increase(log_monitor_counter{level="err"}[1m])) >= 10 
or 
log_monitor_counter{level="err"} >= 10 
    unless log_monitor_counter{level="err"} offset 1m 
    unless log_monitor_counter{level="err"} offset 10m
sum(rate(nginx_http_upstream_request_times_count{endpoint=~".*xxxx"}[1m])) / sum(rate(nginx_http_upstream_request_times_count{endpoint=~".*xxxx"}[1m] offset 1d)) < 0.8 
and 
sum(rate(nginx_http_upstream_request_times_count{endpoint=~".*xxxx"}[1m])) / sum(rate(nginx_http_upstream_request_times_count{endpoint=~".*xxxx"}[1m] offset 1w)) < 0.8
rate(is_machine_alive_change{}[3m]) < 0.002 and (hour() > 19  or hour() < 16)
hagen1778 commented 1 year ago

set -search.disableCache=true on vmselect

Why would you do that? Disabling caches also increases the amount of data block scanning work. Normally, some previously read blocks are cached in memory to reduce lookups.

17:45 Rebooted the 13.vmstorage server. 13.vmstorage used 90% of the CPU from the moment the process started, until we stopped both vmalert processes about 20 minutes later. This is question 1.

It could be that after the restart the caches on 13.vmstorage didn't fill up enough to remain efficient. That usually doesn't take much time, but here we assume the load on the node is too high (for a reason unknown so far).

19:18 Rebooted the 1.vmstorage server. 13.vmstorage CPU usage increased to 90% immediately. This is question 2.

When one storage node goes down, vminserts re-route data from the "dead" storage to the other "healthy" storages. Re-routing always increases the load on the remaining nodes. And if our theory of vmstorage-running-on-the-edge is correct, this additional load only increases the probability of entering that state.

19:21 The 1.vmstorage process auto-started, but its CPU usage increased to 90% immediately, and 13.vmstorage CPU usage returned to normal. This is question 3.

It is likely the same situation as in question 1: rebuilding the caches under read query load. 13.vmstorage stopped receiving additional load from re-routing and went back to normal.

19:18 ~ 19:26 vmagent and vminsert network traffic gradually dropped to zero, then gradually recovered; this one is more serious. This is question 4.

There are two types of re-routing in vm cluster:

  1. when vmstorage becomes unreachable (goes off or network split)
  2. when vmstorage becomes slow (performance degradation) - this type of re-routing is disabled by default. We recommend enabling it if the cluster contains > 10 storage nodes.

In your case, re-routing of type 2 was disabled. So when one vmstorage became slow, the ingestion speed was limited by the slowest vmstorage node in the cluster. You need to enable this re-routing by setting -disableRerouting=false on the vminsert nodes; see the sketch below.
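For example, the vminsert command line could look roughly like the following (the binary path and the -storageNode address list are placeholders for this setup; the relevant change is -disableRerouting=false):

    /app/vminsert/vminsert-prod \
        -storageNode=<vmstorage-1>:8400,<vmstorage-2>:8400,...,<vmstorage-N>:8400 \
        -disableRerouting=false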

19:32 Stopped all vmalert processes. 1.vmstorage CPU usage returned to normal, and vmagent and vminsert network traffic doubled. It looks like ingestion was being blocked by 1.vmstorage. This is question 5.

The read load was reduced, so 1.vmstorage went back to processing incoming data normally.

19:42 Changed -evaluationInterval from 10s back to 30s, then started the 2 vmalert processes. 13.vmstorage CPU usage fluctuated frequently between 15% and 30%, while the other vmstorage nodes behaved normally until 22:35. This is question 6. 22:35 Nobody did anything, but 13.vmstorage CPU burst to 90%, as described at the beginning of this issue. This is question 7.

Back to my theory: I assume the vmstorage nodes in this setup are for some reason running on the edge in terms of compute resources. We have seen similar behaviour, but only for setups where memory, CPU, or disk were very close to 90-99% utilization. Your case is different: according to the provided data, there are plenty of unused resources. So the only assumption so far is CPU cache issues. When the read load crosses the "edge", vmstorage falls into a vicious circle where normal operations start to take too much time. This is when it enters 90% CPU utilization.

Let's try to verify this theory by checking the CPU cache miss ratio during normal and faulty operation. Go processes do not have access to this info, so I'd like to ask you to take the measurements yourself. Please take a look at this discussion. This comment provides an example of how to measure L3 cache misses; it would be great if you could take the measurements before and during the incident. Please note the sleep command at the end; it sets the measurement interval. It would be better to extend it during measurements.

Yes, we use physical machines, not the cloud. Sorry, there are not enough machines to replace 48 cores with 3 × 16 cores.

Understandable. If that's an option, you could try running the test once again in the cloud, on the cloud's hardware. You can rule out a hardware issue or try different topologies by using the cloud, even if only temporarily.

As I said earlier, we haven't observed an issue like yours before. So if you're using some kind of benchmark for generating the ingested data, let us know so we can try to reproduce it.

Here are some typical alert queries. They have been running in our Prometheus for many years.

The queries look normal. Thank you!

petanne commented 1 year ago

Thank you for your careful answer.

Why would you do that? Disabling caches also increases the amount of data block scanning work. Normally, some previously read blocks are cached in memory to reduce lookups.

That was in order to stress-test vmstorage. The cache was enabled again before this test.

Command for system-wide L3 cache miss rate

$ perf stat -a -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetch-misses sleep 10

On 13.vmstorage, while the CPU is at 90%.

 Performance counter stats for 'system wide':

     1,672,017,968      LLC-loads                                                     (50.02%)
       566,498,839      LLC-load-misses           #   33.88% of all LL-cache hits     (50.01%)
       307,252,131      LLC-stores                                                    (50.01%)
       130,520,462      LLC-store-misses                                              (50.01%)
   <not supported>      LLC-prefetch-misses

      10.009098222 seconds time elapsed

On 2.vmstorage, at the same time.

 Performance counter stats for 'system wide':

     1,289,752,696      LLC-loads                                                     (50.00%)
       378,444,294      LLC-load-misses           #   29.34% of all LL-cache hits     (50.00%)
       323,470,831      LLC-stores                                                    (50.00%)
       107,277,991      LLC-store-misses                                              (50.00%)
   <not supported>      LLC-prefetch-misses

      10.001611238 seconds time elapsed

For comparison, the following is when only one vmalert process is started and the CPU is normal. On 13.vmstorage.

 Performance counter stats for 'system wide':

     1,149,413,385      LLC-loads                                                     (50.00%)
       336,088,891      LLC-load-misses           #   29.24% of all LL-cache hits     (50.00%)
       310,156,805      LLC-stores                                                    (50.00%)
       100,309,852      LLC-store-misses                                              (50.00%)
   <not supported>      LLC-prefetch-misses

      10.001550442 seconds time elapsed

On 2.vmstorage.

 Performance counter stats for 'system wide':

     1,398,281,187      LLC-loads                                                     (50.00%)
       341,360,772      LLC-load-misses           #   24.41% of all LL-cache hits     (50.00%)
       295,972,084      LLC-stores                                                    (50.00%)
        93,261,754      LLC-store-misses                                              (50.00%)
   <not supported>      LLC-prefetch-misses

      10.001704804 seconds time elapsed

How can I see how many data blocks were scanned by a query? I can't find it in the query tracing output.
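(For context, the tracing referred to here is the per-query trace that vmselect returns when trace=1 is appended to the query API; the host, port, and tenant below are placeholders for this setup.)

    curl 'http://<vmselect-host>:8481/select/0/prometheus/api/v1/query?query=up&trace=1'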

petanne commented 1 year ago

From the start of this issue until now, we have been using vmctl to import Prometheus historical data, with the import rate limited. There was only a short period in between when no import was running.

Import command

/app/vmctl/vmctl-prod 
    prometheus 
    --vm-addr=http://vminsert_addr/ 
    --prom-snapshot=/data/transit/p14/prometheus/snapshots/20221206T035732Z-6aeb8e98f4c7f371/ 
    --vm-concurrency=48 
    --vm-batch-size=20480000 
    --prom-concurrency=48 
    --prom-filter-time-end=2022-11-08T00:07:01Z 
    --vm-account-id=0 
    --vm-rate-limit=30720000 
    -s=true

This command is executed on a separate new machine. I don't know if this will make a difference.

petanne commented 1 year ago

So if you're using some kind of benchmark for generating the ingested data, let us know so we can try to reproduce it.

Sorry, there is no benchmark data. The time series are scraped by vmagent from our production environment.

petanne commented 1 year ago

Hi @valyala, could you help analyze this?

valyala commented 1 year ago

@petanne , could you try building vmstorage from the commit fbeebe4869892a9dd3c684da5fbaa0a6c511a651 and verify whether this helps resolve the issue? See the build instructions. It is safe to run vmstorage from this commit with other cluster components built from older commits (e.g. previous releases).

valyala commented 1 year ago

Where can I learn about partitions and blocks? What happens during a search request in vmstorage?

Please take a look at https://docs.victoriametrics.com/#storage and https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3268#issuecomment-1291249751

petanne commented 1 year ago

@petanne , could you try building vmstorage from the commit fbeebe4 and verify whether this helps resolve the issue?

I built fbeebe4, and the problem still occurs. (screenshot attached)

I found that blocks_read on 13.vmstorage is higher than on the other vmstorage nodes when 13.vmstorage CPU is high. I don't know if this is a clue.

rate(vm_vmselect_rpc_requests_total{srv="$srv", action="search"}[1m])
sum(rate(vm_vmselect_metric_blocks_read_total[1m])) by (instance) / sum(rate(vm_vmselect_rpc_requests_total{action="search"}[1m])) by (instance)

(screenshots attached)

The pprof when CPU is high; the Less flat time is the highest. 13.error.cpu.pprof.12.15.zip

valyala commented 1 year ago

@petanne , could you share graphs for the following queries over the same time range as the graphs above, if those graphs show any anomalies around the CPU usage spike time frames for the instance 13.vmstorage.monitor.oss while the remaining instances show no anomalies?

petanne commented 1 year ago

@valyala , I reproduced this problem again.

  1. Stop both vmalert instances.
  2. Upgrade all vmstorage nodes to https://github.com/VictoriaMetrics/VictoriaMetrics/commit/fbeebe4869892a9dd3c684da5fbaa0a6c511a651 .
  3. Start one vmalert instance with -evaluationInterval=30s (red in the graphs).

(screenshots attached)

valyala commented 1 year ago

@petanne , thanks for the provided graphs! Could you try building and running vmstorage from the commit 8c08d625eeb4c5b224b031f27693b62bc4403602 ?

valyala commented 1 year ago

@petanne , the commit 8c08d62 has been included in v1.85.2, so you can try the official vmstorage from the v1.85.2-cluster tag instead of building it on your own. It also includes an important bugfix for https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3502

petanne commented 1 year ago

@valyala , it works, but a new problem has arisen.

Upgraded 13.vmstorage to v1.85.2-cluster; the other vmstorage nodes are still on https://github.com/VictoriaMetrics/VictoriaMetrics/commit/fbeebe4869892a9dd3c684da5fbaa0a6c511a651 .

(screenshots attached)

Refer to the graph below for the Prometheus historical data import.

(screenshot attached)

Questions

  1. Why did the 1.vmstorage node's CPU become high when the Prometheus historical data import started?
  2. In the absence of any operations, why did 13.vmstorage take over from 1.vmstorage as the node whose CPU became high?
  3. Why does sum(rate(vm_vmselect_metric_blocks_read_total[1m])) by (instance) / sum(rate(vm_vmselect_rpc_requests_total{action="search"}[1m])) by (instance) go high when the CPU goes high?
petanne commented 1 year ago

To add to my last comment: we stopped the data import and the CPU problem still occurred. Then we changed -evaluationInterval=10s to -evaluationInterval=30s on 1.vmalert and restarted it, and the CPU went back to normal.

valyala commented 1 year ago

@petanne , thanks for providing useful graphs! It looks like the CPU usage correlates strongly with the number of parts of the storage/inmemory type. A big number of data parts may significantly increase CPU usage during queries. The commit 8dc04a86f6be9be46796e9980944a7ced793a05d should reduce the maximum number of data parts under high load by increasing the chances of foreground parts merges during data ingestion (aka assisted merges).
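For reference, the parts count per node can be graphed with a query along these lines (assuming the vm_parts metric exposed by vmstorage; adjust the job filter to your setup):

    sum(vm_parts{job=~"$job_storage", type="storage/inmemory"}) by (instance)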

@petanne , could you build vmstorage from the commit 8dc04a86f6be9be46796e9980944a7ced793a05d and verify whether this helps reduce CPU usage under your workload?

valyala commented 1 year ago

@petanne , could you try building VictoriaMetrics cluster components from the commit 103dfd0525fc4c78b6ad55a37e9be2aa45aa840f ? It contains additional performance fixes on top of those mentioned at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3425#issuecomment-1366967143 .

ikenchina commented 1 year ago

The problem was fixed after I commented out the WaitIfNeeded function in PaceLimiter.

valyala commented 1 year ago

The problem was fixed after I commented out the WaitIfNeeded function in PaceLimiter.

Thanks! Then the issue must be fixed in the commit 09d7fa27372fc9252e01de00a863f3b7e075b4a5 , which removes the PaceLimiter . This commit is included in the latest release.

valyala commented 9 months ago

FYI, one vmstorage node could suddenly use more CPU than the other vmstorage nodes when it couldn't keep up with background merging of newly created indexdb parts, as described at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5190 . This issue has been addressed in v1.97.0, so closing it as fixed.

@petanne , please create a new issue and post a link to this one there if you still see sudden CPU usage spikes on some vmstorage nodes in a VictoriaMetrics cluster running v1.97.0 or a newer release.