Hi @petanne! Thanks for the detailed report!
Our cluster has 12 vmstorage nodes (each with 48 cores, 384GB memory and plenty of free disk, only 3% used)
How many vmstorage nodes are on the same instance? Are there other heavy processes running on the same instance? Could you please show RSS memory usage on problematic storage node and its memory limit?
max_over_time(process_resident_memory_bytes{}[10m])
vm_available_memory_bytes{}
Hi @hagen1778! Thanks for the reply! There is one vmstorage process on each physical machine. There are no other heavy processes.
Red is the problematic time range.
Some metrics from node_exporter.
I have 10 Prometheus nodes, with different targets and alert rules on each, and they have been running for more than a year. In the VictoriaMetrics cluster, both the targets and the alert rules are the same as in Prometheus.
Could some special metrics or queries cause this problem?
Could some special metrics or queries cause this problem?
It could. But I'd expect it to happen only when vmstorage experiences memory shortage. Usually, such an extensive and long CPU increase is a sign of page cache thrashing, which happens due to memory shortage or because some other process on the instance is doing write/read-heavy operations and polluting the cache. But according to your screenshots, vmstorage uses about 10% of the available memory, so this shouldn't be the case.
Do you have the following panels on your dashboard?
They might have slightly different names, but they can still be used to show how heavy the queries were at that time. If you don't have these panels, please find them here.
Do you have the following panels on your dashboard?
Red is the problematic time range.
Another test case: I tried increasing -search.maxConcurrentRequests for vmselect, but the problem still occurs.
I found in pprof that the lib/storage.(*TSID).Less
method is called more times than normal. What would cause this?
We tried setting -search.maxUniqueTimeseries for vmstorage to 100000, but the problem still occurs.
I found in pprof that the lib/storage.(*TSID).Less method is called more times than normal. What would cause this?
More times or just takes more time? This method is supposed to be at the top of CPU usage for all vmstorage nodes, since it is responsible for filtering data parts during lookups. The more read pressure there is, the higher the impact on parts filtering.
Could you please run the following query for problematic storage?
max(
vm_cache_size_bytes{job=~"$job_storage", instance=~"$instance"}
/
vm_cache_size_max_bytes{job=~"$job", instance=~"$instance"}
) by(type)
It is supposed to show the cache usage % by type for the vmstorage node.
More times or just takes more time?
In the problematic time range, lib/storage.(*TSID).Less has a 30% flat value on the problematic vmstorage, but about 5% on the other vmstorage nodes.
This is the cache usage % by type for the vmstorage nodes; 13.vmstorage is the problematic node, and 1.vmstorage's CPU is normal. All other vmstorage nodes look like 1.vmstorage.
max(vm_cache_size_bytes{job=~"$job", instance=~"$instance"} / vm_cache_size_max_bytes{job=~"$job", instance=~"$instance"}) by (instance, type)
But this query says vm_cache_size_bytes uses only a small amount of memory.
sum(vm_cache_size_bytes{job=~"$job", instance=~"$instance"}) by(type)
sum(vm_allowed_memory_bytes{job=~"$job", instance=~"$instance"})
I found in the source code that the default value of -storage.cacheSizeIndexDBTagFilters is int(float64(memory.Allowed()) / 32). Could sum(vm_cache_size_bytes) reaching the value of vm_cache_size_max_bytes be the cause of the problem?
But the tagFilters cache size is the same on all 12 vmstorage nodes. That doesn't seem to explain why only one node is having the problem.
vm_cache_size_bytes{type="indexdb/tagFilters"}
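For illustration only, here is a minimal Go sketch of how that default is derived from the allowed process memory (my own sketch, not VictoriaMetrics code; memoryAllowed stands in for memory.Allowed()):

package main

import "fmt"

// defaultTagFiltersCacheSize mirrors the default quoted above:
// 1/32 of the memory allowed for the vmstorage process.
func defaultTagFiltersCacheSize(memoryAllowed int) int {
	return int(float64(memoryAllowed) / 32)
}

func main() {
	const gib = 1 << 30
	// Example input: 384 GiB, an upper bound for the nodes described in this issue.
	fmt.Printf("default tagFilters cache: %d GiB\n", defaultTagFiltersCacheSize(384*gib)/gib)
}

So the cache ceiling scales with the node's memory, which is why it comes out identical on all 12 nodes.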
The commit 8e9822bc7f0f0591555be4faa76dd5af431e2000 optimizes the hotspot in the TSID.Less() function.
@petanne , could you try building vmstorage from this commit according to these instructions and verify whether the TSID.Less() call disappears from the CPU profile collected from the updated vmstorage? It is safe to run vmstorage from this commit in a cluster with the other components at v1.82.1.
@valyala, thanks! It worked!
The TSID.Less() flat value in the CPU pprof is 3.87% now; it was 5% before, under normal conditions.
Looking forward to the release. This time the sudden problem occurred at 2022-12-01 00:05 UTC, when the partition for the new month was created. Could you tell me why, and will it show up again next month?
VictoriaMetrics is great! Where can I learn about partitions and blocks? What happens during a search request in vmstorage?
@valyala @hagen1778 Bad news: the problem happened again, and other problems arose when we did stress tests and outage drills.
vmstorage: 8e9822b
vmagent: v1.84.0-cluster
vminsert: v1.84.0-cluster
vmselect: v1.84.0-cluster
vmalert: v1.84.0-cluster
Yesterday, I set -search.disableCache=true in vmselect and changed -evaluationInterval=30s to 10s in vmalert. It worked fine until the next operation.
17:45 Rebooted the 13.vmstorage server. 13.vmstorage took up 90% of CPU from the moment the process started, until we stopped both vmalert processes 20 minutes later. This is question 1.
18:30 Started the 2 vmalert processes. Everything was OK.
19:18 Rebooted the 1.vmstorage server. The 13.vmstorage CPU usage increased to 90% immediately. This is question 2.
19:21 The 1.vmstorage process auto-started, but its CPU usage increased to 90% immediately, while 13.vmstorage CPU usage returned to normal. This is question 3.
19:18 ~ 19:26 vmagent and vminsert network traffic gradually dropped to zero and then gradually recovered; this one is more serious. This is question 4.
19:32 Stopped all vmalert processes. 1.vmstorage CPU usage returned to normal, and vmagent and vminsert network traffic doubled. It looks like it was blocked by 1.vmstorage. This is question 5.
19:42 Changed -evaluationInterval=10s to 30s, then started the 2 vmalert processes. The 13.vmstorage CPU usage fluctuated frequently between 15% and 30%, while the other vmstorage nodes behaved normally, until 22:35. This is question 6.
22:35 Nobody did anything, yet the 13.vmstorage CPU burst to 90%, as described at the beginning of this issue. This is question 7.
Red is 13.vmstorage. Green is 1.vmstorage. Blue is 2.vmstorage, normal example for comparison.
Network traffic for one vmagent node.
All VM components should keep working smoothly when any one machine goes down.
I have read the official documentation and part of the source code, but I still can't answer the above 7 questions. Maybe I'm about to give up on VictoriaMetrics.
This is question 1.
Does pprof show the same issue?
Just for context: we discussed the issue with @valyala before he submitted a fix. The fix doesn't actually fix anything; it is a workaround for an issue we suspect happens in your setup. If you take a look at the profile you provided initially, you'll find that the majority of the time is spent on a simple comparison operation:
. . 104:func (t *TSID) Less(b *TSID) bool {
. . 105: // Do not compare MetricIDs here as fast path for determining identical TSIDs,
. . 106: // since identical TSIDs aren't passed here in hot paths.
393.32s 399.16s 107: if t.AccountID != b.AccountID {
. . 108: return t.AccountID < b.AccountID
. . 109: }
While this operation is supposed to be very lightweight, it takes too much time in your case. The assumption is that b *TSID doesn't fit in the CPU cache, so the process takes extra time to fetch it from memory. We don't know why this happens yet; it is a pretty rare situation across all the installations we have had a chance to investigate.
So far, we think that adding more storage shards could help stabilize the situation. You can significantly reduce the allocated resources per node in favor of using more instances with lower resources for the vmstorage layer.
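For intuition only, here is a small, hypothetical Go sketch (not VictoriaMetrics code; the tsidLike struct and sizes are invented) showing how an identical comparison loop slows down once its working set stops fitting in the CPU caches:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// tsidLike is a made-up struct padded so each element occupies one cache line.
type tsidLike struct {
	AccountID uint32
	ProjectID uint32
	MetricID  uint64
	_         [48]byte
}

// scan performs the same cheap comparison on every element.
func scan(items []tsidLike, probe tsidLike) int {
	n := 0
	for i := range items {
		if items[i].AccountID != probe.AccountID || items[i].MetricID < probe.MetricID {
			n++
		}
	}
	return n
}

func measure(size int) time.Duration {
	items := make([]tsidLike, size)
	for i := range items {
		items[i].MetricID = rand.Uint64()
	}
	probe := tsidLike{MetricID: rand.Uint64()}
	start := time.Now()
	sink := 0
	for iter := 0; iter < 100; iter++ {
		sink += scan(items, probe)
	}
	_ = sink
	return time.Since(start) / 100
}

func main() {
	// ~256 KiB fits in L2/L3 caches; ~64 MiB does not, so most accesses miss the cache.
	for _, size := range []int{1 << 12, 1 << 20} {
		fmt.Printf("%8d elements: %v per scan\n", size, measure(size))
	}
}

In the second case the per-element cost is dominated by memory fetches rather than by the comparison itself, which is the effect we suspect on the problematic node.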
Does pprof show the same issue?
Sorry for not preserving the scene and saving a pprof. But I have another pprof with the same problem for question 6. 13.cpu.pprof.12.6.zip
I will try to get a pprof when the CPU hits 90% again.
Based on our capacity assessment and machine resources, we use 12 vmstorage servers, each with 48 cores, 384GB memory and a 15TB disk. Although the machine configuration is high, the data volume and usage rate are low for now. I'm sorry, there are no more machines available to swap in at this time.
I am not using multi-tenancy; every AccountID is the default value of 0. Could this cause some unexpected behavior?
With only one vmalert started with -evaluationInterval=10s, the 13.vmstorage CPU is at 90%. storage.(*TSID).Less called from storage.(*partSearch).nextBHS is the biggest CPU consumer again.
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.(*partSearch).nextBHS
/root/VictoriaMetrics/lib/storage/tsid.go
Total: 177.22s 178.12s (flat, cum) 14.76%
107 169.88s 170.75s if t.AccountID != b.AccountID {
Does nextBHS need an optimization for Less?
func (ps *partSearch) nextBHS() bool {
	for len(ps.metaindex) > 0 {
		// Optimization: skip tsid values smaller than the minimum value
		// from ps.metaindex.
		for ps.BlockRef.bh.TSID.Less(&ps.metaindex[0].TSID) {
			if !ps.nextTSID() {
				return false
			}
		}
As a supplementary note, in the above case the replica of the series data stored on 13.vmstorage is on 1.vmstorage.
I want to know why so many data blocks are retrieved on 13.vmstorage. I checked all the alert queries and found nothing suspicious.
Based on our capacity assessment and machine resources, we use 12 vmstorage servers, each with 48 cores, 384GB memory and a 15TB disk.
What I'm saying is that you can replace a 48-core machine with 3 machines of 16 cores each, and if the assumption about the CPU cache is correct, the problem will be solved. But I understand it could be complicated when not running in the cloud.
Could this cause some unexpected behavior?
No.
I want to know why so many data blocks are retrieved on 13.vmstorage.
It is unclear. We haven't observed such situations before, even with a much bigger load on less hardware. Could you please share an example of the alert queries you run?
I'd also wait for @valyala's response here, since he has a better understanding of this. He's in a US time zone and hopefully will be able to respond later.
Yes, we use physical machines, not cloud. Sorry, there are not enough machines to replace 48 cores with 3 * 16 cores.
Could you please share an example of the alert queries you run?
Here are some typical alert queries. They have been working in our Prometheus for many years.
up{port="9100"} == 0
sum(increase(log_monitor_counter{level="err"}[1m])) >= 10
or
log_monitor_counter{level="err"} >= 10
unless log_monitor_counter{level="err"} offset 1m
unless log_monitor_counter{level="err"} offset 10m
sum(rate(nginx_http_upstream_request_times_count{endpoint=~".*xxxx"}[1m])) / sum(rate(nginx_http_upstream_request_times_count{endpoint=~".*xxxx"}[1m] offset 1d)) < 0.8
and
sum(rate(nginx_http_upstream_request_times_count{endpoint=~".*xxxx"}[1m])) / sum(rate(nginx_http_upstream_request_times_count{endpoint=~".*xxxx"}[1m] offset 1w)) < 0.8
rate(is_machine_alive_change{}[3m]) < 0.002 and (hour() > 19 or hour() < 16)
Set -search.disableCache=true in vmselect
Why would you do that? Disabling caches also increases the amount of work for data block scanning. Normally, some previously read blocks are cached in memory to reduce lookups.
17:45 Rebooted the 13.vmstorage server. 13.vmstorage took up 90% of CPU from the moment the process started, until we stopped both vmalert processes 20 minutes later. This is question 1.
It could be that after the restart the caches on 13.vmstorage didn't fill up enough to remain efficient. That usually doesn't take much time, but here we assume the load on the node is too high (for a so-far-unknown reason).
19:18 Rebooted the 1.vmstorage server. The 13.vmstorage CPU usage increased to 90% immediately. This is question 2.
When one storage goes off, vminserts start to re-route data from the "dead" storage to other "healthy" storages. Re-routing always increases the load on the rest of the nodes. And if our theory of vmstorage-running-on-the-edge is correct, then this additional load only increases the probability of entering this state.
19:21 The 1.vmstorage process auto-started, but its CPU usage increased to 90% immediately, while 13.vmstorage CPU usage returned to normal. This is question 3.
It is likely the same situation as in q1: restoring the caches under read query load. 13.vmstorage stopped receiving the additional load from re-routing and got back to normal.
19:18 ~ 19:26 vmagent and vminsert network traffic gradually dropped to zero and then gradually recovered; this one is more serious. This is question 4.
There are two types of re-routing in vm cluster:
p1: re-routing when a vmstorage node becomes unavailable;
p2: re-routing when a vmstorage node is slow and cannot keep up with the ingestion rate.
In your case, re-routing p2 was disabled. So when one vmstorage became slow, the ingestion speed was limited by the slowest vmstorage node in the cluster. You need to enable re-routing by setting -disableRerouting=false on the vminsert nodes.
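For illustration, a minimal sketch of what this could look like (the binary path and storage node addresses below are placeholders, not taken from this setup):

# Hypothetical vminsert invocation; -disableRerouting=false lets vminsert
# re-route data away from vmstorage nodes that cannot keep up.
/path/to/vminsert-prod \
    -storageNode=vmstorage-1:8400,vmstorage-2:8400,vmstorage-3:8400 \
    -disableRerouting=false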
19:32 Stopped all vmalert processes. 1.vmstorage CPU usage returned to normal, and vmagent and vminsert network traffic doubled. It looks like it was blocked by 1.vmstorage. This is question 5.
Read load was reduced, so 1.vmstorage got back to normal for processing incoming data.
19:42 Changed -evaluationInterval=10s to 30s, then started the 2 vmalert processes. The 13.vmstorage CPU usage fluctuated frequently between 15% and 30%, while the other vmstorage nodes behaved normally, until 22:35. This is question 6. 22:35 Nobody did anything, yet the 13.vmstorage CPU burst to 90%, as described at the beginning of this issue. This is question 7.
Back to my theory: I assume the vmstorage nodes in this setup for some reason run on the edge in terms of compute resources. We have seen similar behaviour, but only for setups where memory, CPU or disk were very close to 90-99% utilization. Your case is different: according to the provided data, there are plenty of non-utilized resources. So the only assumption so far is CPU cache issues. When the read load crosses the "edge", vmstorage falls into a vicious circle where normal operations start to take too much time. This is when it enters 90% CPU utilization.
Let's try to verify this theory by checking the CPU cache miss ratio during normal operation and faulty operation. Go processes do not have access to this info, so I'd like to ask you to take the measurements on your own. Please take a look at this discussion. This comment provides an example of how to measure misses in the L3 cache; it would be great if you could take the measurements before and during the incident. Please note the sleep command at the end; it is likely the interval of the measurement, and it would be better to extend it during the measurements.
Yes, we use physical machines, not cloud. Sorry, there are not enough machines to replace 48 cores with 3 * 16 cores.
Understandable. If it's an option, you could try running the test once again in the cloud, on the cloud's hardware. You can rule out a hardware issue or try different topologies by using the cloud, even if only temporarily.
As I said earlier, we haven't observed the issue like yours before. So if you're using some kind of benchmark for generating the ingested data - let us know, so we can try to reproduce it.
Here are some typical alert queries. They have been working in our Prometheus for many years.
The queries look normal. Thank you!
Thank you for your careful answer.
Why would you do that? Disabling caches also increases the amount of work for data block scanning. Normally, some previously read blocks are cached in memory to reduce lookups.
This was in order to do a pressure test of vmstorage. The cache was enabled before this test.
$ perf stat -a -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetch-misses sleep 10
On 13.vmstorage, when the CPU is at 90%.
Performance counter stats for 'system wide':
1,672,017,968 LLC-loads (50.02%)
566,498,839 LLC-load-misses # 33.88% of all LL-cache hits (50.01%)
307,252,131 LLC-stores (50.01%)
130,520,462 LLC-store-misses (50.01%)
<not supported> LLC-prefetch-misses
10.009098222 seconds time elapsed
On 2.vmstorage, at the same time.
Performance counter stats for 'system wide':
1,289,752,696 LLC-loads (50.00%)
378,444,294 LLC-load-misses # 29.34% of all LL-cache hits (50.00%)
323,470,831 LLC-stores (50.00%)
107,277,991 LLC-store-misses (50.00%)
<not supported> LLC-prefetch-misses
10.001611238 seconds time elapsed
As a comparison, the following is when only one vmalert process is started and the CPU is normal. On 13.vmstorage.
Performance counter stats for 'system wide':
1,149,413,385 LLC-loads (50.00%)
336,088,891 LLC-load-misses # 29.24% of all LL-cache hits (50.00%)
310,156,805 LLC-stores (50.00%)
100,309,852 LLC-store-misses (50.00%)
<not supported> LLC-prefetch-misses
10.001550442 seconds time elapsed
On 2.vmstorage.
Performance counter stats for 'system wide':
1,398,281,187 LLC-loads (50.00%)
341,360,772 LLC-load-misses # 24.41% of all LL-cache hits (50.00%)
295,972,084 LLC-stores (50.00%)
93,261,754 LLC-store-misses (50.00%)
<not supported> LLC-prefetch-misses
10.001704804 seconds time elapsed
How can I see how many data blocks were scanned by a query? I can't find it in the query tracing output.
From when this issue started until now, we have been using vmctl to import Prometheus historical data, with the import rate limited. There has only been a short time in between when no import was being performed.
/app/vmctl/vmctl-prod prometheus \
    --vm-addr=http://vminsert_addr/ \
    --prom-snapshot=/data/transit/p14/prometheus/snapshots/20221206T035732Z-6aeb8e98f4c7f371/ \
    --vm-concurrency=48 \
    --vm-batch-size=20480000 \
    --prom-concurrency=48 \
    --prom-filter-time-end=2022-11-08T00:07:01Z \
    --vm-account-id=0 \
    --vm-rate-limit=30720000 \
    -s=true
This command is executed on a separate new machine. I don't know if this will make a difference.
So if you're using some kind of benchmark for generating the ingested data - let us know, so we can try to reproduce it.
Sorry, no benchmark data. The time series are scraped by vmagent from our production environment.
Hi @valyala, could you help analyze it?
@petanne , could you try building vmstorage from the commit fbeebe4869892a9dd3c684da5fbaa0a6c511a651 and verify whether this helps resolve the issue? See the build instructions. It is safe to run vmstorage from this commit with other cluster components built from older commits (e.g. previous releases).
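For reference, a rough sketch of the build steps (the make target name vmstorage-prod is an assumption; the linked build instructions are authoritative):

git clone https://github.com/VictoriaMetrics/VictoriaMetrics.git
cd VictoriaMetrics
git checkout fbeebe4869892a9dd3c684da5fbaa0a6c511a651
# Target name is an assumption; follow the linked instructions if it differs.
make vmstorage-prod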
Where can I learn about partitions and blocks? What happens during a search request in vmstorage?
Please take a look at https://docs.victoriametrics.com/#storage and https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3268#issuecomment-1291249751
@petanne , could you try building vmstorage from the commit fbeebe4 and verify whether this helps resolve the issue?
I built fbeebe4; the problem still occurs.
I found that blocks_read on 13.vmstorage is higher than on the other vmstorage nodes when the 13.vmstorage CPU is high. I don't know if this is a clue.
rate(vm_vmselect_rpc_requests_total{srv="$srv", action="search"}[1m])
sum(rate(vm_vmselect_metric_blocks_read_total[1m])) by (instance) / sum(rate(vm_vmselect_rpc_requests_total{action="search"}[1m])) by (instance)
The pprof when the CPU is high. The Less flat value is the highest.
13.error.cpu.pprof.12.15.zip
@petanne , could you share graphs for the following queries on the same time range as the graphs above, if these graphs expose some anomalies around the CPU usage spike time frames for the instance 13.vmstorage.monitor.oss while the remaining instances have no anomalies?
vm_parts{type="storage/inmemory"}
vm_parts{type="storage/small"}
vm_parts{type="indexdb/inmemory"}
vm_parts{type="indexdb/file"}
rate(process_io_storage_read_bytes_total{instance=~".+vmstorage.+"})
rate(process_io_storage_written_bytes_total{instance=~".+vmstorage.+"})
@valyala , I reproduced this problem again with -evaluationInterval=30s.
@petanne , thanks for the provided graphs! Could you try building and running vmstorage from the commit 8c08d625eeb4c5b224b031f27693b62bc4403602?
@petanne , the commit 8c08d62 has been included in v1.85.2, so you can try the official vmstorage from the v1.85.2-cluster tag instead of building it on your own. It also includes an important bugfix for https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3502
@valyala , it works, but a new problem has arisen.
Upgraded 13.vmstorage to v1.85.2-cluster; the other vmstorage nodes are still on https://github.com/VictoriaMetrics/VictoriaMetrics/commit/fbeebe4869892a9dd3c684da5fbaa0a6c511a651 .
Refer to the graph below for the historical data import from Prometheus.
sum(rate(vm_vmselect_metric_blocks_read_total[1m])) by (instance) / sum(rate(vm_vmselect_rpc_requests_total{action="search"}[1m])) by (instance)
go high when the CPU goes high? To add to my last comment: I stopped the data import and the CPU problem still occurs.
Then I changed -evaluationInterval=10s to -evaluationInterval=30s on 1.vmalert and restarted it, and the CPU went back to normal.
@petanne , thanks for providing useful graphs! It looks like the CPU usage highly correlates with the number of parts of the storage/inmemory type. A big number of data parts may significantly increase CPU usage during queries. The commit 8dc04a86f6be9be46796e9980944a7ced793a05d should reduce the maximum number of data parts under high load by increasing the chances of merging parts in the foreground during data ingestion (aka assisted merge).
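As a hedged example of how to keep an eye on this (adjust the job/instance filters to match the other dashboard panels), the number of such parts per node can be graphed with:

sum(vm_parts{type="storage/inmemory"}) by (instance)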
@petanne , could you build vmstorage from the commit 8dc04a86f6be9be46796e9980944a7ced793a05d and verify whether this helps reduce CPU usage under your workload?
@petanne , could you try building VictoriaMetrics cluster components from the commit 103dfd0525fc4c78b6ad55a37e9be2aa45aa840f ? It contains additional performance fixes on top of those mentioned at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3425#issuecomment-1366967143 .
The problem is fixed after I commented out the WaitIfNeeded function in PaceLimiter.
The problem is fixed after I commented out the WaitIfNeeded function in PaceLimiter.
Thanks! Then the issue must be fixed in the commit 09d7fa27372fc9252e01de00a863f3b7e075b4a5, which removes the PaceLimiter. This commit is included in the latest release.
FYI, one vmstorage node could suddenly use more CPU than other vmstorage nodes when it couldn't keep up with the background merging of newly created indexdb parts, as described at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5190 . This issue has been addressed in v1.97.0. Closing this issue as fixed, then.
@petanne , please create a new issue and post a link to this one there if you still see sudden CPU usage spikes on some vmstorage nodes in a VictoriaMetrics cluster running v1.97.0 or a newer release.
Describe the bug Our cluster has 12 vmstorage nodes (each with 48 cores, 384GB memory and plenty of free disk, only 3% used), 4 vmselect, 4 vminsert, 4 vmagent and 1 vmalert. Only vmalert makes read requests. But one vmstorage node's CPU suddenly becomes high without any human operation; neither insert requests nor search requests increased. When we stop the vmalert, it recovers. I have replaced the machine and double-checked; it should not be a machine problem.
To Reproduce Starting vmalert again, nothing happened. But I have another way to reproduce it: add a new vmalert and start it, and one vmstorage's CPU becomes high quickly, within about 1 minute. Then stop the problematic vmstorage, and another vmstorage's CPU becomes high quickly. When the second problematic vmstorage is stopped, yet another vmstorage reproduces the problem.
Expected behavior vmstorage CPU running normally.
Logs No error or warning logs in vmstorage.
Screenshots stats
vmstorage CPU
A test case: stop the problematic vmstorage; it looks like at least one vmstorage's CPU must go high?
I think maybe something is wrong in vmstorage, because its insert and search requests are balanced and consistent across nodes.
rate(vm_vminsert_metrics_read_total[1m])
rate(vm_vmselect_rpc_requests_total[1m])
vmstorage other metrics
problematic vmstorage go metric
pprof and trace when CPU high pprof_and_trace.zip
Version
Used command-line flags