Open ShyunnY opened 3 months ago
hey @ShyunnY
looks like this metric can be useful only if the channel's size is growing slowly (if it grows fast, you'll see an increase of the vmalert_remotewrite_dropped_rows_total counter anyway), but I'm not sure if that's a common case
thanks for your answer @AndrewChubatiuk
In fact, when a large number of metrics is generated quickly and a lot of computation is required, we really do need to observe the state of the input channel. vmalert_remotewrite_dropped_rows_total helps us observe somewhat, but it is not very effective: it only tells us about rows after they have already been dropped.
Imagine: we could dynamically adjust remoteWrite.maxQueueSize or increase remoteWrite.concurrency according to changes of a Gauge on the input channel, to avoid metrics being dropped. I think this would be great!
we can dynamically adjust remoteWrite.maxQueueSize or increase remoteWrite.concurrency
could you please describe this setup in detail? what tool are you considering for this adjustment?
This configuration is not the focus of our discussion; it is just my way of solving the problem.
I think we should focus on whether a Gauge metric should be provided for the input channel.
WDYT?
I don't think adding this metric makes sense without a use case showing how it would be useful; that's why I was asking about a setup in which it could be utilized. I can imagine that connection count, latency, CPU usage, queue size, etc. could be metrics an autoscaler relies on to scale pods either vertically or horizontally. But for the case you've described, I was interested in how you are going to assess whether flag values or pod resources should be changed, and what tool will be responsible for such decisions.
Also, for a node or pod with a static amount of resources, the optimal values for both remoteWrite.maxQueueSize and remoteWrite.concurrency look predictable, and there is no benefit in making them dynamic: concurrency should be proportional to the core count, and changing maxQueueSize dynamically doesn't help if an agent with optimal concurrency set still doesn't keep up.
@AndrewChubatiuk Thanks for your reply, it is much appreciated :)
Let me give you more information:
Background: We do not deploy on a k8s cluster but on traditional physical machines (which brings a problem: we cannot observe some vmalert metrics and dynamically scale vmalert pods horizontally).
Problem: We are in the early stage of migrating from Prometheus to VM, and most of the VM components run well (we are very satisfied with them). During the daily business peak we found that the number of metrics increased sharply, which put a lot of pressure on vmalert (the alerting rules we configured are relatively complex): the remote write speed could not keep up, and a large number of metrics were dropped.
How we solved it: After discovering this problem, we solved it by increasing remoteWrite.concurrency.
My idea: When we increased concurrency, we found that it put a lot of pressure on the memory and CPU of our physical machine, which in turn affected other applications. If we had a metric on the input channel, we could observe the state of remote write processing and dynamically scale the vmalert component at different times (using scripts or other methods), or dynamically modify the vmalert configuration based on this metric.
Additional information: As you said, vmalert_remotewrite_dropped_rows_total also reflects a certain state of remote write, but by the time this metric changes, the pressure on the remote write channel is already too high. We don't want to drop metrics at all.
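The script-based adjustment described above could be sketched roughly as follows. This is only an illustration under assumptions: the gauge name vmalert_remotewrite_send_queue_size does not exist in vmalert today (it is the metric this issue proposes), and in a real setup the exposition text would be fetched from vmalert's /metrics endpoint.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseGauge extracts the value of an unlabeled metric from
// Prometheus text exposition format. Comment lines are skipped.
func parseGauge(exposition, metricName string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "#") || !strings.HasPrefix(line, metricName) {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		return v, true
	}
	return 0, false
}

// shouldScaleUp implements the decision described in the thread:
// if the input channel is more than highWatermark full, it is time
// to restart vmalert with a higher -remoteWrite.concurrency or with
// more resources, before rows start getting dropped.
func shouldScaleUp(queueLen, maxQueueSize, highWatermark float64) bool {
	return queueLen >= maxQueueSize*highWatermark
}

func main() {
	// In a real setup this text would come from GET http://vmalert:8880/metrics.
	metricsText := `
# HELP vmalert_remotewrite_send_queue_size (hypothetical gauge)
vmalert_remotewrite_send_queue_size 850
`
	qLen, ok := parseGauge(metricsText, "vmalert_remotewrite_send_queue_size")
	fmt.Println(ok, qLen, shouldScaleUp(qLen, 1000, 0.8)) // true 850 true
}
```

The watermark (0.8 here) and maxQueueSize are placeholders; the point is only that a Gauge lets the operator react before vmalert_remotewrite_dropped_rows_total starts increasing.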
hey @AndrewChubatiuk, have you started on this? I've already made the change on my side. If possible, I would like to submit it as my first PR
Hey @ShyunnY, I've accidentally linked a PR to this issue. I have a question regarding your approach. You've mentioned
we found that this would put a lot of pressure on the memory and CPU of our physical machine, which in turn affected other applications
why don't you limit resources for vmalert to get rid of the impact on other applications?
It looks like we are going in circles here, which doesn't look good.
If resources are limited too aggressively, metrics processing cannot keep up and rows will be dropped.
So dynamically adjusting our resources based on a length metric of the input channel is a good solution.
you've mentioned that you cannot permanently increase either concurrency or maxQueueSize, as it impacts other applications running on the same host. so how would this new metric help you deal with load on vmalert when you have no extra resources for it?
Let me describe it another way:
vmalert initially runs with fewer resources (to save resources) -> during peak hours, metrics increase dramatically, causing remote write to process too slowly and drop a large number of metrics -> (assuming we can monitor the length of the input channel, we can temporarily give vmalert more resources when the threshold we set is reached)
additional note: I have tried forking the code and adding the metric for internal grayscale testing, and it seems to work well. The number of dropped metrics has been greatly reduced.
Is your feature request related to a problem? Please describe
When I use vmalert to evaluate alerting rules over some complex metrics, I found that when the number of metrics generated is large and the rate is fast, metrics get dropped instead of being written to vmstore. I checked the source code and found that the client passes prompbmarshal.TimeSeries through the input channel, so when the maxBatchSize option is configured too small or the -remoteWrite.concurrency flag is not configured, the client's processing speed cannot keep up. So I wonder if we can add an additional metric for the input channel, so that we can monitor the number of metrics sitting in the channel but not yet processed.

Describe the solution you'd like
It looks like we can add an additional Gauge metric to monitor changes in the input channel.

Before s enters the input channel we increase the Gauge: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/4a3d1f4848aca35a7aa810060e876ca47a6ac8d4/app/vmalert/remotewrite/client.go#L121-L130

After s leaves the input channel we decrease the Gauge: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/4a3d1f4848aca35a7aa810060e876ca47a6ac8d4/app/vmalert/remotewrite/client.go#L187-L194

Describe alternatives you've considered

No response

Additional information

No response