VictoriaMetrics / VictoriaMetrics

VictoriaMetrics: fast, cost-effective monitoring solution and time series database
https://victoriametrics.com/
Apache License 2.0

Whether to add additional metrics to remoteWrite Client in vmalert #6542

Open ShyunnY opened 3 months ago

ShyunnY commented 3 months ago

Is your feature request related to a problem? Please describe

When I use vmalert to evaluate some complex alerting rules, I found that when a large number of metrics are generated at a high rate, some of them are discarded instead of being written to vmstorage.

I checked the source code and found that the client passes prompbmarshal.TimeSeries values through an input channel, so when the maxBatchSize option is configured too small or the -remoteWrite.concurrency flag is not configured, the client's processing speed cannot keep up.
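For illustration, here is a minimal, self-contained sketch of the drop-on-full-channel pattern described above. It is not the actual vmalert code: the `client` type, its fields, and the `Push` method are simplified stand-ins for what the real remotewrite client does with its bounded input channel.

```go
package main

import "fmt"

// TimeSeries stands in for prompbmarshal.TimeSeries in this sketch.
type TimeSeries struct{}

// client is a simplified stand-in for the vmalert remotewrite client:
// Push tries to enqueue a series into a bounded channel and drops it
// when the channel is full because consumers cannot keep up.
type client struct {
	input       chan TimeSeries
	droppedRows int
}

func (c *client) Push(s TimeSeries) bool {
	select {
	case c.input <- s: // background workers read from c.input and flush batches to remote storage
		return true
	default: // channel is full -> the series is dropped
		c.droppedRows++
		return false
	}
}

func main() {
	c := &client{input: make(chan TimeSeries, 2)} // tiny queue, just to demonstrate drops
	for i := 0; i < 5; i++ {
		if !c.Push(TimeSeries{}) {
			fmt.Println("dropped series", i)
		}
	}
	fmt.Println("total dropped:", c.droppedRows)
}
```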

So I wonder if we can add an additional metric for the input channel, so that we can monitor the number of series that are stored in the channel but not yet processed.

Describe the solution you'd like

It looks like we could add an additional Gauge metric to track the fill level of the input channel.

Before s enters the input channel we increment the Gauge: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/4a3d1f4848aca35a7aa810060e876ca47a6ac8d4/app/vmalert/remotewrite/client.go#L121-L130

After s leaves the input channel we decrement the Gauge: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/4a3d1f4848aca35a7aa810060e876ca47a6ac8d4/app/vmalert/remotewrite/client.go#L187-L194
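A minimal sketch of what exposing such a metric could look like, assuming the github.com/VictoriaMetrics/metrics package that VictoriaMetrics components already use; the metric name `vmalert_remotewrite_send_queue_size` is only a placeholder, not an existing vmalert metric. Instead of incrementing and decrementing around the channel operations as proposed above, the callback form of the gauge can simply report len(input) on every scrape, which exposes the same information:

```go
package main

import (
	"os"

	"github.com/VictoriaMetrics/metrics"
)

// input stands in for the client's `chan prompbmarshal.TimeSeries`.
var input = make(chan struct{}, 100)

func main() {
	// Callback-based gauge: its value is computed on every scrape, so it always
	// reflects the current number of queued-but-unprocessed items.
	// The metric name is a placeholder, not an existing vmalert metric.
	_ = metrics.NewGauge(`vmalert_remotewrite_send_queue_size`, func() float64 {
		return float64(len(input))
	})

	// Simulate some queued series and dump the metrics in Prometheus text format.
	for i := 0; i < 42; i++ {
		input <- struct{}{}
	}
	metrics.WritePrometheus(os.Stdout, false)
}
```

Either form would give operators a direct view of how full the queue is before rows start being dropped.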

Describe alternatives you've considered

No response

Additional information

No response

AndrewChubatiuk commented 3 months ago

hey @ShyunnY looks like this metric can be useful only if the channel's size is growing slowly (if it grows fast, you will see an increase of the vmalert_remotewrite_dropped_rows_total counter), but I'm not sure that is a common case

ShyunnY commented 3 months ago

thanks for your answer @AndrewChubatiuk

In fact, when a large number of metrics are generated quickly and a lot of computation is required, we really do need to observe the state of the input channel.

vmalert_remotewrite_dropped_rows_total may help us observe this, but only after the fact, so it is not a great fit here.

Imagine: we can dynamically adjust remoteWrite.maxQueueSize or increase remoteWrite.concurrency according to changes of the input channel's Gauge to avoid metrics being dropped. I think this would be great!

AndrewChubatiuk commented 3 months ago

we can dynamically adjust remoteWrite.maxQueueSize or increase remoteWrite.concurrency

could you please describe this setup in detail? What tool do you consider using for this adjustment?

ShyunnY commented 3 months ago

This configuration is not the focus of our discussion; it is just my way of solving the problem.

I think we should focus on whether a Gauge metric should be provided for the input channel.

WDYT?

AndrewChubatiuk commented 3 months ago

I don't think adding this metric makes sense without a use case showing how it would be useful, which is why I was asking about a setup in which it could be utilized. I can imagine that connection count, latency, CPU usage, queue size, etc. could be metrics an autoscaler relies on to scale pods either vertically or horizontally. But for the case you've described, I was interested in how you are going to assess whether flag values or pod resources should be changed, and what tool will be responsible for such decisions.

Also, it looks like for a node or pod with a static amount of resources, the optimal values for both remoteWrite.maxQueueSize and remoteWrite.concurrency are predictable, so there is no benefit in making them dynamic. Concurrency should be proportional to the core count, and there is no benefit in changing maxQueueSize dynamically if an agent with optimal concurrency already set doesn't keep up.

ShyunnY commented 3 months ago

@AndrewChubatiuk Thanks for your reply, it is much appreciated :)

Let me give you some more information:

  1. Background: We do not deploy on a k8s cluster but on traditional physical machines (which brings a problem: we cannot watch vmalert metrics and scale vmalert pods horizontally on demand).

  2. Problem: We have started migrating from Prometheus to VictoriaMetrics, and most of the components are running well (we are very satisfied with them). During the daily business peak, the number of metrics increases sharply, which puts a lot of pressure on vmalert (our alerting rules are relatively complex), so the remote-write path cannot keep up and drops a large number of metrics.

  3. How we solved it: After we discovered this problem, we worked around it by increasing remoteWrite.concurrency.

  4. My idea: When we increased concurrency, we found that it puts a lot of pressure on the memory and CPU of our physical machine, which in turn affects other applications. Therefore, if the input channel exposed a queue-length metric, we could observe the state of remote-write processing and dynamically scale the vmalert component at different times (using scripts or other methods) or dynamically adjust the vmalert configuration based on that metric (a rough sketch of such a watcher follows this list).

  5. Additional information: As you said, vmalert_remotewrite_dropped_rows_total also reflects the state of remote write to a degree, but by the time this metric changes, the pressure on the remote-write channel is already too high. We don't want to drop metrics.
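To make point 4 concrete, here is a rough sketch of an external watcher, not part of vmalert, that polls vmalert's /metrics endpoint and reacts when the pending queue grows. The metric name, the threshold, and the reaction are all placeholders for whatever fits a particular setup:

```go
// watcher.go: external sketch that polls vmalert's /metrics endpoint and
// reacts when the pending remote-write queue grows past a threshold.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strconv"
	"strings"
	"time"
)

const (
	metricsURL = "http://localhost:8880/metrics"            // vmalert's default -httpListenAddr; adjust as needed
	metricName = "vmalert_remotewrite_send_queue_size"      // hypothetical metric name
	threshold  = 10000                                      // placeholder queue length that triggers an action
)

// queueSize scrapes the metrics page and returns the value of metricName.
func queueSize() (float64, error) {
	resp, err := http.Get(metricsURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, metricName) {
			fields := strings.Fields(line)
			return strconv.ParseFloat(fields[len(fields)-1], 64)
		}
	}
	return 0, fmt.Errorf("metric %q not found", metricName)
}

func main() {
	for range time.Tick(30 * time.Second) {
		v, err := queueSize()
		if err != nil {
			log.Printf("poll failed: %v", err)
			continue
		}
		if v > threshold {
			// Placeholder: restart vmalert with a higher -remoteWrite.concurrency,
			// notify an operator, or trigger whatever scaling action fits the setup.
			log.Printf("queue size %.0f exceeds threshold %d", v, threshold)
		}
	}
}
```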

ShyunnY commented 3 months ago

hey @AndrewChubatiuk, have you started on this? I've already made the change locally. If possible, I would like to submit it as my first PR.

AndrewChubatiuk commented 3 months ago

Hey @ShyunnY, I've linked a PR to this issue accidentally. I have a question regarding your approach. You've mentioned:

we found that this would put a lot of pressure on the memory and CPU of our physical machine, which in turn affected other applications

why don't you limit resources for VMAlert to get rid of the impact on other applications?

ShyunnY commented 3 months ago

It looks like we are stuck in a communication black hole, which doesn’t look good.

When the resources are set too small, metrics processing cannot keep up and metrics will be discarded. So dynamically adjusting our resources based on the length metric of the input channel is a good solution.

AndrewChubatiuk commented 3 months ago

you've mentioned that you cannot permanently increase either concurrency or maxQueueSize, as it impacts other applications running on the same host. So how should this new metric help you deal with the load on vmalert when you have no extra resources for it?

ShyunnY commented 3 months ago

Let me describe it another way:

vmalert initially runs with fewer resources (to save resources) -> during peak hours, metrics increase dramatically, causing remote write to process too slowly and drop a large number of metrics -> (assuming we can monitor the length of the input channel, we can temporarily give vmalert more resources once our chosen threshold is reached)

Additional note: I have tried forking vmalert and exposing this metric for an internal canary rollout, and it seems to work well. The number of dropped metrics has been greatly reduced.