Open · evairmarinho opened this issue 2 years ago
Hi @evairmarinho , roughly how many alerts would you like to have in your workspace?
Is this really going to be a thing and when? I'm also in need of 400+ alerts in one workspace.
What is the reason for the limit of 100? Is it an arbitrary limit or a limitation on the rules engine?
We also need the total alert threshold extended; a limit of 100 alerts is too small.
This limit is too low. Alerting is indeed something that affects performance and resources, and that is a reason to scale. But scaling is also the reason people would choose managed Grafana over self-hosted Grafana in the first place.
Any updates? We are also using a single workspace for all workloads.
We are using Grafana to monitor 148 hosts on 19 different metrics, each of which has a multidimensional alerting rule, so that makes 148 × 19 = 2,812 alert instances, and we expect this to increase over time.
@LuigiClemente-Awin is 100% right on here. We would much rather leave the scaling of Grafana to a managed service, but the limit of 100 alerts makes it a complete non-starter.
This suggestion from the docs is somewhat laughable:
You can lower the number of alert instances by removing alert rules, or by editing multidimensional alerts to have fewer alert instances (for example, by having one alert on errors per VM, rather than one alert on error per API in a VM).
Even with the 50-60 VMs example load mentioned earlier in the doc, following this suggestion means that half of the workspace's alerting capacity is used up. Reduce the alerting granularity further, and the alert becomes: "One of your 50 VMs is having errors; good luck figuring out which one!"
I feel like this is a case where you can reduce granularity for the alert and link to a dashboard that lets you drill into specifics; you should have that dashboard already if you were considering individual alerts in the first place. Reducing cardinality is called out in Grafana's documentation on alerting performance considerations. But it definitely seems ideal to have grouping of these if the VMs are completely unrelated, so maybe 50-60 VMs could turn into 3-6 alerts.
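To put rough numbers on that trade-off, here is a small illustrative sketch (the metric, label names, and fleet sizes are made up) comparing how many alert instances a per-API rule produces versus a per-VM rule, given that Grafana creates one alert instance per distinct label set returned by the query:

```python
# Illustrative only: how label cardinality drives alert-instance count.
# Metric and label names below are hypothetical.

# Per-API granularity: one alert instance per (vm, api) combination.
PER_API_EXPR = 'sum by (vm, api) (rate(http_errors_total[5m])) > 5'

# Per-VM granularity: one alert instance per vm; drill into the failing API
# on a dashboard instead of alerting on it directly.
PER_VM_EXPR = 'sum by (vm) (rate(http_errors_total[5m])) > 5'

vms, apis_per_vm = 55, 8  # made-up fleet size, roughly the 50-60 VM example

print("per-API instances:", vms * apis_per_vm)  # 440 -> far past the 100 quota
print("per-VM instances:", vms)                 # 55  -> still over half the quota
```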
The non-adjustable quota of 100 rules is very disappointing. One can reasonably optimize to stay under it, but with a lot of different metrics that cap will eventually be reached.
Bumping this thread.
Would it be possible to get an official response regarding whether this limit can be increased at all (either directly by AWS or by asking for a quota increase)?
The 100-alarm limit is plainly too low for any serious, mildly complex system. Bear in mind it's not just VMs people monitor with Grafana, but all kinds of things such as SQS queues, EventBridge events, Lambda invocations and so on.
Staying within the 100-alarm limit becomes nigh impossible when you want to track so many moving parts.
I understand the ask, and I totally get the disappointment with an alert limit of 100. Can folks help answer a few questions?
We also need to go beyond the 100 alert limit that is currently in place.
Thanks for sharing the use cases! @chris13524 Managed Prometheus provides a more scalable way to create alert rules; refer to the docs here. In addition, you can visualize these alerts within your Amazon Managed Grafana workspace; blog post here. For CloudWatch, you can either create alerts within CloudWatch or, to centralize, export your CloudWatch metrics to Amazon Managed Service for Prometheus using Metric Streams or the CloudWatch exporter. This is not ideal, but can be considered in the interim, and it better facilitates IaC use cases. Let me know if the driver for alerts in Grafana is a preference for the interactive alert rule creation experience.
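For anyone weighing that route, here is a minimal sketch (assumed names only, not an official example) of pushing an alert rule group to an AMP workspace with boto3. The workspace ID, namespace name, metric, and threshold are placeholders; the rule is written per host, so a single rule fans out into one alert instance per host and is evaluated by the AMP ruler rather than counted against the Grafana workspace quota:

```python
# Hypothetical sketch: define Prometheus alert rules in AMP instead of Grafana.
# The workspace ID, namespace, metric name, and threshold are placeholders.
import boto3

RULES_YAML = b"""
groups:
  - name: host-alerts
    rules:
      - alert: HostHighErrorRate
        # One multidimensional rule: the AMP ruler evaluates one alert
        # instance per distinct "host" label value.
        expr: sum by (host) (rate(app_errors_total[5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.host }}"
"""

amp = boto3.client("amp", region_name="us-east-1")  # region is an assumption

# Rule group namespaces are evaluated by the AMP ruler, so they are not
# subject to the Amazon Managed Grafana alert quota.
amp.create_rule_groups_namespace(
    workspaceId="ws-EXAMPLE",  # placeholder AMP workspace ID
    name="host-alerts",
    data=RULES_YAML,
)
```

The resulting firing alerts can then be browsed from Grafana by adding the AMP workspace's Alertmanager as a data source, as discussed further down the thread.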
Having a hard limit of 100 is seriously limiting. If you have a handful of alerts per service and several environments, as soon as you have more than about 10 services you're done. I guess you could opt for a multi-workspace setup but this massively increases costs if you have a relatively small number of developers who span multiple services - needing a login for each workspace.
Not sure I like the idea of using AMP to offset this limitation. AMP alerts aren't free and Grafana provides a better user experience for more people which is quite crucial when it comes to observability.
Is there an ETA on when this will be either addressed or rejected?
My suggestion would be to set a default limit of 100 with a hard limit of 1000. That way you're still encouraging people to make at least some effort in alert efficiency without becoming a blocker.
@VermaPriyanka Thanks for the suggestion of using Managed Prometheus for alert rules. It seems like you are suggesting defining the alert rules in AMP and then visualizing them in Grafana.
This seems like a reasonable workaround to the 100 limit. After reading the linked article https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-Ruler.html, I can set up the Alertmanager data source and see the Prometheus alerts in Grafana. But what is unclear is what Grafana does with that alert data. Does the Grafana alertmanager then try to send out its own alerts, resulting in duplicate alerts being sent? If so, how do we prevent this from happening?
@twellspring Grafana only provides the visualization of rules and firing alerts; all the processing is handled by the AMP ruler and alertmanager, so you don't get the duplicate alerts you would see with Grafana-managed alerts.
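To make that split concrete, here is a rough sketch (assuming the AMP alertmanager's SNS receiver; the workspace ID, topic ARN, and region are placeholders) of pushing an Alertmanager definition to the AMP workspace, so routing and notification happen entirely on the AMP side while Grafana only reads the results through its Alertmanager data source:

```python
# Hypothetical sketch: let the AMP alertmanager handle routing and notification
# so Grafana only visualizes alerts and nothing is sent twice.
# Workspace ID, SNS topic ARN, and region below are placeholders.
import boto3

ALERTMANAGER_YAML = b"""
alertmanager_config: |
  route:
    receiver: default
  receivers:
    - name: default
      sns_configs:
        - topic_arn: arn:aws:sns:us-east-1:111122223333:alerts-topic
          sigv4:
            region: us-east-1
"""

amp = boto3.client("amp", region_name="us-east-1")  # region is an assumption

amp.create_alert_manager_definition(
    workspaceId="ws-EXAMPLE",  # placeholder AMP workspace ID
    data=ALERTMANAGER_YAML,
)
```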
@VermaPriyanka Are you able to provide us with an update on this request at all? I'd be happy to share some more on our use case if that's helpful?
@tb00-cloud We are looking into the limit increase request. However, AMP alerts are not charged per alert, but for the queries you make to AMP. If the same alerts are defined from Grafana for data sources such as AMP and CloudWatch, you would actually see higher query costs due to the way Grafana's HA is implemented, i.e. it does not deduplicate query evaluations. Read here for more details. That said, we understand that Grafana provides a friendlier interface to manage these alerts and addresses alerting needs for many other data sources. For those use cases, we are looking at safely increasing Grafana alert limits so that a high number of alerts does not degrade your visualization experience.
is anyone looking into this?
https://docs.aws.amazon.com/grafana/latest/userguide/AMG_quotas.html