Amazon Managed Grafana Roadmap

Increase Alert Limits Per Workspace #21

Open evairmarinho opened 2 years ago

evairmarinho commented 2 years ago

https://docs.aws.amazon.com/grafana/latest/userguide/AMG_quotas.html

mengdic commented 1 year ago

Hi @evairmarinho, roughly how many alerts would you like to have in your workspace?

evairmarinho commented 1 year ago
  1. I would like to use a single workspace to centralize all my alerts.
tashcraft36 commented 1 year ago

Is this really going to happen, and when? I'm also in need of 400+ alerts in one workspace.

What is the reason for the limit of 100? Is it an arbitrary limit or a limitation on the rules engine?

zrzzxw commented 1 year ago

We also need the total alert threshold extended; a cap of 100 alerts is far too small.

LuigiClemente-Awin commented 1 year ago

This limit is too low. Alerting does affect performance and resources, and that is a reason to scale; but scaling is also the reason people would choose managed Grafana over self-hosted Grafana in the first place.

gavin-jeong commented 1 year ago

Any updates? We are also using a single workspace for all of our workloads.

brettdh commented 1 year ago

We are using Grafana to monitor 148 hosts on 19 different metrics, each of which has a multidimensional alerting rule. That makes 2,812 alert instances, and we expect the number to increase over time.

@LuigiClemente-Awin is 100% right on here. We would much rather leave the scaling of Grafana to a managed service, but the limit of 100 alerts makes it a complete non-starter.

brettdh commented 1 year ago

This suggestion from the docs is somewhat laughable:

You can lower the number of alert instances by removing alert rules, or by editing multidimensional alerts to have fewer alert instances (for example, by having one alert on errors per VM, rather than one alert on error per API in a VM).

Even with the 50-60 VMs example load mentioned earlier in the doc, following this suggestion means that half of the workspace's alerting capacity is used up. Reduce the alerting granularity further, and the alert becomes: "One of your 50 VMs is having errors; good luck figuring out which one!"

cdecoux commented 11 months ago

Quoting @brettdh:

This suggestion from the docs is somewhat laughable:

You can lower the number of alert instances by removing alert rules, or by editing multidimensional alerts to have fewer alert instances (for example, by having one alert on errors per VM, rather than one alert on error per API in a VM).

Even with the 50-60 VMs example load mentioned earlier in the doc, following this suggestion means that half of the workspace's alerting capacity is used up. Reduce the alerting granularity further, and the alert becomes: "One of your 50 VMs is having errors; good luck figuring out which one!"

I feel like this is a case where you can reduce granularity for the alert and link to a dashboard that lets you drill into specifics; you should have that dashboard already if you were considering individual alerts in the first place. Reducing cardinality is called out in Grafana's documentation on alerting performance considerations. That said, grouping isn't ideal if the VMs are completely unrelated, so maybe 50-60 VMs would turn into 3-6 grouped alerts rather than one.

The non-adjustable 100-rule quota is still very disappointing. One can reasonably optimize to stay under it, but with enough different metrics that cap will eventually be reached.
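
To make the grouping trade-off above concrete, here is a rough Python sketch of how label grouping changes the alert-instance count. The metric, label names, and counts are all made up for illustration, loosely following the 50-60 VM example from the docs:

```python
# Hypothetical example: how label grouping changes the number of alert instances.
# The metric (http_errors_total) and labels (vm, api, service) are made up.

VMS = 55          # roughly the 50-60 VM example from the AMG docs
APIS_PER_VM = 8   # assumed number of APIs exposed per VM
SERVICES = 5      # assumed number of logical services the VMs belong to

# One multidimensional rule alerting per VM *and* per API, e.g.
#   sum by (vm, api) (rate(http_errors_total[5m])) > 0.1
per_vm_per_api = VMS * APIS_PER_VM   # 440 alert instances

# The docs' suggestion: the same rule grouped per VM only, e.g.
#   sum by (vm) (rate(http_errors_total[5m])) > 0.1
per_vm = VMS                         # 55 alert instances

# Grouped per service, with a dashboard for drill-down, e.g.
#   sum by (service) (rate(http_errors_total[5m])) > 0.1
per_service = SERVICES               # 5 alert instances

print(per_vm_per_api, per_vm, per_service)
```

Each coarser grouping trades alert-instance count against how much the notification itself can tell you, which is exactly the trade-off being debated above.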

petrisorciprian-vitals commented 10 months ago

Bumping this thread.

Would it be possible to get an official response regarding whether this limit can be increased at all (either directly by AWS or by asking for a quota increase)?

The 100-alarm limit is simply too low for any serious, even mildly complex system. Bear in mind it's not just VMs that people monitor with Grafana, but all kinds of things such as SQS queues, EventBridge events, Lambda invocations, and so on.

Staying within the 100 alarm limit becomes nigh impossible when you want to track so many moving parts.

VermaPriyanka commented 10 months ago

We understand the ask and totally get the disappointment with an alert limit of 100. Can folks help answer a few questions?

  1. What are the different data sources you are alerting on from Amazon Managed Grafana?
  2. On average, how many alert instances do you expect from the different data sources?
  3. Do you create single-data-source alerts in Grafana, or multi-data-source alerts, i.e. a single alert rule querying two or more data sources such as Amazon CloudWatch, Amazon Managed Service for Prometheus, or Amazon OpenSearch?
  4. What are the key drivers for creating these alert rules in Grafana instead of creating them in source tools such as Amazon CloudWatch, Amazon Managed Service for Prometheus, or Amazon OpenSearch?
brettdh commented 10 months ago
  1. AWS Timestream
  2. See https://github.com/aws/amazon-managed-grafana-roadmap/issues/21#issuecomment-1700835507
  3. See https://github.com/aws/amazon-managed-grafana-roadmap/issues/21#issuecomment-1700835507
  4. Grafana is the de facto standard for time-series metrics visualization and alerting, and unlike any of the services you mentioned, its Timestream integration is highlighted in the Timestream documentation on integrations with other services.
    • I'm aware that CloudWatch also supports monitoring Timestream, but AFAICT, that's only for metrics about Timestream databases or tables that are already present in CloudWatch, not about monitoring the results of Timestream queries on timeseries data in those databases/tables.
    • In any case, we've built up many queries over lots of time and experimentation, and we don't have a reason to switch at this time.
chris13524 commented 10 months ago
  1. Managed Prometheus and CloudWatch
  2. Hundreds from Prometheus
  3. Each alert typically queries an individual datasource, either Prometheus or CloudWatch. We would have hundreds of charts spanning different metrics and services and environments, with each chart/metric having 1 alert.
  4. Grafana is the de facto standard, and I'm not aware of any way to alert on Prometheus metrics via CloudWatch.
ademartini-czi commented 10 months ago

We also need to go beyond the 100 alert limit that is currently in place.

VermaPriyanka commented 10 months ago

Thanks for sharing the use cases! @chris13524 Managed Prometheus provides a more scalable way of creating alert rules; refer to the docs here. In addition, you can visualize these alerts within your Amazon Managed Grafana workspace; see the blog post here.

For CloudWatch, you can either create alerts within CloudWatch or, to centralize, export your CloudWatch metrics to Amazon Managed Service for Prometheus using Metric Streams or the CloudWatch exporter. This is not ideal, but it can be considered in the interim, and it also better facilitates IaC use cases. Let me know if the driver for alerts in Grafana is a preference for an interactive alert rule creation experience.
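
For anyone exploring that route, here is a minimal sketch of defining alert rules in AMP with boto3; the workspace ID, namespace name, metric, and threshold are placeholders, and error handling is omitted:

```python
# Minimal sketch: define Prometheus-style alert rules in Amazon Managed Service
# for Prometheus (AMP) instead of as Grafana-managed alerts. The workspace ID,
# rule name, metric, and threshold below are placeholders for illustration.
import boto3

RULES_YAML = b"""
groups:
  - name: example-service-alerts
    rules:
      - alert: HighErrorRate
        expr: sum by (service) (rate(http_errors_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above threshold for {{ $labels.service }}"
"""

amp = boto3.client("amp", region_name="us-east-1")

# Rules are uploaded as a rule groups namespace; AMP's ruler evaluates them,
# so they do not count against the Grafana workspace alert quota.
amp.create_rule_groups_namespace(
    workspaceId="ws-12345678-abcd-1234-abcd-123456789012",  # placeholder
    name="example-alert-rules",
    data=RULES_YAML,
)
```

The resulting alert state can then be surfaced in the Amazon Managed Grafana workspace through an Alertmanager data source, as described in the blog post referenced above.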

tb00-cloud commented 9 months ago

Having a hard limit of 100 is seriously limiting. If you have a handful of alerts per service across several environments (say, 3 alerts × 3 environments per service), then as soon as you have more than about 10 services you're done. I guess you could opt for a multi-workspace setup, but that massively increases costs if you have a relatively small number of developers who span multiple services, since each of them needs a login for every workspace.

I'm not sure I like the idea of using AMP to offset this limitation. AMP alerts aren't free, and Grafana provides a better user experience for more people, which is quite crucial when it comes to observability.

Is there an ETA on when this will be either addressed or rejected?

My suggestion would be to set a default limit of 100 with a hard limit of 1000. That way you still encourage people to make at least some effort toward alert efficiency, without the limit becoming a blocker.

twellspring commented 6 months ago

@VermaPriyanka Thanks for the suggestion of using Managed Prometheus for alert rules. It seems like you are suggesting:

  1. Disable Grafana Alerts
  2. Create alerts in Prometheus Alert Manager
  3. Create an alertmanager datasource so the Prometheus Alerts can be viewed in Grafana

This seems like a reasonable workaround for the 100-alert limit. After reading the linked article https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-Ruler.html, I can set up the Alertmanager data source and see the Prometheus alerts in Grafana. But what is unclear is what Grafana does with that alert data. Does Grafana's alertmanager then try to send out its own notifications, resulting in duplicate alerts being sent? If so, how do we prevent that from happening?

VermaPriyanka commented 6 months ago

@twellspring Grafana only provides the visualization of rules and firing alerts; all the processing is handled by the AMP ruler and alertmanager. Grafana does not send its own notifications for these, the way it does for Grafana-managed alerts, so you will not get duplicate alerts.
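
To make that split concrete, here is a rough boto3 sketch of where the notification configuration lives in this setup; the workspace ID and SNS topic are placeholders, and the exact receiver schema should be checked against the AMP Alertmanager docs:

```python
# Minimal sketch: in this setup the Alertmanager configuration lives in AMP,
# not in Grafana. The workspace ID and SNS topic ARN are placeholders.
import boto3

ALERTMANAGER_YAML = b"""
alertmanager_config: |
  route:
    receiver: default-sns
  receivers:
    - name: default-sns
      sns_configs:
        - topic_arn: arn:aws:sns:us-east-1:123456789012:alerts
          sigv4:
            region: us-east-1
"""

amp = boto3.client("amp", region_name="us-east-1")

# AMP's alertmanager handles grouping, routing, and notification delivery.
# The Grafana Alertmanager data source only reads and displays this state,
# so no second set of notifications is sent from the Grafana side.
amp.create_alert_manager_definition(
    workspaceId="ws-12345678-abcd-1234-abcd-123456789012",  # placeholder
    data=ALERTMANAGER_YAML,
)
```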

tb00-cloud commented 5 months ago

@VermaPriyanka Are you able to provide us with an update on this request at all? I'd be happy to share more about our use case if that would be helpful.

VermaPriyanka commented 5 months ago

@tb00-cloud We are looking into the limit increase request. However, AMP alerts are not charged per alert, but for the queries you make to AMP. If the same alerts are defined from Grafana against data sources such as AMP or CloudWatch, you would actually see higher query costs due to the way Grafana's HA is implemented, i.e. it does not deduplicate query evaluations. Read here for more details.

That said, we understand that Grafana provides a friendlier interface to manage these alerts and addresses alerting needs for many other data sources. For those use cases, we are looking at safely increasing the Grafana alert limits, so that a high number of alerts does not degrade your visualization experience.
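
As a rough back-of-the-envelope illustration of that point (all numbers are assumptions, and actual AMP charges depend on the samples each query processes rather than a flat per-query price):

```python
# Back-of-the-envelope sketch of why Grafana-managed alerts against AMP can cost
# more than AMP-native rules: with Grafana HA, each replica evaluates every rule,
# and those evaluations are not deduplicated. All numbers are assumptions.

alert_rules = 400             # e.g. the 400+ rules mentioned earlier in the thread
evaluations_per_hour = 60     # assumed 1-minute evaluation interval
grafana_ha_replicas = 2       # assumed number of HA replicas

grafana_evals_per_day = alert_rules * evaluations_per_hour * 24 * grafana_ha_replicas
amp_ruler_evals_per_day = alert_rules * evaluations_per_hour * 24  # evaluated once

print(grafana_evals_per_day)    # -> 1152000 query evaluations per day
print(amp_ruler_evals_per_day)  # ->  576000 query evaluations per day
```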