aws / amazon-managed-grafana-roadmap

Amazon Managed Grafana Roadmap

Notification deduplication for Unified Alerting #47

Closed justinbwood closed 1 month ago

justinbwood commented 1 year ago

Per the AWS Managed Grafana docs on migrating classic alerts to Grafana alerting, multiple notifications are sent when using Grafana-managed alerts.

I would like to see Grafana's high availability alerting enabled so that notifications are properly deduplicated, as it's a bit frustrating to receive Slack notifications in triplicate when using Unified Alerting.

Thanks!

atze234 commented 1 year ago

I would also like to see this. Getting three messages per alert is really annoying... I filed an issue over at Grafana, but it seems like there's something wrong with the Amazon Managed Grafana config.

https://github.com/grafana/grafana/issues/68652

bradlet commented 1 year ago

I've also been running into this issue. I opened a support ticket with AWS, and the response basically restated the doc linked earlier in this thread. It seems like really bad UX to spam out alerts like this... I'd be interested to hear what workarounds others have used; I'm in the process of migrating to an external alert manager, Prometheus Alertmanager, instead. It would be nice to be able to provision the alert rules in Grafana though!

atze234 commented 1 year ago

As a workaround, I'm using a DynamoDB table and a message hash in my Lambda that parses the SNS messages, like here:

https://gist.github.com/atze234/60dbef2991e08aba93b875c73578cf41

I also set this in delivery_policy so that there is enough time to write to the DB.

    "defaultThrottlePolicy": {
      "maxReceivesPerSecond": 1
    },
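
The hash-based deduplication described above could look roughly like the sketch below. This is not the code from the linked gist. It assumes that the duplicate notifications arrive with byte-identical SNS `Message` bodies, and it uses a hypothetical in-memory `SeenStore` class as a stand-in for the DynamoDB table. In a real Lambda the `put_if_absent` step would have to be a conditional write against an external table, since separate invocations do not reliably share memory.

```python
import hashlib
import json


class SeenStore:
    """In-memory stand-in for the DynamoDB table (hypothetical helper)."""

    def __init__(self):
        self._hashes = set()

    def put_if_absent(self, digest):
        """Record the hash; return True only the first time it is seen."""
        if digest in self._hashes:
            return False
        self._hashes.add(digest)
        return True


def message_hash(message):
    """SHA-256 hex digest of the raw SNS message body."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


def forward_once(event, store, send):
    """Call send() once per distinct SNS message in a Lambda event.

    Returns the number of messages actually forwarded.
    """
    forwarded = 0
    for record in event.get("Records", []):
        body = record["Sns"]["Message"]
        if store.put_if_absent(message_hash(body)):
            send(body)
            forwarded += 1
    return forwarded


# Simulate three separate deliveries of the same alert (payload is made up).
alert = json.dumps({"title": "[FIRING:1] HighCPU", "state": "alerting"})
store = SeenStore()
sent = []
for _ in range(3):
    event = {"Records": [{"Sns": {"Message": alert}}]}
    forward_once(event, store, sent.append)
```

After the loop, `sent` holds a single copy of the alert. If the three notifications differ in any byte (for example, an embedded timestamp), the hash would need to be computed over a stable subset of fields instead.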

RphCos commented 1 year ago

This really is needed, since the "Classic" alerting is supposedly going away soon. It makes using Slack or PagerDuty impossible when monitoring large workloads, especially since classic alerts do not allow for template variables.

brc commented 1 year ago

+1

chr2che commented 1 year ago

Is there any ETA for this, please?

andrzej-mega commented 1 year ago

Spoke to the AWS team about this today. They gave an "estimate" of Q1 2024, with the possibility that it might be as late as Q3 2024. According to them, it's not a high-priority issue and there are other issues they need to work on first.

My biggest issue is that alerting is advertised as a feature of the managed Grafana service.

I guess paying customers don't get a working feature until AWS deems it worth fixing...

kevdonde commented 1 year ago

We are also experiencing this issue. This is a primary feature of the service, and it is extremely disappointing that Amazon doesn't prioritize primary features of its products. We waited 1.5 years for Amazon to make 9.4 available in AMG so that we could use the alerting that is part of 9.4. Alerting is the only feature of 9.4 that we needed. It was, and is, the biggest reason to upgrade to 9.4. Now we might further delay upgrading until as late as Q3 2024, making it more than 2.5 years.

The purpose of the above rant is to add my vote to the priority of this issue.

webertrlz commented 8 months ago

+1

michael-ortiz commented 8 months ago

@VermaPriyanka do we have any updates on this and when should we expect a fix? This is really important to us!

amorphic commented 8 months ago

FYI @VermaPriyanka this is a showstopper for us. We considered various solutions for providing an observability service to our engineering teams and settled on Managed Grafana expecting it to Just Work. Now after a significant investment of resources to get set up and put processes in place, we've hit this bug which renders the service unfit for use. Alerting is core functionality and we cannot expect other teams to accept all of their alerts appearing 3x in Slack!

We would really appreciate a fix for this ASAP or at the very least an ETA on a fix and a standard workaround until the fix arrives.

sukoneck commented 8 months ago

A workaround while we're waiting: https://github.com/flashbots/prometheus-sns-lambda-slack

VermaPriyanka commented 8 months ago

Thank you all for the patience and for sharing workarounds. We understand that this is an important issue to solve and are working towards the same.

avpjanm commented 8 months ago

+1

magnowest commented 6 months ago

AWS released Grafana 10.4 yesterday, and it's still an issue.

Strangely, this was their response to the alerting in HA issue.

https://docs.aws.amazon.com/grafana/latest/userguide/v10-alerting-explore-high-availability.html

lorelei-rupp-imprivata commented 6 months ago

> AWS released Grafana 10.4 yesterday, and it's still an issue.
>
> Strangely, this was their response to the alerting in HA issue.
>
> https://docs.aws.amazon.com/grafana/latest/userguide/v10-alerting-explore-high-availability.html

Yeah, this is the WORST bug. I am not even sure how they can release with this issue. It's been a year now, and we are still stuck on the old legacy alerts because of it. That documentation almost suggests they won't fix this and that it's working as they designed it.

VermaPriyanka commented 6 months ago

Thank you for voicing this concern. We are working towards a fix for the duplicate notifications issue in version 10. The description here explains the current workings of Grafana alerting, which implies rules are evaluated per HA instance. We are working towards solving this in 2 steps - focusing on solving the duplicate notifications first and to eliminate duplicate evaluations in the long term. We understand this has been a long wait, and are working towards releasing a fix soon.

ff-pjha commented 6 months ago

Facing the same issue. Do you have any workarounds for Slack?

Diondk commented 5 months ago

> Thank you for voicing this concern. We are working towards a fix for the duplicate notifications issue in version 10. The description here explains the current workings of Grafana alerting, which implies rules are evaluated per HA instance. We are working towards solving this in 2 steps - focusing on solving the duplicate notifications first and to eliminate duplicate evaluations in the long term. We understand this has been a long wait, and are working towards releasing a fix soon.

How soon can we get a fix for this? We are currently setting up alerting and it's a real pain to receive all alerts 3x...

ursuciprian commented 4 months ago

> Thank you for voicing this concern. We are working towards a fix for the duplicate notifications issue in version 10. The description here explains the current workings of Grafana alerting, which implies rules are evaluated per HA instance. We are working towards solving this in 2 steps - focusing on solving the duplicate notifications first and to eliminate duplicate evaluations in the long term. We understand this has been a long wait, and are working towards releasing a fix soon.

Any updates on this nasty "feature"?

bguruprasad commented 4 months ago

We are also facing the same issue and would really appreciate knowing how and when AWS will fix this. Do you have an ETA for the fix, @VermaPriyanka? When is it supposed to be released for Managed Grafana? I am currently on Grafana v10.4.1 and still see this issue on AWS Managed Grafana.

flashguerdon commented 3 months ago

Hi @VermaPriyanka, Any update on this issue?

VermaPriyanka commented 3 months ago

This is shipping soon on Managed Grafana v10.4 workspaces. Folks who have implemented workarounds to avoid the multiple notifications, do you see any concern as this fix is shipped - any breaking experiences or impact to your alerting flow?

kevdonde commented 3 months ago

Will you be patching 10.4 in place? Are you releasing a new minor patch to 10.4? The above statement is slightly confusing because 10.4 has already shipped.

ingMor commented 3 months ago

Hi @VermaPriyanka, we are about to implement such a workaround (detriplication on a FIFO-SQS-SNS basis or with Prometheus). If this feature is shipping soon, it might not be worth it. So can you specify the "soon" part of your post (and also answer @kevdonde's question regarding the versioning)? Thanks in advance.
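
For readers weighing the FIFO-based approach: an SQS FIFO queue does not deliver a message whose deduplication ID repeats within its 5-minute deduplication interval, and with content-based deduplication enabled that ID is a SHA-256 hash of the message body. The sketch below only models that behaviour locally so the effect on triplicated alerts can be reasoned about. `DedupWindow` is a hypothetical class, not an AWS API, and the sketch assumes the duplicate notifications have identical bodies.

```python
import hashlib

DEDUP_WINDOW_SECONDS = 300  # SQS FIFO deduplication interval (5 minutes)


class DedupWindow:
    """Local model of content-based deduplication (hypothetical helper)."""

    def __init__(self, window=DEDUP_WINDOW_SECONDS):
        self.window = window
        self._accepted_at = {}  # dedup id -> time the message was accepted

    def accept(self, body, now):
        """Return True if the message should be delivered at time `now`."""
        dedup_id = hashlib.sha256(body.encode("utf-8")).hexdigest()
        first_seen = self._accepted_at.get(dedup_id)
        if first_seen is not None and now - first_seen < self.window:
            return False  # duplicate inside the window: drop it
        self._accepted_at[dedup_id] = now
        return True


window = DedupWindow()
body = '{"alert": "HighCPU", "state": "alerting"}'  # made-up payload

# Three copies arrive within seconds; the same body arrives again later.
results = [window.accept(body, now=t) for t in (0, 1, 2, 400)]
# results == [True, False, False, True]
```

The last message passes because it falls outside the window, so a repeat notification for a long-running alert would still get through.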

VermaPriyanka commented 3 months ago

@kevdonde @ingMor It will be in place for all 10.4 workspaces - new, existing or upgraded. If you have additional/more specific questions, you can send them via mail to aws-grafana-feedback@amazon.com.

william-kurosawa commented 3 months ago

> This is shipping soon on Managed Grafana v10.4 workspaces. Folks who have implemented workarounds to avoid the multiple notifications, do you see any concern as this fix is shipped - any breaking experiences or impact to your alerting flow?

That's great to hear. We have been avoiding creating alerts in Managed Grafana and have been creating them in Prometheus or CloudWatch instead, but our goal is to centralize everything in Grafana.

Looking forward to the release!

Dragotic commented 3 months ago

@VermaPriyanka when is this fix coming? Currently, alerting is unusable because every alert is sent multiple times.

ghost commented 3 months ago

Hi @VermaPriyanka ,

Could you provide an update on when the fix will be released? Any ETA or additional details would be appreciated.

Thanks!

webertrlz commented 2 months ago

I'm also standing by for the workspace fix.

Krishnakumar-Santhanam commented 2 months ago

Hello @VermaPriyanka, please let us know when the fix will be released. Today we upgraded AMG from 9.4 to 10.4, but the notification duplication issue persists. We tried setting different group_interval and group_wait options, but no luck!

ahsanejazzz commented 2 months ago

Notifications are sent 3x and there is no support for email or MS Teams integration in contact points. The notification template does nothing. Thanks for "Amazon RUINED Grafana".

Diondk commented 2 months ago

> Notifications are sent 3x and there is no support for email or MS Teams integration in contact points. The notification template does nothing. Thanks for "Amazon RUINED Grafana".

They didn't ruin anything. You can run Grafana on an EC2 instance and manage it yourself, and then you have all the options you want, but you are also responsible for maintenance. If you don't want that, you are bound to what Managed Grafana offers. All the options you just listed are marked in the documentation as currently not supported by AMG.

To get back on topic: hopefully this will be released soon so that we are able to deduplicate the messages.

Edit: BTW, you can get email notifications through SNS, but be aware that you get 3 emails for each alert until this issue is resolved.

1md3nd-impressico commented 2 months ago

+1 Real pain

Inquisitive1a commented 2 months ago

@VermaPriyanka: We are using Grafana OSS v11.1.3. After upgrading from v11.0.0 we are facing this triple-firing issue. When can we expect a solution?

VermaPriyanka commented 2 months ago

@Inquisitive1a This is a public roadmap for Amazon Managed Grafana. I'm unsure if you are using self-managed Grafana or Grafana Cloud.

VermaPriyanka commented 2 months ago

Thank you all for being patient for this update. Notifications have been sent to existing customers using Grafana alerts in Amazon Managed Grafana, about the release of an update that will prevent multiple notifications. Will share an update here, once this is available on all Amazon Managed Grafana v10.4 workspaces.

cageyv commented 2 months ago

Thanks for the update. We are getting these notifications in many projects. Customers will be happy. Let's expect 1-2 weeks for everyone to get this update :) Also check your operational contact or default contact email; you will find the exact update date there.

jcquiles commented 2 months ago

According to our AWS Health Dashboard notifications, it looks like Sept 14th is the date to expect the patch that eliminates multiple alerts. You must have a v10.4 workspace to get the update.

> Starting September 14, 2024, we will release an update that prevents multiple alert notifications sent to your alert destinations/contact points [2], from Grafana managed alert rules.
>
> This update is only available for Amazon Managed Grafana version 10.4 workspaces. If you are running Grafana version 8.4 or 9.4, you must upgrade your workspace to Grafana version 10.4 to receive this update.

VermaPriyanka commented 2 months ago

Thanks all for sharing here. Would like to clarify, the release starts on 9/14, so it may be a couple of days from then for you to see the effect in your workspace, depending on which region you are in.

jcquiles commented 2 months ago

That's good to know! Thanks @VermaPriyanka

Diondk commented 2 months ago

I can confirm that the fix is working and I am now getting only 1 notification per server.

1md3nd-impressico commented 2 months ago

@Diondk I am still getting 3 notifications. Do I have to make any changes to apply the patch?

Diondk commented 2 months ago

> @Diondk I am still getting 3 notifications. Do I have to make any changes to apply the patch?

No, that's not needed, but please note that the release started on 9/14. It could be a couple of days before you see the effect in your workspace.

1md3nd-impressico commented 2 months ago

@Diondk Which region are you using for your workspace? My workspace is in us-east-1 and there is currently no fix for it.

Dragotic commented 2 months ago

We are in eu-central-1 and still no fix either.

1md3nd-impressico commented 2 months ago

@VermaPriyanka Can you please confirm in which regions it is deployed?

Diondk commented 2 months ago

> @Diondk Which region are you using for your workspace? My workspace is in us-east-1 and there is currently no fix for it.

We are in eu-west-1. There was also nothing for me to apply myself; it was fixed when I came into the office on Monday.

Krishnakumar-Santhanam commented 2 months ago

@VermaPriyanka We are yet to receive any fix, we are on version 10.4 and running AMG in eu-west-1, but alerts are still getting triggered thrice! Any ETA on when other users will receive the fix?

VermaPriyanka commented 2 months ago

Thank you all for your patience. We understand the anxiety at this time and would like to inform you that the update has been released for all new Amazon Managed Grafana version 10 workspaces, and the release for existing version 10 workspaces is in progress. We expect the update to be available worldwide by next week. No action is required from customers for this update. An advance notification stating that the release starts on 9/14 was sent out to inform customers about the upcoming change in alert notification behavior.