Azure / Enterprise-Scale

The Azure Landing Zones (Enterprise-Scale) architecture provides prescriptive guidance coupled with Azure best practices, and it follows design principles across the critical design areas for organizations to define their Azure architecture
https://aka.ms/alz
MIT License

Feature Request - Merge in policy to create service health incident alerts #694

Closed mrajess closed 1 year ago

mrajess commented 3 years ago

I wrote this Policy: https://github.com/Azure/Community-Policy/blob/master/Policies/Monitoring/deploy-service-health-alert-incidents/azurepolicy.json

It deploys an Action Group, an Activity Log Alert for Service Health incidents, and a Resource Group to contain them. I'd like to pull this Policy into the ESLZ, but am unsure of the following:

Thoughts?

jtracey93 commented 3 years ago

I think this is a great idea and something we should definitely have in ESLZ.

Creating its own resource group is fine in my opinion, as long as it follows our naming for things like the ASC export policy.

As for where it goes, I think it is just another standalone definition in the policies.json file.

Other points:

But great idea and look forward to seeing a contribution soon. Shout if you need any assistance 👍

mrajess commented 3 years ago

Thanks for the responses, Jack! I'll take a look at the ASC export policy and if I have any questions on naming I'll circle back. To your other points:

It's creating an Action Group, so you could technically accept an array of emails. I'd just have to modify the ARM template that's deployed to use a copy loop, as each email is a separate notification/object.
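As a rough illustration of what such a copy loop would produce, here is a minimal Python sketch that expands a list of addresses into one receiver object per email. The property names (`emailReceivers`, `emailAddress`, `useCommonAlertSchema`, `groupShortName`) follow the `Microsoft.Insights/actionGroups` resource schema as I understand it; the function names, the receiver naming scheme, and the example addresses are hypothetical, and `apiVersion` is omitted on purpose.

```python
import json


def build_email_receivers(emails):
    """Return one Action Group email receiver object per address."""
    receivers = []
    for index, address in enumerate(emails):
        receivers.append({
            "name": f"email-receiver-{index}",  # hypothetical naming scheme
            "emailAddress": address,
            "useCommonAlertSchema": True,
        })
    return receivers


def build_action_group(name, short_name, emails):
    """Build an Action Group resource body (sketch only, not the Policy's actual template)."""
    return {
        "type": "Microsoft.Insights/actionGroups",
        "name": name,
        "location": "Global",
        "properties": {
            "groupShortName": short_name,
            "enabled": True,
            "emailReceivers": build_email_receivers(emails),
        },
    }


if __name__ == "__main__":
    group = build_action_group(
        "ag-service-health",  # hypothetical resource name
        "svchealth",
        ["ops@example.com", "oncall@example.com"],
    )
    print(json.dumps(group, indent=2))
```

In the ARM template itself, the equivalent would be a property-level `copy` over an array parameter rather than a Python loop.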

Right now it just enables incidents for all regions and services on the subscription. What were your thoughts for overrides?

krnese commented 3 years ago

What's the intent of the alert and who will receive it?

jtracey93 commented 3 years ago

@mrajess Thanks! I think the ability to filter Azure Regions and/or Azure Services for each alert on assignment might be a nice idea. For example, a customer might know they only ever deploy in North Europe & West Europe (controlled by Azure Policy of course 😎), so they may not want alerts for other regions, as it could be seen as "noise". The same goes for services.
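A minimal sketch of how optional region and service overrides could shape the Activity Log Alert condition, written in Python for brevity. The field paths and the `containsAny` operator mirror what portal-exported Service Health alert templates typically contain, but treat them as assumptions to verify; the function name and parameters are hypothetical.

```python
def build_service_health_condition(regions=None, services=None):
    """Build an Activity Log Alert condition for Service Health incidents.

    With no overrides, the alert matches incidents for all regions and services.
    """
    all_of = [
        {"field": "category", "equals": "ServiceHealth"},
        {"field": "properties.incidentType", "equals": "Incident"},
    ]
    if services:
        # Assumed field path for filtering by impacted service name.
        all_of.append({
            "field": "properties.impactedServices[*].ServiceName",
            "containsAny": list(services),
        })
    if regions:
        # Assumed field path for filtering by impacted region name.
        all_of.append({
            "field": "properties.impactedServices[*].ImpactedRegions[*].RegionName",
            "containsAny": list(regions),
        })
    return {"allOf": all_of}


if __name__ == "__main__":
    print(build_service_health_condition(regions=["North Europe", "West Europe"]))
```

Leaving both parameters unset keeps the current behavior of alerting on every region and service in the subscription.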

@krnese This is for configuring Service Health alerts and incidents on all subscriptions, which is something everyone should do to ensure they are notified ASAP if any issues arise etc. - This is a great value add to ESLZ IMHO 👍

mrajess commented 3 years ago

Jack, the behavior for service alerts is that you'll only receive notifications for incidents affecting your resources. We don't emit service alerts to subscriptions unimpacted by a specific incident. I think we'd be better off leaving the alert to target all regions, as it won't generate additional noise and it reduces the chance of missing an alert due to misconfiguration. Thoughts?

jtracey93 commented 3 years ago

Great point, and with that I agree. And if a customer deploys their own, then this policy won't take effect anyway 👍

Feel free to start drafting a PR for a policy following our general policy structure as you've seen. Also, I would say we should include all of the Service Health Incident Types as per: https://docs.microsoft.com/en-us/azure/service-health/service-health-notifications-properties
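To illustrate covering several incident types in one alert, here is a small Python sketch that builds an `anyOf` clause over `properties.incidentType`. The list of type values is an assumption based on my reading of the linked page and should be checked against it; the function and constant names are hypothetical.

```python
# Assumed incident type values; verify against the linked documentation.
INCIDENT_TYPES = ("Incident", "Maintenance", "Informational", "ActionRequired", "Security")


def build_incident_type_condition(incident_types=INCIDENT_TYPES):
    """Match Service Health events whose incidentType is any of the given values."""
    return {
        "allOf": [
            {"field": "category", "equals": "ServiceHealth"},
            {
                "anyOf": [
                    {"field": "properties.incidentType", "equals": incident_type}
                    for incident_type in incident_types
                ]
            },
        ]
    }


if __name__ == "__main__":
    print(build_incident_type_condition())
```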

Also make the email addresses an array 👍

I also see a bit of thinking we need to do here as to where the policy gets assigned and whether we need to provide any additional guidance. For example, if we assign it to the Intermediate Root Management Group, that's great for coverage. However, who should get the emails, and will some people potentially get emails for services/regions that don't impact them?

Have a think, keen to hear suggestions on this whilst I also mull it over.

daltondhcp commented 3 years ago

I agree with @jtracey93 that it would add value as long as we provide guidance on where it can/should be assigned to ensure that the alert will be received by the right team/individual. For landing zones, that likely means it has to be assigned at the subscription level.

jtracey93 commented 3 years ago

@mrajess Any updates on this based on our discussion above?

jtracey93 commented 1 year ago

Trigger ADO Sync 1

jtracey93 commented 1 year ago

Trigger ADO Sync 2

jtracey93 commented 1 year ago

@paulgrimley can we transfer this issue to the ALZ Monitor repo? I will move it now, as I think it should be there since you are addressing it there (update: I can't move it until it's public 👍)

paulgrimley commented 1 year ago

@jtracey93 let's leave it here, as we plan to move this over in the future. I have spoken with @mrajess and am also working with the SH PG on aligning. We have an item in our backlog to create a SH initiative that customers can use independently as part of our Baseline framework, currently in https://aka.ms/alz/monitor/repo (now public as of today).

paulgrimley commented 1 year ago

@mrajess closing this issue, as we now have a Service Health Initiative as part of https://aka.ms/alz/monitor/repo and plan to assess how to bring this into ALZ as part of initial deployments. We can consider this closed. @jtracey93 @jfaurskov