Keystone-Technologies / keystone-technologies.github.io


Get a handle on notifications and alerts. #51

Open codykniffen opened 8 years ago

codykniffen commented 8 years ago

Problem

Applications and appliances that send out notifications are great, but handling those notifications is often not. For example, in Veeam (a backup solution we currently use) we can configure notifications on success, warnings, and failures. Those notifications go to "backups@", "alerts@", or "support@" to be processed for ticket creation. Sometimes this creation is automated, sometimes it's manual.

An issue we've encountered recently is that if several alerts arrive at once from the same source, the email or ticketing system will flag them and they may not always get through.

There's also the problem of configuring them in the first place. Different alerts go to different places and it turns into a documentation nightmare.

Proposal

We need to come up with some sort of centralized management routing protocol or system for these notifications. Software is smart when configured properly. What if we sent all alerts / notifications to a centralized system that would then intelligently work out where they came from and summarize them into a usable digest, to be processed as one ticket per site?

Side note - despite @s1037989 not sensing the urgency in this post because I didn't use italics or underlining, it's there baby! This is an important issue that needs a well thought out solution. :)

CalebAlbers commented 8 years ago

I believe PagerDuty is exactly what you are looking for. I have looked at implementing it for Keystone; however, without direct integration with our RMM tool, it gets a bit convoluted.

As for the earlier points, we do have a standard that all email-based alerts should follow. The only reason there are exceptions is because we either have not gotten to the device to change settings or the device was recently set up without adhering to the standard.

Could you elaborate internally/through email on the ticketing system rejecting too many email alerts? That's a big concern for me if it is happening.

(Sent by email in reply to Cody Kniffen's post of Tue, Feb 16, 2016, 10:54 AM.)

s1037989 commented 8 years ago

This really does sound like an important issue. Notifications are critical to anyone, especially our business. Would you say that notifications are, in fact, the driving force behind our business? Imagine a world without notifications. Would we exist, as a business? Keep in mind, clients calling us up to notify us of a problem is still a notification... We aim to be proactive by not depending on our clients to alert us, by instead installing little tiny robots that will detect the same problems (and more) as their human counterparts...

So, with that being said, reliable and trustworthy and useful notifications are imperative to how we do business and do it effectively and efficiently.

So these devices you speak of... They have these "robots" built into them to send notifications of all varieties. Are there notifications that we should know about besides something being down? What about the device itself -- surely it can't send a notification itself when it is itself that is indeed down. How would you notify for these alerts?

How are the alerts capable of being delivered? You mentioned some email addresses. Is that the only notification medium that they support, or is that the only medium we leverage? Is email the best, or do we use that simply because it's the easiest to set up and configure? Do these devices support other mediums? If so, why? Might these be more "advanced" notification systems that are for those users beyond the "just need a damn notification" stage?

It sounds like sometimes notifications, for some unknown reason, aren't delivered -- something about too many causing none. I would like to know more about that. Do you have a deeper theory on this?

But other than that, it sounds like notifications are being delivered. Is there anything in particular that's insufficient about this? Do you need simply the alert, or could the concept of alerts be enhanced in some way? Are email notifications a novice's approach to notifications, with more sophisticated users relying on more sophisticated mechanisms? What would they be doing that we aren't / can't?

As for configuring... Even with central management routing of notifications, we still need to touch every device. How can we know what's been done and what's still to do?

Can we be sure that every device would support the central management mechanism? How?

You talk about intelligent processing... What do you have in mind there? What kind of intelligence could be / should be applied to these notifications? How would that help and who would it help?

codykniffen commented 8 years ago

> Could you elaborate internally/through email on the ticketing system rejecting too many email alerts? That's a big concern for me if it is happening.

Hoping we can get some feedback from @bennolen but apparently this is a thing. I'm not sure if it's the ticketing portion or the part where it comes in to Google Apps but it's my understanding that sometimes the alerts come in so rapid fire that they're being ignored / knocked down. Could be wrong.

bennolen commented 8 years ago

Couple points along the way to address:

Is email the best medium? Probably not, but it's what we have to work with. An installed agent that phones home every minute would be better. But appliances don't support this. SNMP is great and all, but unless it's a Cisco or HP device, most likely the SNMP definition for this device isn't included in KMS (or whatever else we find) requiring us to jump through a lot of hoops to make it work. And even at that, SNMP may not be supported by software we need to monitor. Email is the one thing that is supported by 95% of the appliances/devices we need to monitor.

One catch of monitoring by email that I've already mentioned before: we can monitor for partial failures via email (missed backup, failing hard drive, etc), but we can't monitor for total failures via email (backup server dead, failed array on NAS, etc) because the device is offline and can't send the email. We need the system to be programmable to send us an alert on a failed check-in. Again, this may not be possible for all devices, as it would require the device to send a success email daily and not all devices have a success email to send. What is the NAS going to email to the system that says it's still online?
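
A minimal sketch of the failed check-in idea, assuming every monitored device is expected to report in (by daily success email or any other means) within a known interval. The device names, the 26-hour allowance, and the `overdue_devices` helper are all hypothetical, made up for illustration.

```python
from datetime import datetime, timedelta

def overdue_devices(last_seen, expected_every, now):
    """Return the devices whose most recent check-in is older than allowed.

    last_seen: dict of device name -> datetime of the last check-in
               (a device missing from the dict has never checked in)
    expected_every: dict of device name -> timedelta between expected check-ins
    """
    overdue = []
    for device, interval in expected_every.items():
        seen = last_seen.get(device)
        if seen is None or now - seen > interval:
            overdue.append(device)
    return sorted(overdue)

now = datetime(2016, 2, 17, 9, 0)
last_seen = {
    "veeam-server": datetime(2016, 2, 17, 2, 0),  # nightly success email arrived
    "nas-01": datetime(2016, 2, 15, 2, 0),        # silent for two days
}
expected_every = {
    "veeam-server": timedelta(hours=26),
    "nas-01": timedelta(hours=26),
    "firewall": timedelta(hours=26),              # has never checked in
}
print(overdue_devices(last_seen, expected_every, now))
```

The alert here is raised by the central system noticing silence, so it still fires when the device itself is completely offline.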

The last point I had for now was to @codykniffen and @CalebAlbers. An example being, when Veeam has a failed backup, it's likely because the NAS fails. Right now, Veeam fires off about 25 emails in total with different alerts of the backups failing (backup failed, destination unavailable, etc, etc). Because of the influx of alerts coming in at once, they get caught in the Gmail spam filter for support@ and never make it in as a ticket. If we had an intermediary system catching and parsing the email, it would turn around and create one (critical?) ticket instead of no ticket, or possibly 25 different tickets.
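
As a rough sketch of what that intermediary could do with a burst, assuming the incoming emails have already been parsed into `(source, timestamp, subject)` tuples. The 10-minute window, the source label, and the `collapse_bursts` helper are illustrative assumptions, not anything we run today.

```python
from datetime import datetime, timedelta

def collapse_bursts(alerts, window=timedelta(minutes=10)):
    """Group alerts from the same source that arrive close together.

    alerts: iterable of (source, timestamp, subject) tuples.
    Returns a list of ticket dicts, one per burst.
    """
    tickets = []
    open_ticket = {}  # source -> ticket currently collecting alerts
    for source, ts, subject in sorted(alerts, key=lambda a: a[1]):
        ticket = open_ticket.get(source)
        # Start a new ticket if this source has none yet, or has been quiet
        # for longer than the window since its last alert.
        if ticket is None or ts - ticket["last"] > window:
            ticket = {"source": source, "first": ts, "last": ts, "subjects": []}
            tickets.append(ticket)
            open_ticket[source] = ticket
        ticket["last"] = ts
        ticket["subjects"].append(subject)
    return tickets

start = datetime(2016, 2, 16, 2, 0)
alerts = [("veeam@site-a", start + timedelta(seconds=i), f"Backup job {i} failed")
          for i in range(25)]
# A separate failure six hours later should open its own ticket.
alerts.append(("veeam@site-a", start + timedelta(hours=6), "Backup job 0 failed"))

tickets = collapse_bursts(alerts)
for t in tickets:
    print(t["source"], t["first"], len(t["subjects"]), "alert(s)")
```

With this grouping the 25 Veeam emails become one ticket carrying all 25 subjects, and the later failure becomes a second ticket.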

s1037989 commented 8 years ago

> Is email the best medium? Probably not, but it's what we have to work with. An installed agent that phones home every minute would be better. But appliances don't support this. SNMP is great and all, but unless it's a Cisco or HP device, most likely the SNMP definition for this device isn't included in KMS (or whatever else we find) requiring us to jump through a lot of hoops to make it work. And even at that, SNMP may not be supported by software we need to monitor.

SNMP Traps... that is a good idea to use. True, not fully supported, but where it is supported, it's probably the best way to trigger notifications because the MIB describes the issue. Rather than parsing for words like ALERT and CRITICAL, you check the standard MIB definition for the device for what it's trying to tell you.
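
To make the contrast concrete, here is a small sketch of looking a trap up by OID instead of scanning text for keywords. The five OIDs are the standard generic notifications (coldStart, warmStart and authenticationFailure from SNMPv2-MIB, linkDown and linkUp from IF-MIB). The severity attached to each one is my own assumption, since the MIBs do not assign severities, and `describe_trap` is a hypothetical helper. Receiving the trap off the wire is out of scope here.

```python
# Standard generic trap OIDs. The severity column is an assumption made for
# this sketch; it does not come from the MIB definitions.
STANDARD_TRAPS = {
    "1.3.6.1.6.3.1.1.5.1": ("coldStart", "info"),
    "1.3.6.1.6.3.1.1.5.2": ("warmStart", "info"),
    "1.3.6.1.6.3.1.1.5.3": ("linkDown", "critical"),
    "1.3.6.1.6.3.1.1.5.4": ("linkUp", "info"),
    "1.3.6.1.6.3.1.1.5.5": ("authenticationFailure", "warning"),
}

def describe_trap(trap_oid):
    """Look up a received trap OID; unknown OIDs are flagged for review."""
    # Some tools print OIDs with a leading dot, so strip it before the lookup.
    return STANDARD_TRAPS.get(trap_oid.lstrip("."), ("unknown", "needs-review"))

print(describe_trap(".1.3.6.1.6.3.1.1.5.3"))
print(describe_trap("1.3.6.1.4.1.99999.1"))
```

Vendor-specific traps would need their own entries loaded from the vendor's MIB, which is exactly the "jump through hoops" part mentioned above.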

I agree, email is not the best medium. It's the least sophisticated notification system. We use it because 1) it is most supported and 2) it's super easy to set up. So why do devices offer other notification systems such as SNMP and Syslog? Because they're more sophisticated.

Our business is notifications... we should step up our sophistication level from the most basic email to something more capable and manageable.

> Email is the one thing that is supported by 95% of the appliances/devices we need to monitor.

Syslog is another medium. What do you suppose is the percentage support of that protocol? Syslog is like a hybrid of email and SNMP in terms of sophistication. Syslog offers "facilities" and "levels". You might receive a syslog message on facility "san" with a level of "warning". That gives you a fair amount of sophistication to count on.
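
For reference, a short sketch of how the facility and level are carried on the wire: every syslog message starts with a `<PRI>` number equal to facility * 8 + severity. One caveat on my own example above: standard syslog has a fixed list of facilities and "san" is not among them, so an appliance would more likely use one of local0 through local7. The sample log line and the `decode_pri` helper are hypothetical.

```python
import re

# Conventional short names for the 24 facility codes and 8 severity codes.
FACILITIES = [
    "kern", "user", "mail", "daemon", "auth", "syslog", "lpr", "news",
    "uucp", "cron", "authpriv", "ftp", "ntp", "audit", "alert", "clock",
    "local0", "local1", "local2", "local3", "local4", "local5", "local6", "local7",
]
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

def decode_pri(line):
    """Return (facility, severity) from the <PRI> prefix of a syslog line."""
    match = re.match(r"<(\d{1,3})>", line)
    if match is None:
        raise ValueError("no <PRI> prefix found")
    facility, severity = divmod(int(match.group(1)), 8)
    if facility >= len(FACILITIES):
        raise ValueError("facility code out of range")
    return FACILITIES[facility], SEVERITIES[severity]

# 164 = 20 * 8 + 4, i.e. facility local4 with severity warning.
print(decode_pri("<164>Feb 16 02:00:01 nas-01 raid: volume degraded"))
```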

How can we get 100% notification support by all appliances/devices? By unifying all of the above notification mediums into one. With a central system, we can support n mediums. It could receive emails, Traps, syslogs, and anything else. The central system would collect and manage all alerts. It would allow a tech to configure the most sophisticated notification mechanism available on the appliance/device/application.
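
A sketch of what "unifying the mediums" might look like in code, assuming each medium gets a small adapter that converts its input into one common record. The `Notification` fields, the adapter names, and the keyword fallback for email are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Notification:
    medium: str    # "email", "syslog", or "snmp-trap"
    source: str    # device or sender that raised it
    severity: str
    message: str

def from_email(sender, subject):
    # Email carries no severity field, so fall back to a keyword guess.
    upper = subject.upper()
    severity = "critical" if "FAIL" in upper or "CRITICAL" in upper else "info"
    return Notification("email", sender, severity, subject)

def from_syslog(host, severity, text):
    # Syslog already supplies a severity level with the message.
    return Notification("syslog", host, severity, text)

def from_trap(agent, trap_name, severity):
    # For traps the severity would come from a per-OID lookup table.
    return Notification("snmp-trap", agent, severity, trap_name)

inbox = [
    from_email("veeam@site-a", "Backup job Nightly failed"),
    from_syslog("nas-01", "warning", "volume degraded"),
    from_trap("switch-02", "linkDown", "critical"),
]
for n in inbox:
    print(n)
```

Everything downstream (rules, digests, ticket creation) then only has to understand `Notification`, and supporting a new medium means adding one adapter.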

Along with the centralized repository of all notifications, you have some serious ability to make one sophisticated notification system to rule them all. You can strip out all the HTML and just grab the actual message, thereby allowing you to create a uniform messaging style across all devices. You can queue notifications for delivery, e.g. any notification with a level < "warn" should get queued up for daily digesting. You can also compare incoming notifications against previous ones. Was a notification for this issue just delivered to a new ticket? Stamp this notification with that ticket #. Ticket details can include a link to the originating notification such that any forthcoming notifications that get tagged with that ticket # are all shown at the same link.
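
The three behaviors in that paragraph (strip the HTML, queue low levels for the digest, stamp repeats with the existing ticket #) could be sketched roughly as follows. The level names and their ordering, the `Router` class, and matching repeats on an exact `(source, message)` pair are simplifying assumptions.

```python
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Collect only the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(body):
    parser = _TextOnly()
    parser.feed(body)
    parser.close()
    # Join the text nodes and collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())

LEVELS = {"debug": 0, "info": 1, "notice": 2, "warn": 3, "error": 4, "critical": 5}

class Router:
    def __init__(self):
        self.digest = []          # low-level items held for the daily digest
        self.open_tickets = {}    # (source, message) -> ticket number
        self.next_ticket = 1

    def handle(self, source, level, html_body):
        message = strip_html(html_body)
        if LEVELS[level] < LEVELS["warn"]:
            self.digest.append((source, message))
            return "digest"
        key = (source, message)
        if key not in self.open_tickets:
            # First time this issue is seen: open a new ticket.
            self.open_tickets[key] = self.next_ticket
            self.next_ticket += 1
        # Repeats of a known issue are stamped with the existing ticket number.
        return f"ticket #{self.open_tickets[key]}"

router = Router()
print(router.handle("nas-01", "error", "<p>Array <b>degraded</b></p>"))
print(router.handle("nas-01", "error", "<p>Array <b>degraded</b></p>"))
print(router.handle("nas-01", "info", "<p>Backup OK</p>"))
```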

> We need the system to be programmable to send us an alert on a failed check-in.

I believe that a central intelligent notification system can handle this in two ways: 1) using Artificial Intelligence and 2) through manual tagging. You could filter through devices and tag anything that you want to see every day and anything that you don't want to see. The central system would act accordingly.

(2) would be easier but more tedious. At least it's a set-it-and-forget-it thing. (1) is a more sophisticated approach to incorporate eventually, so that you could keep pointing more and newer devices at the system and it would eventually just work things out.


Picture all of this. Is this awesome? Or is this not really all that handy? What makes it not handy? What makes it not useful?

If it worked for us, would other companies around the world (MSP or not) be interested in such a system? Companies could create a Keystone Knotifications account and just point their devices at it. Email, SNMP, Syslog... It'll just collect them all and handle appropriately.

bennolen commented 8 years ago

> Picture all of this. Is this awesome? Or is this not really all that handy? What makes it not handy? What makes it not useful?

All of the above, but regardless necessary. Like you said, we are in the notification business. We need to be constantly improving our notifications system and process.

s1037989 commented 8 years ago

> All of the above, but regardless necessary. Like you said, we are in the notification business. We need to be constantly improving our notifications system and process.

:+1: Well said.

s1037989 commented 8 years ago

I feel like I (personally) have 3 requirements. Feel free to agree or disagree. Use them and abuse them.

  1. We need centralized notifications. All, and I mean all, notifications must centralize to a single target in which we can apply uniform rules, workflows, analysis, etc across the board. We need to be able to manage notifications for all clients, devices, applications, etc from a single point.
  2. We need to support all, and I mean all, systems of all types. If we support it, we must know about it. We must be able to support all systems in a uniform way.
  3. We need an intelligent system for intelligently handling all notifications with any level of intelligence that's currently and futurely possible. We need to support snoozing, anomalies, maintenance windows, thresholds, and every other type of intelligence you can imagine.
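
To give requirement 3 some shape, here is a minimal sketch of three of the rules named there (snoozing, maintenance windows, thresholds) applied to one alert at a time. The alert fields, device names, metric names, and the `should_notify` helper are hypothetical, and anomaly detection is deliberately left out.

```python
from datetime import datetime

def should_notify(alert, now, snoozed_until, maintenance, thresholds):
    """Decide whether an alert should reach a technician.

    alert: dict with "device", "metric" and "value" keys
    snoozed_until: device -> datetime until which it is snoozed
    maintenance: device -> (start, end) of a maintenance window
    thresholds: metric -> minimum value worth alerting on
    """
    device = alert["device"]
    until = snoozed_until.get(device)
    if until is not None and now < until:
        return False  # someone snoozed this device
    window = maintenance.get(device)
    if window is not None and window[0] <= now < window[1]:
        return False  # expected noise during maintenance
    limit = thresholds.get(alert["metric"])
    if limit is not None and alert["value"] < limit:
        return False  # below the level we care about
    return True

now = datetime(2016, 2, 17, 3, 0)
snoozed_until = {"printer-03": datetime(2016, 2, 18, 0, 0)}
maintenance = {"nas-01": (datetime(2016, 2, 17, 2, 0), datetime(2016, 2, 17, 4, 0))}
thresholds = {"disk_used_pct": 90}
alerts = [
    {"device": "printer-03", "metric": "toner_low", "value": 1},        # snoozed
    {"device": "nas-01", "metric": "disk_used_pct", "value": 97},       # in maintenance
    {"device": "veeam-server", "metric": "disk_used_pct", "value": 72}, # below threshold
    {"device": "veeam-server", "metric": "disk_used_pct", "value": 95}, # should notify
]
print([should_notify(a, now, snoozed_until, maintenance, thresholds) for a in alerts])
```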

s1037989 commented 8 years ago

Action items!

  1. Keith: design a naming convention for email addresses
  2. Montez: implement a datastore
  3. Stefan: provide a store-and-forward email alert gateway
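
For item 3, a minimal sketch of the store-and-forward idea: every alert is written to a datastore first and is only marked delivered once the forward step succeeds, so an outage downstream does not lose alerts. SQLite, the table layout, the `StoreAndForward` class, and the example.com address are assumptions for illustration. The actual gateway design is still to be decided, and accepting inbound mail (SMTP) is not shown.

```python
import sqlite3

class StoreAndForward:
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "id INTEGER PRIMARY KEY, recipient TEXT, body TEXT, "
            "delivered INTEGER DEFAULT 0)"
        )

    def store(self, recipient, body):
        # Persist the alert before anything else is attempted.
        with self.db:
            self.db.execute(
                "INSERT INTO queue (recipient, body) VALUES (?, ?)",
                (recipient, body),
            )

    def forward(self, send):
        """Try to deliver every pending alert; return how many succeeded."""
        rows = self.db.execute(
            "SELECT id, recipient, body FROM queue WHERE delivered = 0 ORDER BY id"
        ).fetchall()
        sent = 0
        for row_id, recipient, body in rows:
            try:
                send(recipient, body)
            except Exception:
                continue  # leave it queued for the next attempt
            with self.db:
                self.db.execute(
                    "UPDATE queue SET delivered = 1 WHERE id = ?", (row_id,)
                )
            sent += 1
        return sent

    def pending(self):
        return self.db.execute(
            "SELECT COUNT(*) FROM queue WHERE delivered = 0"
        ).fetchone()[0]

gateway = StoreAndForward()
gateway.store("alerts@example.com", "Backup job Nightly failed")
gateway.store("alerts@example.com", "Destination unavailable")

def ticket_system_down(recipient, body):
    raise ConnectionError("ticketing system unreachable")

print(gateway.forward(ticket_system_down), gateway.pending())
print(gateway.forward(lambda recipient, body: None), gateway.pending())
```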

s1037989 commented 8 years ago

Orange Box (project) Orange Box (dev site)