Tendrl / specifications

Grafana event handling in monitoring-integration #169

Closed GowthamShanmugam closed 6 years ago

GowthamShanmugam commented 7 years ago

Currently, tendrl gets all threshold-related alerts from collectd. With the new change, grafana is going to send all threshold-related alerts to tendrl. So the new alerting system should catch all alerts from grafana, convert each alert into the proper structure, and send it as a notification to the corresponding users.

Requirements:

Enable webhook endpoints such that the events generated by grafana can be consumed.
Process the received events to add tendrl context and metadata.
Pass the processed events to the node agent socket for transport.

There is no change to the current alerting module.
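
The three requirements above could be sketched roughly as follows. This is only an illustration under assumptions: the payload keys used in the example (`ruleName`, `state`, `message`) follow grafana's legacy webhook notification format, while `add_tendrl_context`, `forward_event`, `handle_webhook_body`, the `node_id` field, and the newline-delimited JSON framing are hypothetical names and choices, not taken from the tendrl code base.

```python
import json
import socket
import time


def add_tendrl_context(grafana_event, node_id):
    """Wrap a decoded grafana webhook payload with hypothetical tendrl metadata."""
    return {
        "source": "grafana",
        "node_id": node_id,               # assumed field name
        "received_at": int(time.time()),  # seconds since the epoch
        "payload": grafana_event,
    }


def forward_event(sock, event):
    """Send one event as a newline-terminated JSON document over a connected socket."""
    sock.sendall(json.dumps(event).encode("utf-8") + b"\n")


def handle_webhook_body(body, node_id, sock):
    """Decode a webhook request body, enrich it, and pass it on for transport."""
    event = add_tendrl_context(json.loads(body), node_id)
    forward_event(sock, event)
    return event
```

In a real deployment `sock` would be a socket connected to the node agent; the socket path is not given in this thread, so it is left to the caller here.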

GowthamShanmugam commented 7 years ago

@rishubhjain @anmolbabu @r0h4n @anivargi

rishubhjain commented 7 years ago

@GowthamShanmugam I guess right now we are just focusing on getting alerts from grafana instead of other sources (collectd) and providing them to our existing alerting system (via etcd or directly).

GowthamShanmugam commented 7 years ago

@rishubhjain yes. To receive that alert and convert its structure in tendrl we need some changes: we need an endpoint to receive the grafana alert, and we need to convert the grafana alert structure into the already existing alert structure. We also have to find the dependent alert for that particular alert, so the existing alert system needs some changes to adopt these features.
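
A minimal sketch of the structure conversion described above. The input keys (`ruleName`, `state`, `message`, `evalMatches`) follow grafana's legacy webhook payload; the output keys and the severity mapping are assumptions made for illustration, since the existing tendrl alert structure is not shown in this thread. Finding the dependent alert is not covered.

```python
# Hypothetical mapping from grafana alert states to severities.
STATE_TO_SEVERITY = {"alerting": "WARNING", "ok": "INFO"}


def to_tendrl_alert(grafana_alert):
    """Convert a grafana webhook payload into an assumed tendrl-style alert dict."""
    state = grafana_alert.get("state", "")
    matches = grafana_alert.get("evalMatches") or []
    first = matches[0] if matches else {}
    return {
        "resource": grafana_alert.get("ruleName", "unknown"),
        "severity": STATE_TO_SEVERITY.get(state, "UNKNOWN"),
        "current_value": first.get("value"),  # value of the first matching series
        "tags": {
            "message": grafana_alert.get("message", ""),
            "metric": first.get("metric"),
        },
    }
```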

rishubhjain commented 7 years ago

@anmolbabu @brainfunked Are we supposed to address the issue of alerts not getting grouped and sent in the upcoming release? Should we include it in the specs/docs?

GowthamShanmugam commented 7 years ago

@brainfunked

GowthamShanmugam commented 7 years ago

As per the discussion with @brainfunked, the endpoint for alerting is localhost, and the port should be unique, so it should be in the 10000s range.
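
For illustration, a localhost-only listener on a port in that range might look like the sketch below. It uses only the Python standard library; the exact range `10000-10999`, the handler name, and the in-memory `received` list are assumptions, and a real service would hand the payload to the alert processing code instead of storing it.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

HOST = "127.0.0.1"                # localhost only
PORT_RANGE = range(10000, 11000)  # assumed reading of "in the 10000s"

received = []  # collected payloads; stands in for real processing


class GrafanaWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            received.append(json.loads(self.rfile.read(length)))
            self.send_response(200)
        except ValueError:
            # Body was empty or not valid JSON.
            self.send_response(400)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def start_server():
    """Bind to the first free port in PORT_RANGE and return the server."""
    for port in PORT_RANGE:
        try:
            return HTTPServer((HOST, port), GrafanaWebhookHandler)
        except OSError:
            continue  # port already in use, try the next one
    raise RuntimeError("no free port in the configured range")
```

Calling `start_server().serve_forever()` would then accept POST requests from grafana on the chosen port.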

rishubhjain commented 7 years ago

@brainfunked continuing the discussion from the channel: as per our discussion on introducing resilience in the system, we would need to handle node failures and component failures. If the system is on the same node, then a node failure would mean grafana failing as well; recovering from that would require a different approach and wouldn't be included in the alerting system. Am I right?

rishubhjain commented 7 years ago

@brainfunked I feel our spec must also include setting up the initial notification channel (custom webhook) in grafana.
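
As a rough sketch of what that setup could involve: grafana's legacy alerting lets notification channels be created over its HTTP API (to my understanding via `POST /api/alert-notifications`). The snippet below only builds the request body. The field names reflect my reading of that API and should be verified against the installed grafana version; the channel name and URL are hypothetical.

```python
import json


def webhook_channel_payload(name, url):
    """Build the JSON body for creating a webhook notification channel."""
    return {
        "name": name,
        "type": "webhook",
        "isDefault": True,  # assumption: use this channel for all alert rules
        "settings": {"url": url, "httpMethod": "POST"},
    }


# Hypothetical channel name and endpoint, for illustration only.
body = json.dumps(webhook_channel_payload("tendrl", "http://127.0.0.1:10000/"))
```

The resulting `body` would be sent to the grafana server with suitable authentication, which is left out here.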

brainfunked commented 7 years ago

@rishubhjain yes.

GowthamShanmugam commented 7 years ago

@brainfunked @shtripat @anmolbabu @cloudbehl the alerts which I have handled from grafana are:

Node:
1. CPU utilization alert
2. Memory utilization alert
3. Swap utilization alert

Cluster:
1. Cluster utilization alert
2. Brick utilization alert
3. Volume utilization alert

Which utilization alerts, other than these, do I have to handle from grafana?

For now I am considering only alerting and ok alerts. The no_data alert does not need to be sent, right? Shall we choose the "keep last state" option for this?
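
The state filtering described here could be sketched as below. The state strings match the `state` values named in this comment (`alerting`, `ok`, `no_data`); the function and constant names are hypothetical.

```python
# Only these states are forwarded as notifications; everything else is dropped.
FORWARDED_STATES = {"alerting", "ok"}


def should_forward(grafana_alert):
    """Return True if the alert's state is one we want to notify about."""
    return grafana_alert.get("state") in FORWARDED_STATES
```

If the rules are configured in grafana to keep the last state when there is no data, a `no_data` notification may never arrive, and the filter then acts only as a safeguard.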

GowthamShanmugam commented 7 years ago

@rishubhjain is the spec updated for this?

rishubhjain commented 7 years ago

@GowthamShanmugam I am not sure, but probably not. Is there a need to update the spec?