VolkovLabs / volkovlabs-bi-grafana

Business Intelligence BI(G) for @grafana
https://volkovlabs.io/big
Apache License 2.0

Incident management tool #9

Open kirillyu opened 7 months ago

kirillyu commented 7 months ago

Hi, this is just a really great project. My team has been searching the market for something like it that combines simplicity, functionality, and flexibility. Our usage pattern:

  1. Grafana has alerts, and the alerts have labels for team and severity. We want to track frequent alerts, the teams that do not handle them, and how many criticals or warnings each team receives. We can do this thanks to your Grafana HTTP API plugin, but we run into the problems described in points 2 and 3.
  2. OK, so the alerts are frequent or many of them are critical, and we want to alert on that as well, but a number of your plugins do not allow it. This is where your tool is already very useful.
  3. At the same time, an alert can be frequent and critical precisely because the problem is being worked on, hence the question: is there acknowledgment functionality, or more precisely a way to assign an alert to whoever sends a webhook back to your plugin's API? Right now everything is held together with duct tape: a JSON exporter collects Grafana alerts through its API and turns them into a metric; when we see that alerts are frequent, we alert another team; teams have acknowledgment functionality, but it is essentially an update of the alert metric by changing a label (a rough sketch of this pipeline is shown after the list).
  4. Escalation rules are the worst part, and workarounds cannot cover them. The mechanism is: if nobody reacts to an alert, it has to be forwarded to another team, and teams are determined by calendars. Grafana OnCall could do this, but the team/calendar cannot be parameterized there; it is always hard-coded. An extremely crude and limited solution.

    Separately, a question not about new functionality: is an HA installation possible? Clustering, alert exchange, deduplication? We have more than 15 data centers and over a hundred teams, and we want to give everyone reliable alerts with a clear interface, so that the support staff of all teams share a common panel from which they can manage incidents based on alerts from any part of the business and quickly identify the responsible team.
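
Roughly, the duct-tape pipeline from point 3 looks like the sketch below; the endpoint, label names, and port are illustrative assumptions, not anything BI(G) provides:

# Minimal sketch: pull firing Grafana alerts over the HTTP API and expose
# them as a Prometheus metric, so alert frequency per team/severity can
# itself be alerted on. Endpoint, labels, and port are assumptions.
import time
import requests
from prometheus_client import Gauge, start_http_server

GRAFANA_URL = "http://grafana:3000"      # assumed Grafana address
GRAFANA_TOKEN = "SERVICE-ACCOUNT-TOKEN"  # assumed service account token

firing = Gauge(
    "grafana_firing_alerts",
    "Firing Grafana alerts by team and severity",
    ["team", "severity"],
)

def scrape() -> None:
    # Alertmanager-compatible endpoint exposed by Grafana unified alerting.
    resp = requests.get(
        f"{GRAFANA_URL}/api/alertmanager/grafana/api/v2/alerts",
        headers={"Authorization": f"Bearer {GRAFANA_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    firing.clear()
    for alert in resp.json():
        labels = alert.get("labels", {})
        firing.labels(
            team=labels.get("team", "unknown"),
            severity=labels.get("severity", "none"),
        ).inc()

if __name__ == "__main__":
    start_http_server(9105)  # local scrape target for Prometheus
    while True:
        scrape()
        time.sleep(30)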

mikhail-vl commented 7 months ago

@kirillyu Thank you for the feedback; we are glad the project fits your needs.

  1. It's great to hear that Grafana HTTP API Data Source works for your use case.
  2. Grafana native Alerting does not support frontend data sources like the HTTP API. BI Alerting will work with any available data source.
  3. We can look into adding Acknowledgement in future releases.
  4. We plan to add integration with Grafana On-Call for Escalation and Acknowledgement. On-Call is an active project, and I believe we can improve it by collaborating with the team.

    BI Engine (Alerting) supports Grafana HA out of the box using an LBA for performance and redundancy. Support for distributed cron jobs and API requests is on our roadmap.

    We aim to solve this kind of visibility issue and provide an easy, clean interface for Alerting in Grafana.

kirillyu commented 7 months ago

Thanks for the quick and detailed answer. It is not entirely clear what LBA means in this context. Can you expand on the term?

mikhail-vl commented 7 months ago

@kirillyu BI Engine takes a totally different approach than native Alerting. It is designed to run in a separate Docker container, while native Alerting runs as part of the Grafana container or installation.

The Engine connects to Grafana using a URL and a service account token to retrieve the panel/query configuration and run queries. Following Grafana HA best practices (https://grafana.com/docs/grafana/latest/setup-grafana/set-up-for-high-availability/), it's recommended to use an LBA (load balancer)/proxy, which you specify in GRAFANA_URL.

##
## Service Account
## - Viewer permission is required to access dashboards
## - Editor permission is required to access dashboards and add Annotations
##
GRAFANA_TOKEN=SERVICE-ACCOUNT-TOKEN

##
## Grafana HTTP API
##
GRAFANA_URL=http://grafana:3000
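
In an HA setup, GRAFANA_URL simply points at the LBA/proxy in front of the Grafana instances instead of a single instance; the hostname below is only an example:

##
## Grafana HTTP API behind a load balancer (example hostname)
##
GRAFANA_URL=http://grafana-lb:3000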

I hope that makes it clear. We will add a diagram to the documentation in the upcoming release.

kirillyu commented 7 months ago

Got it, but it's still not totally clear. That article is about Grafana HA, and I already have that. Can your container be deployed as a cluster with event exchange for deduplication?

mikhail-vl commented 7 months ago

@kirillyu As I replied above, support for distributed cron jobs and API requests is on our roadmap.

We won't use deduplication; we will use distributed locking to prevent starting the same alerts (cron jobs) more than once.
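
Conceptually, before an engine instance runs a scheduled alert, it first acquires a short-lived lock keyed by that alert, so only one instance evaluates it. A rough sketch using a Redis-style lock (an illustration of the technique, not the actual BI Engine implementation) could look like this:

# Conceptual sketch of distributed locking for scheduled alerts: only the
# engine instance that acquires the lock evaluates the alert. Redis and the
# key/TTL scheme are illustrative assumptions, not BI Engine internals.
import redis

r = redis.Redis(host="redis", port=6379)

def run_alert_if_lock_acquired(alert_id: str, instance_id: str, ttl_ms: int = 30_000) -> bool:
    # SET ... NX PX: succeeds only if no other instance currently holds the lock.
    acquired = r.set(f"alert-lock:{alert_id}", instance_id, nx=True, px=ttl_ms)
    if not acquired:
        return False  # another engine is already running this alert
    try:
        evaluate_alert(alert_id)  # hypothetical evaluation step
        return True
    finally:
        # Release only if we still own the lock, to avoid deleting a newer owner's lock.
        if r.get(f"alert-lock:{alert_id}") == instance_id.encode():
            r.delete(f"alert-lock:{alert_id}")

def evaluate_alert(alert_id: str) -> None:
    print(f"evaluating alert {alert_id}")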

kirillyu commented 7 months ago

I just wanted to make sure that's what this is about, thanks!

mikhail-vl commented 6 months ago

We created diagrams to illustrate our HA concept.

[Diagrams: BI-HA, BI-workflow]

mikhail-vl commented 5 months ago

@kirillyu We published a new blog post and hands-on tutorials to highlight our latest development: https://volkovlabs.io/blog/big-1.6.0-20240117/

We are looking for feedback and would be interested to hear your thoughts. Send me an email at mikhail at volkovlabs.io to schedule.

kirillyu commented 5 months ago

@mikhail-vl Hello! I subscribed to updates and was one of the first to watch. As a standalone solution, your tool is very cool. It reminds me of the old alerts in Grafana, with a new set of conveniences that many people were missing. But when I think about its practical benefit for me, I hit a dead end. Here is an example of my reasoning:

  1. My company now has more than 12 thousand alerts: some of them in Grafana, some in Alertmanager, some in our own solution, and some in Zabbix. They all carry their own routes, severity levels, and labels, and I have no idea how to migrate them.
  2. Continuing the first point, some of the alerts are very valuable; to protect them we use IaC, or instrument the API via Terraform, and store them as code with all the usual benefits such as code owners and inheritance.
  3. If I could leave the other alerting tools as they are and just start using your tool, it would be very convenient to have universal support for any source of alerts: monitoring their status, annotations, links to charts, and the other features that BIG provides for analytics.

Globally, unfortunately, right now I spend most of my time on functionality that would let on-duty engineers maintain calendars, acknowledge alerts, and follow escalation steps, and most importantly would provide an interface for all of this. I'm trying to set up Grafana OnCall in the absence of other adequate tools. As soon as that problem is solved, I will get to the alerting task. In any case, I am very interested in BIG and will try to participate in its development.

kirillyu commented 5 months ago

Saw the Swagger API release, cool!

mikhail-vl commented 5 months ago

@kirillyu Yes, Swagger was added in v1.7.0 and is covered in the documentation: https://volkovlabs.io/big/api/.

Also, in v1.7.0 we implemented distributed alert scheduling. Each scheduler assigns alerts independently, and if one of the engines dies, the rest take over the load. BI(G) now supports HA at all levels: https://volkovlabs.io/big/high-availability/

mikhail-vl commented 5 months ago

@kirillyu Thank you for sharing, that's impressive!

I would like to hear about your experience with OnCall when you're ready. Integration with OnCall is next on our list to implement, along with increased test coverage and small features, before the official release.

kirillyu commented 5 months ago

> @kirillyu Yes, Swagger was added in v1.7.0 and is covered in the documentation: https://volkovlabs.io/big/api/.
>
> Also, in v1.7.0 we implemented distributed alert scheduling. Each scheduler assigns alerts independently, and if one of the engines dies, the rest take over the load. BI(G) now supports HA at all levels: https://volkovlabs.io/big/high-availability/

Now it's for BIG systems :)

kirillyu commented 5 months ago

I'm very deep into incident management. Grafana OnCall is actually poorly suited for this task, and the paradox is that it has everything needed for it. This thought was prompted by the BIG tool: Grafana has very broad capabilities that have to be pulled out of it with great difficulty, and it's the same with OnCall. So, without going into further detail, it seems to me that OnCall is better used as an engine rather than as an integration, with an interface built on top of it.

What's wrong with it? Lack of flexibility. At the input, you set the conditions under which a certain escalation flow should be triggered; if, for example, you want to notify two teams at once based on one alert, that is very difficult to do. The escalation chains themselves are also completely hard-coded and do not work with variables. I can't build one universal escalation, even the simplest one, where I simply pass the team to be notified as a variable taken from the trigger or from a metric label. A new escalation has to be created for each team and for each unique variant of the escalation flow; with a huge number of teams, that is simply too expensive to implement.

The last issue is the UI/UX of the "incident" itself. From there I can page a specific person or team and look at the resolution notes along with the status history. In my opinion this is terribly limited functionality: I should be able to re-send an alert, manually advance the escalation step, make an announcement or call to managers, start a conference call and tell engineers where to go for the resolution, integrate this with ChatOps so information comes straight from the chats and the incident manager gets direct contact with the engineers doing the work, and finally convert all of this into a post-mortem after the incident. Then it would be a real tool. A tool for resolving incidents should also provide at least the MTTR metric, not to mention the other key metrics. We would also like to define the rules themselves as YAML, along with on-call rotations and their notification options; OnCall does not support this well either.

Without going into unnecessary details, I'll stop here. At the same time, OnCall does have acknowledgment, calendar rotations you can route alerts to, and, most importantly, escalation steps for when the person who initially received the alert does not respond.

kirillyu commented 5 months ago

That came out a bit chaotic, but this is all I have on OnCall for now :)