Open kirillyu opened 11 months ago
@kirillyu Thank you for the feedback. We are glad that the project fits your needs.
We plan to add integration with Grafana On-Call for Escalation and Acknowledgement. On-Call is an active project, and I believe we can improve it by collaborating with the team.
BI Engine (Alerting) supports Grafana HA out of the box using LBA for performance and redundancy. Support for distributed cron jobs and API requests is on our roadmap.
We aim to solve these kinds of visibility issues and provide an easy, clean interface for Alerting in Grafana.
Thanks for the quick and detailed answer. It is not entirely clear what LBA means in this context. Can you expand on the term?
@kirillyu BI Engine takes a totally different approach from native Alerting. It is designed to run in a separate Docker container, whereas native Alerting runs as part of the Grafana container or installation.
The Engine connects to Grafana using a URL and a service account token to get the panel/query configuration and run the query. Following Grafana HA best practices (https://grafana.com/docs/grafana/latest/setup-grafana/set-up-for-high-availability/), it's recommended to use an LBA/Proxy, which you specify in GRAFANA_URL.
##
## Service Account
## - Viewer permission is required to access dashboards
## - Editor permission is required to access dashboards and add Annotations
##
GRAFANA_TOKEN=SERVICE-ACCOUNT-TOKEN
##
## Grafana HTTP API
##
GRAFANA_URL=http://grafana:3000
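As a rough illustration of how a client could use these two settings, here is a minimal Python sketch that builds an authenticated request to Grafana's HTTP API. It is not BI Engine's code: the helper names and the dashboard UID are hypothetical, and it only assumes the standard `Authorization: Bearer` header for service account tokens and the `/api/dashboards/uid/<uid>` endpoint.

```python
import json
import os
import urllib.request


def build_dashboard_request(base_url: str, token: str, dashboard_uid: str) -> urllib.request.Request:
    """Build a GET request for one dashboard, authenticated with a service account token."""
    url = f"{base_url.rstrip('/')}/api/dashboards/uid/{dashboard_uid}"
    headers = {
        "Authorization": f"Bearer {token}",  # service account token from GRAFANA_TOKEN
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_dashboard(base_url: str, token: str, dashboard_uid: str) -> dict:
    """Send the request and decode the JSON body (needs a reachable Grafana)."""
    request = build_dashboard_request(base_url, token, dashboard_uid)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # GRAFANA_URL may point at a load balancer or proxy in front of several Grafana instances.
    grafana_url = os.environ.get("GRAFANA_URL", "http://grafana:3000")
    grafana_token = os.environ.get("GRAFANA_TOKEN", "SERVICE-ACCOUNT-TOKEN")
    # "example-uid" is a placeholder dashboard UID, not a real one.
    print(build_dashboard_request(grafana_url, grafana_token, "example-uid").full_url)
```

Because the client only ever sees GRAFANA_URL, pointing it at a load balancer instead of a single instance requires no change on the client side.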
I hope it's clear. We will add a diagram to the documentation for the upcoming release.
Got it, but it's not totally clear. This article is about Grafana HA, and I already have that. Can your container be deployed as a cluster with event exchange for deduplication?
@kirillyu As I replied, support for distributed cron jobs and API requests is on our roadmap.
We won't use deduplication; we will use distributed locking to prevent the same alerts (cron jobs) from being started more than once.
I wanted to make sure that is what this is about, thanks!
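To make the idea of locking instead of deduplication concrete, here is a minimal Python sketch, not BI Engine's actual implementation. It assumes each engine tries to take a lock keyed by alert and scheduled tick before evaluating; `InMemoryLockStore` is a hypothetical stand-in for whatever shared store (a database, Redis, etc.) a real deployment would use.

```python
import threading
import time


class InMemoryLockStore:
    """Hypothetical stand-in for a shared lock store; real engines would share a DB or cache."""

    def __init__(self):
        self._guard = threading.Lock()
        self._expires_at = {}  # lock key -> monotonic expiry time

    def try_acquire(self, key: str, ttl_seconds: float) -> bool:
        """Take the lock if it is free or expired; return False if someone else holds it."""
        now = time.monotonic()
        with self._guard:
            expiry = self._expires_at.get(key)
            if expiry is not None and expiry > now:
                return False
            self._expires_at[key] = now + ttl_seconds
            return True


def run_alert_once(store, alert_id, scheduled_tick, engine_id, evaluate, ttl_seconds=60.0):
    """Evaluate an alert for a given tick only if this engine wins the lock."""
    key = f"alert:{alert_id}:tick:{scheduled_tick}"
    if not store.try_acquire(key, ttl_seconds):
        return False  # another engine already started this run
    evaluate(alert_id)  # engine_id is kept only for logging/debugging in this sketch
    return True


if __name__ == "__main__":
    store = InMemoryLockStore()
    for engine in ("engine-a", "engine-b"):
        started = run_alert_once(store, "cpu-high", 1000, engine, lambda a: None)
        print(engine, "started" if started else "skipped")
```

The lock is deliberately left to expire rather than released, so a second engine that reaches the same tick slightly later still skips it.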
We created diagrams to illustrate our HA concept.
@kirillyu We published a new blog post and hands-on tutorials to highlight our latest development: https://volkovlabs.io/blog/big-1.6.0-20240117/
We are looking for feedback and would be interested to hear your thoughts. Send me an email at mikhail at volkovlabs.io to schedule a call.
@mikhail-vl Hello! I subscribed to updates and was one of the first to watch. As a standalone solution, your tool is very cool. It reminds me of the old alerts in Grafana, with a new set of conveniences that many people were missing. But when I think about its practical benefits for myself, I come to a dead end. I will give an example of my reasoning:
I see the Swagger API release! Cool
@kirillyu Yes, Swagger was added in v1.7.0 and is in the documentation: https://volkovlabs.io/big/api/.
Also, in v1.7.0 we implemented distributed alert scheduling. Each scheduler assigns alerts independently. If one of the engines dies, the rest will take over the load. BI(G) supports HA at all levels now: https://volkovlabs.io/big/high-availability/
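One way schedulers can assign alerts independently and still agree on the result is rendezvous (highest-random-weight) hashing. The Python sketch below illustrates that technique only; the thread does not say which algorithm BI(G) uses, and the engine and alert names are made up.

```python
import hashlib


def _score(engine_id: str, alert_id: str) -> int:
    """Deterministic pseudo-random weight for an (engine, alert) pair."""
    digest = hashlib.sha256(f"{engine_id}|{alert_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner_of(alert_id: str, live_engines: list) -> str:
    """Pick the live engine with the highest weight for this alert."""
    if not live_engines:
        raise ValueError("no live engines available")
    return max(sorted(live_engines), key=lambda engine: _score(engine, alert_id))


def assign(alert_ids: list, live_engines: list) -> dict:
    """Compute the full assignment; every engine can run this locally and get the same answer."""
    assignment = {engine: [] for engine in live_engines}
    for alert_id in alert_ids:
        assignment[owner_of(alert_id, live_engines)].append(alert_id)
    return assignment


if __name__ == "__main__":
    alerts = [f"alert-{i}" for i in range(10)]
    print(assign(alerts, ["engine-a", "engine-b", "engine-c"]))
    # Simulate engine-c dying: only its alerts move, the others keep their owner.
    print(assign(alerts, ["engine-a", "engine-b"]))
```

With this scheme, losing one engine only moves the alerts that engine owned; the remaining engines keep theirs and absorb the orphaned ones.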
@kirillyu Thank you for sharing, that's impressive!
I would like to hear about your experience with OnCall when ready. Integration with OnCall is next on our list to implement together with increasing test coverage and small features before the official release.
Now it's for BIG systems :)
I'm very deep into incident management. Grafana OnCall is actually poorly suited for this task. The paradox is that it has everything needed for it. This idea was inspired by the BIG tool: Grafana has very wide capabilities, but they have to be pulled out of it with great difficulty. It's the same with OnCall. Therefore, without going into further detail, it seems to me that OnCall is better used as an engine rather than as an integration: build an interface on top of it.
What's wrong with it? Lack of flexibility! At the input, you set the conditions under which a certain escalation flow is triggered. So, for example, if you want a single alert to notify two teams at once, that is very difficult to do.
Let's move on. The escalations themselves are also completely hardcoded and do not work with variables. I can't make one universal escalation, even the simplest one, where the team to be notified is specified as a variable taken from a trigger or from a metric label. A new escalation has to be created for each team and for each unique variant of the escalation flow. I have a huge number of teams, so this is simply too expensive to implement.
The last case is the UI/UX of the "incident" itself. From there I can request a specific person or team and look at the resolution notes along with the status history. But in my opinion this is terribly limited functionality. It seems I should be able to resend an alert, manually advance the escalation step, make an announcement or call to managers, send engineers a conference call to join to work on a solution, and integrate all this with ChatOps so that information arrives directly from the chats and the incident manager gets in contact with the end engineers. Finally, I should be able to convert it all into a post-mortem after the incident. Then this would be a real tool.
And the incident resolution tool itself could provide at least the MTTR metric, not to mention the other key metrics. We define the rules themselves in YAML, along with the on-call rotations and their contact options; OnCall does not support this well either. I'll stop here without going into unnecessary detail. At the same time, OnCall does have acknowledgment, calendar-based rotations to which you can send alerts, and, most importantly, escalation steps in case the person who initially received the alert did not respond.
This turned out a bit chaotic, but that's all I have on OnCall for now :)
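The "one universal escalation" described above, where the team comes from an alert label instead of being hardcoded, could look roughly like the following Python sketch. This is purely illustrative and is not Grafana OnCall's API: the `Step` type, the `team` label, the target naming scheme, and the default team are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    wait_minutes: int      # delay before this step fires if nobody acknowledged
    notify_template: str   # target with a {team} placeholder, resolved per alert


# A single policy shared by every team; only the {team} variable differs per alert.
UNIVERSAL_POLICY = (
    Step(0, "{team}-primary"),
    Step(10, "{team}-secondary"),
    Step(30, "{team}-manager"),
)


def resolve_escalation(labels: dict, policy=UNIVERSAL_POLICY, default_team: str = "platform") -> list:
    """Expand the shared policy into concrete (delay, target) pairs using the alert's team label."""
    team = labels.get("team", default_team)
    return [(step.wait_minutes, step.notify_template.format(team=team)) for step in policy]


if __name__ == "__main__":
    for delay, target in resolve_escalation({"alertname": "HighLatency", "team": "payments"}):
        print(f"after {delay} min -> notify {target}")
```

With a hundred teams, this keeps one policy definition instead of one escalation per team and per flow variant.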
Hi, this is just a very cool project. My team has been looking hard for something like this on the market, something that combines simplicity, functionality, and flexibility. Usage pattern:
Escalation rules are in really bad shape, and workarounds can't cover the gap. The mechanism is this: if nobody responds to an alert, it needs to be sent to another team. Teams are determined by calendars. Grafana On-Call could be used here, but there you cannot parameterize the team/calendar; it is always hardcoded. It is an extremely crude and non-functional solution.
Separately, a question not related to feature development: is an HA installation possible? Clustering or alert exchange, deduplication? We have more than 15 data centers and more than a hundred teams, and we want to give everyone the ability to use alerts that are resilient and have a clear interface. We want the support staff of all teams to have a shared panel from which they can manage incidents based on alerts from any part of the business and quickly identify the team