Uninett / Argus

Argus is an alert aggregator for monitoring systems
GNU General Public License v3.0

Reporting errors in Argus itself #88

Open hmpf opened 4 years ago

hmpf commented 4 years ago

Errors that Argus can detect in itself should be reported as incidents. For instance: a failure to send a notification because the notification endpoint isn't answering (email server down, say).

This means we need a SourceType "argus" and a SourceSystem representing the host Argus is running on. Named "self" maybe? "me"? I suspect hostname would be tricky. We also need a function/method that Argus can use to write to the incidents table, with SourceType/SourceSystem locked.

(This is very nice, because we can dogfood the system using itself, triggering errors in argus in order to have incidents turn up in argus :) )

hmpf commented 4 years ago

These objects should be get_or_created early and easily and often. Maybe a management command named "setup" or "verify" or something, or get_or_created on each use of the function.

The first use of the function could be when setting up the system for the first time, a "Hello, World" incident, low severity!
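
A minimal sketch of the idea, assuming in-memory stand-ins instead of the real Django models. `Store`, `get_or_create_self_source` and `report_self_incident` are hypothetical names, and level 5 is assumed here to mean the lowest severity. The point is only that the source is get-or-created on every use and that the caller cannot choose it:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical fixed names for the self-reporting source (not from Argus).
SELF_TYPE = "argus"
SELF_SYSTEM = "self"


@dataclass
class Store:
    """In-memory stand-in for the source type, source system and incident tables."""

    source_types: set = field(default_factory=set)
    source_systems: dict = field(default_factory=dict)  # system name -> type name
    incidents: list = field(default_factory=list)

    def get_or_create_self_source(self):
        # Idempotent, so it is safe to call on every use.
        self.source_types.add(SELF_TYPE)
        self.source_systems.setdefault(SELF_SYSTEM, SELF_TYPE)
        return SELF_SYSTEM


def report_self_incident(store, description, level=5):
    """Record an incident about Argus itself; the source is locked."""
    source = store.get_or_create_self_source()
    incident = {
        "source": source,
        "source_type": store.source_systems[source],
        "description": description,
        "level": level,  # assumption: 5 is the lowest severity
        "start_time": datetime.now(timezone.utc),
    }
    store.incidents.append(incident)
    return incident


store = Store()
hello = report_self_incident(store, "Hello, World")
again = report_self_incident(store, "Second report reuses the same source")
```

With real models, the get-or-create step would presumably map onto Django's `get_or_create()`, either inside the function as here or in a one-off management command.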

hmpf commented 4 years ago

There is now a way to auto-create an argus user/source/source type. What's left is to create a suitable incident every time Argus complains about something in its logs.
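
One way to hook into the logs would be a custom `logging.Handler`. The sketch below uses only the standard library; `SelfIncidentHandler` and the `report` callable are hypothetical, and a plain list stands in for whatever would actually create the incident:

```python
import logging


class SelfIncidentHandler(logging.Handler):
    """Hypothetical handler that forwards log records to an incident reporter."""

    def __init__(self, report, level=logging.ERROR):
        super().__init__(level=level)
        self._report = report  # callable taking a description string

    def emit(self, record):
        try:
            self._report(self.format(record))
        except Exception:
            # A failure while reporting must not propagate into the code that logged.
            self.handleError(record)


reported = []  # stand-in for the incidents table
logger = logging.getLogger("argus.example")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(SelfIncidentHandler(reported.append))

logger.info("routine message")              # below ERROR, so the handler skips it
logger.error("email server not answering")  # forwarded to the reporter
```

Only records at the handler's level or above reach `emit`, so the threshold decides which complaints become incidents.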

katsel commented 3 years ago

This feature sounds very useful!

Still, I am a bit worried that, in certain cases, Argus might overload itself with its own error messages if this were implemented naively: an error triggers an incident, which matches a filter and is sent out by mail, which causes another error that triggers another incident, ad infinitum. Congrats on DoSing yourself and/or taking a whole Argus instance down.

So, two requirements should be met before implementation: 1) A clear, exhaustive, written spec of which errors can cause incidents and which cannot. 2) A mechanism to prevent Argus from choking on its own incidents: some filtering message queue, or another mechanism for rate limiting.
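
As a rough illustration of the rate-limiting option, here is a sliding-window limiter. `SelfIncidentThrottle` and its parameters are hypothetical, and the injected clock exists only to make the example deterministic:

```python
import time
from collections import deque


class SelfIncidentThrottle:
    """Hypothetical limiter: at most max_incidents per per_seconds window."""

    def __init__(self, max_incidents, per_seconds, clock=time.monotonic):
        self.max_incidents = max_incidents
        self.per_seconds = per_seconds
        self._clock = clock
        self._stamps = deque()  # times of recently allowed incidents
        self.dropped = 0        # how many reports were suppressed

    def allow(self):
        now = self._clock()
        # Forget incidents that have left the window.
        while self._stamps and now - self._stamps[0] >= self.per_seconds:
            self._stamps.popleft()
        if len(self._stamps) < self.max_incidents:
            self._stamps.append(now)
            return True
        self.dropped += 1
        return False


# Fake clock so the run below does not depend on wall time.
fake_now = [0.0]
throttle = SelfIncidentThrottle(
    max_incidents=2, per_seconds=60, clock=lambda: fake_now[0]
)

burst = [throttle.allow() for _ in range(4)]  # four errors at once
fake_now[0] = 61.0                            # the window has passed
after_window = throttle.allow()
```

Here the first two reports in the burst pass and the other two are counted in `dropped`, which caps how fast an error loop can feed itself. It does not replace the written spec in point 1.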

katsel commented 3 years ago

Removing the "good first issue" tag because of the aforementioned concerns. The actual code change may be easy to make, but it seems wise to reduce the threat vector a bit before tackling an implementation.

katsel commented 3 years ago

Another approach would be sending a notification through Argus without creating an incident. Details to be discussed later.

johannaengland commented 6 months ago

This came up again in #760.