Uninett / Argus

Argus is an alert aggregator for monitoring systems
GNU General Public License v3.0
18 stars 13 forks source link

Add a severity level to Incident #70

Closed lunkwill42 closed 3 years ago

lunkwill42 commented 4 years ago

Version 1.0 of Argus was released without an Incident attribute for "Severity", as this feature was underspecified.

After discussing internally how our service desk should operate, mostly according to ITIL principles, we have arrived at a simple design decision for Argus.

To avoid overloading terminology too much, we shall simply introduce a new attribute to the Incident model to loosely map to the concept of severity:

level shall be an integer value, in the range 1 through 5. 1 is the highest level, used for critical incidents. 5 is the lowest level. For compatibility purposes, the attribute shall not be mandatory (i.e. NULL values are allowed).

The 5 levels shall be assigned names for most human consumption. These names are subject to change, so ease of modification at a later stage must be considered. Each level must also feature a long-form description, which the frontend can use to help users understand the severity level properly.

The suggested level names for the initial implementation are:

5=Information 4=Low 3=Moderate 2=High 1=Critical

A null value may be termed "None" or "N/A".

How incidents are classified into these levels remains entirely up the API clients / glue services that report incidents from their corresponding source systems.

The severity level is only ever intended to be set by API clients, not including the frontend UI client.

As a side effect, notification filters must be changed to be able to filter on severity levels (greater than/less than/equals/not set). This also means that the incidents API endpoint needs a new argument to filter on levels as well. We should perhaps post these as separate issues to this one?

hmpf commented 4 years ago

You're mixing up priority and severity. Severity is how bad an incident is for a specific system (boxDown vs. linkDown, say). Priority is how fast it needs to be fixed. A glue service cannot know ahead of time what takes high priority, unless it already has that built in. Also, a priority is primarily decided upon by a human, per case. See #113.

So: we need a severity-field regardless, we could add a priority-field later. The important bit though is low-number == more serious.

Mapping python loglevels in I'd say:

Loglevel Severity
CRITICAL 1
ERROR 2
WARNING 3
INFO/DEBUG/other 4
lunkwill42 commented 3 years ago

So: we need a severity-field regardless, we could add a priority-field later. The important bit though is low-number == more serious.

Mapping python loglevels in I'd say: Loglevel Severity CRITICAL 1 ERROR 2 WARNING 3 INFO/DEBUG/other 4

Let's run this by USC as soon as they are up and running on the current prod version on their screens

katsel commented 3 years ago

Blocked. Needs talking with USC

katsel commented 3 years ago
lunkwill42 commented 3 years ago

After today's design meeting with the CNaaS team and our service desk, I've updated the description of this issue. The incident level names may still be up for discussion, but I suggest we remove the discussion label from this issue at this point, as implementation can start without this detail being settled (it's mostly useful in the UI, the integer values are more important)

katsel commented 3 years ago

A null value may be termed "None" or "N/A".

Chiming in here with my "data nerd" hat. 🎩

First of all, semantics of "N/A" vs. "None" vs. "NULL".

These are not interchangeable and differ in semantic meaning. (Fun fact: Languages specialised in data munching, like R, support all 3 values side by side.)

In a real world data modelling example, you would typically set a "N/A" value in two cases:

  1. this field is not applicable to this kind of entity (for example, a car does not have a gender)
  2. this value is not provided by the data source (but may be added later).

"None" means "there is none". For example, a land parcel can have an owner (or 2 or 3), or it can be unowned (owner = None). It still exists and None is a valid entry.

NULL typically means "invalid" and is used for errors (malformated spreadsheet cell that cannot be parsed correctly, read errors).

NULL values and hierarchical data spec

The problem

Handling NULL/None/NAs (NULLs for short) comes with its own set of complications, peculiarities and unintended side-effects. Introduction of NULLs should be very well justified and unavoidable, especially when using them on a field with an hierarchical data spec. Hierarchical meaning that, values range from 1 to 5 and are ordered: 1 is more severe than 2, 2 is more severe than 3, and so on. Greater than/lower than filters can be applied.

Example

We will filter for great than/lower than 2 on level. Where do NULL values fit in here? A NULL value creates unpleasant 'loopholes' in the data model. If you allow NULLs, it follows that

(number of incidents with severity > 2) + (number of incidents with severity <= 2) != (total number of incidents)

because incidents with level=NULL will not be matched by any of these filters. This is counter-intuitive both from a usability and mathematical perspective, and may lead to incidents accidentally being dropped from notification profiles.

Other solutions

How to circumvent those problems? Two possible solutions:

  1. Require a severity level (1-5) for each incident created, and set a default if no level is given.
    • strongly preferred solution, as it keeps the data model consistent.
    • a good default is to be chosen in dialogue with the user base (may also differ from glue service to glue service?)
  2. Create a new, 6th integer value (0? 6?) to signify "this is a auto-generated severity value"
    • not a "clean" solution, as it requires a new level to be added to the data model that does not follow the hierarchy of the existing ones.
    • But it will, in most cases, make sure no incidents are lost.

Compatibility and existing data

To ensure backward compatibility, assign the default value to all previously created incidents. It's not pretty but it keeps the data model clean and simple for the future.

Summary

Introducing NULL values for the sake of minimally-invasive backwards compatibility sounds nice now, but it comes at a price: Exception handling whenever you deal with severity values, and many additional design choices to make.

Long story short, to introduce filtering on a certain field

it pays off to make sure all entries have a sensible value for the field by default. Use NA/NULL only if non-avoidable (errors!), but not as part of the design. I'm convinced that this can be done, and will save both users and developers a lot of headaches in the future.

katsel commented 3 years ago

This needs more discussion with the USC.

hmpf commented 3 years ago

Field-name: "level", fiekd-type: Integer (1 or bigger), or foreign key. 1: highest priority.

lunkwill42 commented 3 years ago

I agree that allowing NULL values for severity level is not a good idea. However, that still leaves us with some design decisions:

  1. What should the default severity value be for old records, when the new database field is added?
  2. Should specifying severity be optional in the API?

If we make severity values required in the incident creation API

This is a breaking API change. Strictly speaking, we should increase the API version number if we decide to do this.

At the moment, there are only two known API clients in production use:

The NAV glue service

The current API version will actually ignore unknown attributes that are posted. I.e., the glue service can be upgraded in advance to post severity values.

However, the API version is encoded in the URL to the API endpoint. In other words, the glue service will still break if we change the API version number.

Question: How can a client discover which API version is supported by the backend it is connecting to, when the API version is part of the URL to the API?

The Argus frontend

There are some indications that the current version of the frontend will break if new incident attributes are introduced in the API, even if these do not need to be supported by the frontend at all. If this is true, this will also break the frontend even if the severity attribute is not made mandatory when posting. This needs to be investigated

katsel commented 3 years ago

Severity is a central concept in any alert system. This raises the question if severity should ever be optional.

Let's mentally go down the route of making it mandatory. This means prioritising clarity and API consistence over transitional solutions, quickfixes, and potential inconsistencies introduced by NULL values.

If severity is mandatory

Any incident/event posted to Argus needs to have a severity value assigned.

Who will provide this severity level? In the optimum case, the glue service. Argus will not provide defaults -- just complain if it is missing.

Implementation is plainly done by adding a new data field to the API for the severity value and making it mandatory.

Going forward:

  1. Severity needs to be documented and standardised.
    • It is a global property, but it is set de-centrally in the glue-service. So there must be a common concept.
    • If each glue service does it differently, users soon get an inconsistent view on the data and severity will be useless.
  2. In Argus: Add severity field and bump new API version
  3. glue services: Must support the new API version. Must learn to calculate/compute/assign severity.
  4. in production: Migrate old production data to the new scheme
    • old records do not have a severity value. Following the new scheme, they need one.
    • some options: Set a severity value manually for all, set severity by some rules, re-import old data via the updated glue services (2), or dismissing old data

Some general guidance on REST API versioning: https://restfulapi.net/versioning/

katsel commented 3 years ago

What should the default severity value be for old records, when the new database field is added

Am I thinking too simplified here, @lunkwill42?


Should specifying severity be optional in the API?

No.


Interesting thoughts on API versioning. Currently it is done pretty straightforward: via /v1/ URL. Old URLs breaking with every version bump is often a desired feature of API versioning. It tells users that there is a breaking change (pun intended) and they have to update their REST clients to use the new version (or downgrade the REST server). There is no golden rule that REST-APIs have to be backwards-compatible. Things can (foreseeably) break, and that's good.

Having backwards-compatible APIs is useful in some cases: Think popular APIs with 100k of users, that have been up for years, you want a transition period, and you have the dev power to maintain 2 APIs. Fair enough. The number of Argus' API consumers however is low and manageable. The overhead of maintaining two simultaneous API versions cancels out the good. (A case of Hauptversionsnummernerhöhungsangst, maybe? ;)

How can a client discover which API version is supported by the backend it is connecting to, when the API version is part of the URL to the API?

By parsing the URL. Easy.

More info at https://restfulapi.net/versioning/.

Re-doing the API versioning scheme to get rid of the /v1 is a different issue. If you would like to do that, I suggest filing one.


Frontend:

I think we already discovered that the Argus crew has not bothered with API versioning in the past, and now we have two areas where this is becoming apparent. Both are consumers of the API: Glue services (via remodeling the severity model) and the frontend.

Since both are API clients, the solution is one thing: Adopting the habit/best practice of documenting compatibility between Argus server version, API version, and client version (separately for each glue service and the frontend).

It's not directly related to severity values, but important enough to write an issue about. (Hint, hint)

hmpf commented 3 years ago

Nitpick: in SQL, an SQL NULL covers N/A and None. There's no standardized way to show Invalid, that needs to be handled in code, or to be even more pdantic: a NULLable column is not invalid if that column in a specific row is SQL NULL. A non-NULLable column cannot store SQL NULLs. ( I want to avoid ppl imprinting on NULL always meaning "invalid" in all contexts.)

hmpf commented 3 years ago

Another nitpick: how to version an API should not be part and parcel of this issue.

hmpf commented 3 years ago

Default level while we update the glue service and front end: 3 (middle)

katsel commented 3 years ago

Fundamental API and data model changes sometimes make it necessary to dismiss old data. Adding a default value is a massive bodge. Introducing this with all possible side-effects (that are not completely foreseeable yet) means adding technical debt to the project. It messes with the severity definition defined above. It is a mere stopgap solution.

This is not just about a new field in the database. The old incidents would still be around, confusing users. The user learns: "severity 3 means moderate severity -- unless the incident is older than X, in that case it is just a default and has no meaning (but I keep seeing these incidents that are irrelevant to me when I filter for 3)". Teaching the user this, we make our own severity levels unreliable and arbitrary. Not a great design choice. In my opinion, the benefit of having old data around doesn't outweigh this flaw.

If the old data is so important, there is always the option to add proper severity values to it. Values that are in sync with actual severity, and harmonise with the users' expectations. If that seems like too much work, it is a clue that the old data is not important enough, and can be dismissed.

And that would be my preferred solution.

hmpf commented 3 years ago

Closing this as by #288 there exists a field "level" in the database schema, settable by sources, viewable by frontend.