Alignak states management - to be discussed

mohierf commented 7 years ago

This issue sums up how Alignak manages states for hosts and services.

Initial state: A new host/service that has not yet been checked is set to its configured initial_state. Currently, the default initial state is UNREACHABLE (x or 4) for both hosts and services.

Check plugin states: When a check plugin is executed, its exit code determines the host/service state identifier (state_id).

For a host (plugin code -> state identifier -> state):

0 -> 0 -> UP
1 -> 2
2 -> 1 -> DOWN
3 -> 1 -> DOWN
4 -> 4 -> UNREACHABLE
any other -> 1 -> DOWN

The tricky 1 -> 2 mapping is for passive checks... Indeed, only an exit code of 0 means that the host is UP 😉
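The host table above can be sketched as a simple lookup. This is only an illustration of the mapping as listed here, not Alignak's actual code; the names `HOST_EXIT_CODE_TO_STATE_ID`, `HOST_STATE_NAMES` and `host_state_id` are made up for the example, and the state name for identifier 2 is left out because the table does not give one.

```python
# Illustrative sketch of the host table above; not taken from the Alignak code base.
# Keys are plugin exit codes, values are host state identifiers.
HOST_EXIT_CODE_TO_STATE_ID = {0: 0, 1: 2, 2: 1, 3: 1, 4: 4}

# State names exactly as listed in the table; identifier 2 has no name there.
HOST_STATE_NAMES = {0: "UP", 1: "DOWN", 4: "UNREACHABLE"}


def host_state_id(exit_code: int) -> int:
    """Return the host state identifier for a plugin exit code."""
    # Any exit code that is not listed falls back to 1 (DOWN).
    return HOST_EXIT_CODE_TO_STATE_ID.get(exit_code, 1)


if __name__ == "__main__":
    for code in (0, 1, 2, 3, 4, 127):
        state_id = host_state_id(code)
        print(code, "->", state_id, "->", HOST_STATE_NAMES.get(state_id, "(not named above)"))
```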

For a service (plugin code -> state identifier -> state):

0 -> 0 -> OK
1 -> 1 -> WARNING
2 -> 2 -> CRITICAL
3 -> 3 -> UNKNOWN
4 -> 4 -> UNREACHABLE
any other -> 2 -> CRITICAL

Note that legacy Nagios plugins will never return 4 as an exit code... It is an Alignak internal value used when a service is unreachable because its host is down.
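The service table can be written the same way. Again, this is a minimal sketch of the mapping as described in this issue, with hypothetical names (`SERVICE_STATES`, `service_state`), not the real implementation.

```python
# Illustrative sketch of the service table above; not taken from the Alignak code base.
SERVICE_STATES = {
    0: "OK",
    1: "WARNING",
    2: "CRITICAL",
    3: "UNKNOWN",
    4: "UNREACHABLE",  # internal value, never returned by legacy Nagios plugins
}


def service_state(exit_code: int) -> tuple:
    """Return (state_id, state) for a plugin exit code."""
    # Exit codes 0..4 map to themselves; anything else is treated as 2 (CRITICAL).
    state_id = exit_code if exit_code in SERVICE_STATES else 2
    return state_id, SERVICE_STATES[state_id]
```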

Freshness check: When the freshness check is enabled and the freshness threshold expires, the host/service state is set according to the configured freshness_state. Currently, the default freshness state is UNREACHABLE (x or 4) for both hosts and services.
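As a rough sketch of the freshness rule, assuming the last result timestamp and the threshold are both expressed in seconds: the function name `apply_freshness` and the parameters `last_update`, `freshness_threshold`, `check_freshness` and `now` are hypothetical and only serve the example; only `freshness_state` is named in the description above.

```python
import time


def apply_freshness(state, last_update, freshness_threshold,
                    freshness_state="UNREACHABLE", check_freshness=True, now=None):
    """Return the state after applying the freshness rule (illustrative only)."""
    if now is None:
        now = time.time()
    # The rule only applies when freshness checking is enabled and the
    # last result is older than the threshold.
    if check_freshness and now - last_update > freshness_threshold:
        return freshness_state
    return state
```

For example, `apply_freshness("OK", last_update=0, freshness_threshold=60, now=61)` returns the freshness state, while the same call with `now=60` leaves the state unchanged.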

mohierf commented 7 years ago

As suggested on the IRC channel, using a new state (UNTESTED, -1) for the initial state would be of some interest:

it makes it possible to know that a host/service has not yet been checked
this state can be used as a starting point for computing the system availability SLA
an event handler can be triggered if the first check raises the same state as the initial state

More comments and ideas are welcome 😉

fjvt commented 7 years ago

I would like to add a split in the 'unknown' status.

Two scenarios:

An issue with the plugin (exception, scripting error etc) causes the check to go into unknown
An issue on the target side (missing binary, script, etc) causes the check to go into unknown

In both scenarios Alignak cannot determine the state of the service (the service itself may be up, down, etc.; we just can't check), but the cause is different.

In scenario 1, the monitoring people should correct the plugin. In scenario 2, the sysadmins need to make sure the binaries/scripts are present on the target.

I feel that having both scenarios under 'unknown', without being able to distinguish between the two, is not good. For reporting (SLA) in particular it makes a huge difference: scenario 1 is beyond the control of the owner of the service (application, sysadmin, etc.), while scenario 2 is the responsibility of the app owner.

Notifications for scenarios 1 and 2 go to different people/teams, which is not possible if they both report the same state...
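To make the idea concrete, here is a purely hypothetical sketch of what distinguishing the two causes could enable. Nothing here exists in Alignak; `UnknownCause`, `RECIPIENTS`, `notify_group`, `counts_against_owner_sla` and the team names are invented for illustration.

```python
from enum import Enum


class UnknownCause(Enum):
    """Hypothetical split of the 'unknown' state by cause."""
    PLUGIN = "plugin"  # scenario 1: exception or scripting error in the plugin
    TARGET = "target"  # scenario 2: missing binary or script on the target


# Hypothetical routing: each cause notifies a different team.
RECIPIENTS = {
    UnknownCause.PLUGIN: "monitoring-team",
    UnknownCause.TARGET: "sysadmins",
}


def notify_group(cause: UnknownCause) -> str:
    """Return the team to notify for a given cause."""
    return RECIPIENTS[cause]


def counts_against_owner_sla(cause: UnknownCause) -> bool:
    """Only the target-side cause is the responsibility of the service owner."""
    return cause is UnknownCause.TARGET
```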

ddurieux commented 7 years ago

I don't want another state. To see whether a check has been done, look at the last_check date. Adding this state will bring complexity and cases we haven't thought of...

mohierf commented 7 years ago

@ddurieux :

as of now, we have many tests to guard against unexpected regressions
state management is quite grouped in the code and I think it is not that hard to add another state
I think that @fjvt and @spea1 have good reasons for this feature 😉

spea1 commented 7 years ago

I think there should be more "states" possible, for example to integrate new things in the future.

I would also allow the user to change the status mapping, e.g. OK = ERROR, ERROR = WARNING, ...

fjvt commented 7 years ago

@spea1

Is that last point something like this?

http://shinken.readthedocs.io/en/latest/07_advanced/result-modulations.html

mohierf commented 7 years ago

I agree with @fjvt on this. It is not the role of the framework to do such things.
As for all the new states, the answer is the same. Please have a look at the result modulations feature, which would probably allow such things.

fjvt commented 7 years ago

@mohierf Result modulations (are they in Alignak? I assume they are, since Alignak is a fork) will only cover the modification of states.

Result modulation will not allow what both of us wanted (i.e. splitting unknown into more than one state, or having more states).

It only allows to switch between the 4 existing states (0 ok, 1 warning, 2 critical, 3 unknown)
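The limitation can be sketched as follows. This is not the Shinken or Alignak implementation, just an illustration of the principle under the assumption that a modulation replaces one of the four exit codes with another; the function `modulate` and its parameters are hypothetical.

```python
# The four states a result modulation can switch between, as listed above.
SERVICE_EXIT_CODES = {0: "ok", 1: "warning", 2: "critical", 3: "unknown"}


def modulate(exit_code, codes_to_match, replacement):
    """Replace exit_code with replacement when it matches (illustrative only)."""
    # The replacement must itself be one of the four existing states,
    # so a modulation can remap results but never introduce a new state.
    if replacement not in SERVICE_EXIT_CODES:
        raise ValueError("replacement must be one of 0, 1, 2, 3")
    return replacement if exit_code in codes_to_match else exit_code
```

For example, `modulate(2, {2}, 1)` turns a critical into a warning, but there is no value that could stand for a second kind of unknown.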

Having only 4 types of state is, imo, not dynamic enough. In today's modern environments, and combined with business rules, you need more states...

Services are not always 'up' or 'down'; they can be somewhere in between.

See also my example about the 'unknown' status... It is impossible to detect whether the unknown comes from the platform or from the target (well, it is possible to detect this in your plugin, BUT you can't have Alignak notify different people, since warning/critical are already used...)

I feel that adding more states would make Alignak a lot more future-proof...

xkilian commented 7 years ago

I thought there was a Pending state as in Shinken? The pending state (same as untested) is a valid state.

mohierf commented 7 years ago

@xkilian: in Shinken, the Pending state is only used as long as a service check has not yet been launched. As far as I remember, this state is only a running state, because while in the Pending state the host/service has its initial state, as defined in the configuration files.

But I like your idea to have a Pending state, indicating that the host/service has not yet been checked 😉

ddurieux commented 7 years ago

We removed this pending state in favour of the initial_state, which we defined by default as UNREACHABLE for hosts and services... I don't really agree with adding a new state.

Alignak-monitoring / alignak

Alignak states management - to be discussed #849