centreon / centreon-engine

Extremely fast monitoring scheduler, forked from Nagios
GNU General Public License v2.0

MAJOR BUG: Notifications randomly aren't generated/sent #53

Open btassite opened 8 years ago

btassite commented 8 years ago

CES 3.3, updated to Centreon 2.7.5.

THIS IS A MAJOR BUG - Monitoring without reliable notification is useless! During testing of dependencies (and later escalations) I noticed that service notifications are almost never sent at all. Removing the dependencies does not make a difference. Removing the escalations does not make a difference.

Specific test case: host X with service Y on it, both set to 1-minute check, retry, and notification intervals. When host X's network goes down, a host alert and a host notification appear in the log, and so does a service alert, but no service notification is logged or sent, and no timeout or other error is logged.

While setting up an escalation, I specified both host X and service Y. Notifications are only sent out for the host state, not for the service, EXCEPT once, when I deleted the host from the escalation definition and left the service. After I changed other parameters (retry interval etc.), it reverted to NOT sending out notifications for the service.

The Centreon server currently monitors ~600 servers with almost 1900 services, but load, memory, storage etc. are all OK. I do see service notifications for other services, so it's not a mail setup issue or similar. Upping the mail notification timeout from 30 to 60 seconds makes no difference (and no timeout errors appear in the log anyway, so this is probably not necessary to begin with).

(Also posted on the Centreon forum, which doesn't seem to be read by devs: https://forum.centreon.com/forum/plugins-aa/notification-et-escalation/141942-major-bug-notifications-are-unreliable)

Update: since then I have been testing some more, and it seems pretty random whether notifications are sent for my test service, with the host notifications being more reliable. Sometimes notifications are sent out for both; then, when the host and service come back and go down again, only for the host.

btassite commented 8 years ago

Update: after some more testing, I have determined that notifications for the service are not sent out if the host check changes status first. In other words, a service with notifications configured sometimes DOES and sometimes DOES NOT send them, depending on the timing of the checks.

This timing is not fixed either. I've been shutting a switch port to test a failure condition for a host and a service on that host. In one iteration I get a service and a host notification; in another iteration of opening and closing the port I get only a host notification, without touching Centreon, simply because the engine scheduled the checks the other way around.

This issue may be clouded by the fact that I was testing dependencies and escalations before, but these definitions have been removed (they weren't working reliably either, quite possibly due to the same issue). If this is the problem, then we have database corruption that needs to be manually fixed.

To make sure it is not a dependency/escalation issue, I added a new host with a different name but the same IP address, plus a service check on it. Now I get the escalation notifications that I was testing with the first host and have since deleted! So this is definitely a database relation issue: it seems a relation is made purely based on IP address and, secondarily, a no longer existing escalation is actively used.

So there seem to be three separate bugs here:

  1. service notifications not sent out for the same host if host check changes status first
  2. dependencies and/or escalations aren't deleted from the database
  3. host definitions are confused when they have the same IP address

Please provide guidance on how to determine what part of the database is affected and how to fix it. This is a production system!

btassite commented 8 years ago

Update: tested with a different host/service on a different IP for which I have never defined any dependencies/escalations. Point 1 remains valid, i.e. service notifications aren't generated if the host check/notification happens first.

bouda1 commented 7 years ago

Hi,

Here are answers to the three reported bugs:

  1. service notifications not sent out for the same host if host check changes status first

When a host is down, service notifications are not sent.

  2. dependencies and/or escalations aren't deleted from the database

This should be fixed in recent versions.

  3. host definitions are confused when they have the same IP address

Could you send a detailed scenario in which this confusion occurs?

Thanks.
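The answer to point 1, combined with the timing observation reported earlier in the thread, can be illustrated with a toy model. This is only a sketch and not centreon-engine code: the class `ToyEngine`, its methods, and the `run` helper are hypothetical names, and the single rule it encodes (no service notification once the host is already known to be down) is taken from the reply above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEngine:
    """Hypothetical toy model of the behaviour described in this thread."""
    host_known_down: bool = False
    sent: list = field(default_factory=list)

    def host_check_fails(self):
        # The host check notices the outage and a host notification goes out.
        self.host_known_down = True
        self.sent.append("host notification")

    def service_check_fails(self):
        # Assumed rule: service notifications are suppressed once the host
        # is already known to be down.
        if not self.host_known_down:
            self.sent.append("service notification")


def run(order):
    """Replay the failing checks in the given order on a fresh toy engine."""
    engine = ToyEngine()
    for step in order:
        getattr(engine, step)()
    return engine.sent


# Service check happens to run first: both notifications are sent.
print(run(["service_check_fails", "host_check_fails"]))
# Host check happens to run first: only the host notification is sent.
print(run(["host_check_fails", "service_check_fails"]))
```

Under this assumption, the same outage yields one or two notifications depending purely on which check the scheduler runs first, which matches the seemingly random results in the report.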

btassite commented 7 years ago

ad 1) If this is "by design/won't fix", please tag it as an RFE; the behaviour should be optional. There are use cases where you still need the service notifications, e.g. when the host runs various services that belong to other departments, and notifications go out to different people for the host and for the services running on it.

ad 3) as detailed above:

I added a new host with a different name but the same IP address, plus a service check on it. Now I get the escalation notifications that I was testing with the first host and have since deleted! So this is definitely a database relation issue: it seems a relation is made purely based on IP address and, secondarily, a no longer existing escalation is actively used.

ganoze commented 7 years ago

@lpinsivy what do you think about point 1) above?

oragain commented 5 years ago

As a user, I agree with 1): service notifications should still be sent even if the host is down. Service notifications can be targeted at the application business team while host notifications go to the infrastructure team.
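The requested opt-in behaviour could look roughly like the sketch below. It is purely illustrative: the function `recipients`, the flag `notify_services_when_host_down`, and the contact names are all hypothetical and do not correspond to any existing centreon-engine option.

```python
def recipients(host_down, service_problem, host_contacts, service_contacts,
               notify_services_when_host_down=False):
    """Return who would be notified under the proposed (hypothetical) option."""
    notified = []
    if host_down:
        # Host problems always go to the host's own contacts.
        notified.extend(host_contacts)
    # Service contacts are notified if the host is up, or if the
    # hypothetical opt-in flag asks for notifications even when it is down.
    if service_problem and (not host_down or notify_services_when_host_down):
        notified.extend(service_contacts)
    return notified


# Behaviour as described by the maintainers: only the infrastructure team hears about it.
print(recipients(True, True, ["infra-team"], ["app-team"]))
# With the proposed flag enabled: both teams are notified.
print(recipients(True, True, ["infra-team"], ["app-team"],
                 notify_services_when_host_down=True))
```

Keeping the flag off by default would preserve the current behaviour for existing installations.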