centreon / centreon-engine

Extremely fast monitoring scheduler, forked from Nagios
GNU General Public License v2.0
Engine stop working on Pollers (randomly?) #556

Closed by joschi99 2 years ago

joschi99 commented 3 years ago

Poller

centreon-poller-centreon-engine-21.04.5-1.el7.centos.noarch
centreon-engine-daemon-21.04.3-5.el7.centos.x86_64
centreon-engine-21.04.3-5.el7.centos.x86_64
centreon-engine-extcommands-21.04.3-5.el7.centos.x86_64
centreon-broker-core-21.04.3-1.el7.centos.x86_64
centreon-broker-storage-21.04.3-1.el7.centos.x86_64
centreon-broker-21.04.3-1.el7.centos.x86_64
centreon-broker-cbmod-21.04.3-1.el7.centos.x86_64

Central

centreon-engine-extcommands-21.04.3-5.el7.centos.x86_64
centreon-base-config-centreon-engine-21.04.5-1.el7.centos.noarch
centreon-poller-centreon-engine-21.04.5-1.el7.centos.noarch
centreon-widget-engine-status-21.04.0-1.el7.centos.noarch
centreon-engine-daemon-21.04.3-5.el7.centos.x86_64
centreon-engine-21.04.3-5.el7.centos.x86_64
centreon-broker-cbd-21.04.3-1.el7.centos.x86_64
centreon-broker-graphite-21.04.3-1.el7.centos.x86_64
centreon-broker-21.04.3-1.el7.centos.x86_64
centreon-broker-cbmod-21.04.3-1.el7.centos.x86_64
centreon-broker-storage-21.04.3-1.el7.centos.x86_64
centreon-broker-influxdb-21.04.3-1.el7.centos.x86_64
centreon-broker-core-21.04.3-1.el7.centos.x86_64

Issue description

We randomly hit a problem on different Pollers: the engine stops without logging any error and the Poller no longer works. After restarting the engine, everything works normally again. In one or two cases, more than one Poller (at different sites) was affected at the same time, so I suspect that the problem in the engine/broker on the Pollers is triggered by something on the Central. The centengine log shows no error, but we found the following entries in the Broker log at the time the engine stops:

Poller

[2021-09-08 02:03:54.219] [core] [error] failover: global error: Connection lost
[2021-09-08 02:03:54.220] [core] [info] BBDO: unable to send stop message to peer, it is already stopped: Connection lost
[2021-09-08 08:16:03.421] [core] [info] /var/log/centreon-broker/Poller Barozzi BFB1.log : log started

The engine stopped at 2021-09-08 02:03:54 and we restarted it at 2021-09-08 08:16:03. The other Pollers show the same logs.

Central

[2021-09-08 02:04:24.356] [core] [error] failover: global error: Connection lost
[2021-09-08 02:04:24.356] [core] [info] BBDO: unable to send stop message to peer, it is already stopped: Connection lost

Any ideas about this? It currently happens on different Pollers, and we are not able to reproduce the problem at the moment.
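One way to check the suspicion that the Central is involved is to line up the "Connection lost" errors from both sides and see which one logged it first. The following is only a minimal sketch in Python, fed with the log lines quoted above; the function name `connection_lost_times` and the line regex are illustrative assumptions, not part of Centreon Broker.

```python
import re
from datetime import datetime

# Assumed layout of a broker log line, based on the excerpts above:
# [2021-09-08 02:03:54.219] [core] [error] failover: global error: Connection lost
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<logger>[^\]]+)\] \[(?P<level>[^\]]+)\] (?P<msg>.*)$"
)


def connection_lost_times(lines):
    """Return the timestamps of error-level lines that mention 'Connection lost'."""
    times = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        if match.group("level") == "error" and "Connection lost" in match.group("msg"):
            times.append(datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f"))
    return times


# Log lines copied from the Poller and Central excerpts in this issue.
poller_log = [
    "[2021-09-08 02:03:54.219] [core] [error] failover: global error: Connection lost",
    "[2021-09-08 02:03:54.220] [core] [info] BBDO: unable to send stop message to peer, it is already stopped: Connection lost",
    "[2021-09-08 08:16:03.421] [core] [info] /var/log/centreon-broker/Poller Barozzi BFB1.log : log started",
]
central_log = [
    "[2021-09-08 02:04:24.356] [core] [error] failover: global error: Connection lost",
    "[2021-09-08 02:04:24.356] [core] [info] BBDO: unable to send stop message to peer, it is already stopped: Connection lost",
]

poller_times = connection_lost_times(poller_log)
central_times = connection_lost_times(central_log)

# Positive delay means the Poller logged the lost connection before the Central did.
delay = (central_times[0] - poller_times[0]).total_seconds()
print(f"Poller: {poller_times[0]}  Central: {central_times[0]}  delay: {delay:.3f}s")
```

With the lines above, the Central logs the lost connection roughly 30 seconds after the Poller. That ordering would fit the engine on the Poller stopping first, but it does not by itself rule out a trigger coming from the Central.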

joschi99 commented 3 years ago

We had another case today at 15:37:

systemctl status centengine.service -l
● centengine.service - Centreon Engine
   Loaded: loaded (/usr/lib/systemd/system/centengine.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since Mon 2021-09-20 15:37:57 CEST; 34min ago
  Process: 27866 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
  Process: 24991 ExecStart=/usr/sbin/centengine /etc/centreon-engine/centengine.cfg (code=killed, signal=ABRT)
 Main PID: 24991 (code=killed, signal=ABRT)

Sep 20 15:15:03 i-vertix-bfb centreon-engine[24991]: [1632143703] [24991] SERVICE ALERT: VCENTER01;datastore-io;WARNING;SOFT;2;WARNING: Datastore 'Datastore Normal Performance(10K2)' : rate of reading data: 107.26 MB/s
Sep 20 15:16:03 i-vertix-bfb centreon-engine[24991]: [1632143763] [24991] SERVICE ALERT: VCENTER01;datastore-io;OK;HARD;1;OK: Total rate of reading data: 82.63 MB/s, Total rate of writing data: 7.49 MB/s - All datastores are ok
Sep 20 15:17:03 i-vertix-bfb centreon-engine[24991]: [1632143823] [24991] SERVICE ALERT: VCENTER01;VMware VM Datastore IOPS;OK;HARD;1;OK: All virtual machines are ok
Sep 20 15:17:13 i-vertix-bfb centreon-engine[24991]: [1632143833] [24991] SERVICE ALERT: SRVBFBRDS02;swap;WARNING;SOFT;1;WARNING: Swap Total: 23.00 GB Used: 19.03 GB (82.72%) Free: 3.97 GB (17.28%)
Sep 20 15:18:13 i-vertix-bfb centreon-engine[24991]: [1632143893] [24991] SERVICE ALERT: SRVBFBRDS02;swap;WARNING;SOFT;2;WARNING: Swap Total: 23.00 GB Used: 19.03 GB (82.72%) Free: 3.97 GB (17.28%)
Sep 20 15:19:13 i-vertix-bfb centreon-engine[24991]: [1632143953] [24991] SERVICE ALERT: SRVBFBRDS02;swap;WARNING;HARD;3;WARNING: Swap Total: 23.00 GB Used: 19.17 GB (83.37%) Free: 3.83 GB (16.63%)
Sep 20 15:37:57 i-vertix-bfb centengine[24991]: terminate called after throwing an instance of 'com::centreon::exceptions::msg_fmt'
Sep 20 15:37:57 i-vertix-bfb systemd[1]: centengine.service: main process exited, code=killed, status=6/ABRT
Sep 20 15:37:57 i-vertix-bfb systemd[1]: Unit centengine.service entered failed state.
Sep 20 15:37:57 i-vertix-bfb systemd[1]: centengine.service failed.

The engine failed with the message: terminate called after throwing an instance of 'com::centreon::exceptions::msg_fmt'
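For context, "terminate called after throwing an instance of ..." is what the GNU C++ runtime prints when an exception is never caught: `std::terminate` then calls `abort()`, which is why systemd reports `signal=ABRT`. To spot such crashes across several Pollers, the journal output can be scanned for both lines. The sketch below is an assumption-laden illustration in Python: `summarize_crash` and both regexes are hypothetical helpers written against the journal lines quoted above, not a Centreon tool.

```python
import re

# Line printed by the C++ runtime when an exception is not caught.
TERMINATE_RE = re.compile(
    r"terminate called after throwing an instance of '(?P<exc>[^']+)'"
)
# Line printed by systemd when the main process dies (format as seen above).
EXIT_RE = re.compile(
    r"main process exited, code=(?P<code>\w+), status=(?P<status>\d+)/(?P<signal>\w+)"
)


def summarize_crash(lines):
    """Collect the uncaught exception type and the exit signal from journal lines."""
    summary = {}
    for line in lines:
        match = TERMINATE_RE.search(line)
        if match:
            summary["exception"] = match.group("exc")
        match = EXIT_RE.search(line)
        if match:
            summary["code"] = match.group("code")
            summary["status"] = int(match.group("status"))
            summary["signal"] = match.group("signal")
    return summary


# Journal lines copied from the systemctl status output in this issue.
journal = [
    "Sep 20 15:37:57 i-vertix-bfb centengine[24991]: terminate called after throwing an instance of 'com::centreon::exceptions::msg_fmt'",
    "Sep 20 15:37:57 i-vertix-bfb systemd[1]: centengine.service: main process exited, code=killed, status=6/ABRT",
    "Sep 20 15:37:57 i-vertix-bfb systemd[1]: Unit centengine.service entered failed state.",
]

crash = summarize_crash(journal)
print(crash)
```

The exception type alone does not say where it was thrown; a core dump or a backtrace from the engine would be needed for that.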

joschi99 commented 2 years ago

Is there any news about this issue?

omercier commented 2 years ago

Hi @joschi99, do you still experience the same issue with our latest fixes?

I know it's not the first time I've asked you to update your packages, but we really have fixed a lot of issues recently. Regards

joschi99 commented 2 years ago

Hi @omercier, since we updated to the new versions, the problem has not appeared again. But please give me some more time to verify this, because in the past the problem also occurred randomly.

joschi99 commented 2 years ago

Hi @omercier, I will close the case; it seems solved. Since the update, we have never had this problem again.

Thank you very much