centreon / centreon-broker

A full-featured monitoring event broker, compatible with MySQL, RRDtool, Graphite and more
Apache License 2.0

The central doesn’t have any updates, nothing written to the database #585

Closed StefThomas closed 2 years ago

StefThomas commented 3 years ago

Hi,

I’ve upgraded an up-to-date 19.04 platform to 20.04.13 (following this documentation: https://docs.centreon.com/20.04/fr/upgrade/upgrade-from-19-04.html). While I had no problem doing so on another platform, on this one Centreon isn’t working anymore. I use ZMQ as the Gorgone communication protocol. There doesn’t seem to be any problem with Gorgone. If I order a restart from the UI (central), the engines are restarted, even on the remote poller, but the central doesn’t see any update:

image

gorgoned’s log shows no error. I have no error in /var/opt/rh/rh-php72/log/php-fpm/centreon-error.log either.

I disabled the remote pollers while I’m trying to get rid of the problem.

I also removed the cache, as advised by support:

systemctl stop cbd && rm -f /var/lib/centreon-broker/{*cache*,*new*,*unprocess*} && systemctl start cbd

All the monitoring data is piling up in /var/lib/centreon-broker/:

[root@SR082501CTI3700 ~]# ls -la /var/lib/centreon-broker/
total 2426788
drwxrwxr-x.  3 centreon-broker centreon-broker     4096 14 mai   06:46 .
drwxr-xr-x. 39 root            root                4096 10 mai   10:41 ..
-rw--w----   1 centreon-broker centreon-broker        5 18 nov.   2019 .bash_history
-rw-rw-r--   1 centreon-broker centreon-broker 99999952 12 mai   18:34 central-broker-master.queue.central-broker-master-perfdata14
-rw-rw-r--   1 centreon-broker centreon-broker 99999956 12 mai   22:13 central-broker-master.queue.central-broker-master-perfdata15
-rw-rw-r--   1 centreon-broker centreon-broker 99999852 13 mai   01:47 central-broker-master.queue.central-broker-master-perfdata16
-rw-rw-r--   1 centreon-broker centreon-broker 99999799 13 mai   05:24 central-broker-master.queue.central-broker-master-perfdata17
-rw-rw-r--   1 centreon-broker centreon-broker 99999816 13 mai   09:01 central-broker-master.queue.central-broker-master-perfdata18
-rw-rw-r--   1 centreon-broker centreon-broker 99999986 13 mai   12:37 central-broker-master.queue.central-broker-master-perfdata19
-rw-rw-r--   1 centreon-broker centreon-broker 99999812 13 mai   16:15 central-broker-master.queue.central-broker-master-perfdata20
-rw-rw-r--   1 centreon-broker centreon-broker 99999729 13 mai   19:52 central-broker-master.queue.central-broker-master-perfdata21
-rw-rw-r--   1 centreon-broker centreon-broker 99999920 13 mai   23:30 central-broker-master.queue.central-broker-master-perfdata22
-rw-rw-r--   1 centreon-broker centreon-broker 99999758 14 mai   03:06 central-broker-master.queue.central-broker-master-perfdata23
-rw-rw-r--   1 centreon-broker centreon-broker 99999723 14 mai   06:44 central-broker-master.queue.central-broker-master-perfdata24
-rw-rw-r--   1 centreon-broker centreon-broker 72282354 14 mai   09:20 central-broker-master.queue.central-broker-master-perfdata25
-rw-rw-r--   1 centreon-broker centreon-broker 99999926 12 mai   18:38 central-broker-master.queue.central-broker-master-sql14
-rw-rw-r--   1 centreon-broker centreon-broker 99999888 12 mai   22:15 central-broker-master.queue.central-broker-master-sql15
-rw-rw-r--   1 centreon-broker centreon-broker 99999871 13 mai   01:51 central-broker-master.queue.central-broker-master-sql16
-rw-rw-r--   1 centreon-broker centreon-broker 99999862 13 mai   05:28 central-broker-master.queue.central-broker-master-sql17
-rw-rw-r--   1 centreon-broker centreon-broker 99999720 13 mai   09:04 central-broker-master.queue.central-broker-master-sql18
-rw-rw-r--   1 centreon-broker centreon-broker 99999813 13 mai   12:43 central-broker-master.queue.central-broker-master-sql19
-rw-rw-r--   1 centreon-broker centreon-broker 99999732 13 mai   16:18 central-broker-master.queue.central-broker-master-sql20
-rw-rw-r--   1 centreon-broker centreon-broker 99999942 13 mai   19:57 central-broker-master.queue.central-broker-master-sql21
-rw-rw-r--   1 centreon-broker centreon-broker 99999875 13 mai   23:33 central-broker-master.queue.central-broker-master-sql22
-rw-rw-r--   1 centreon-broker centreon-broker 99999763 14 mai   03:11 central-broker-master.queue.central-broker-master-sql23
-rw-rw-r--   1 centreon-broker centreon-broker 99999765 14 mai   06:46 central-broker-master.queue.central-broker-master-sql24
-rw-rw-r--   1 centreon-broker centreon-broker 70606267 14 mai   09:20 central-broker-master.queue.central-broker-master-sql25
prw-rw-r--   1 centreon-broker centreon-broker        0 14 mai   09:17 central-broker-master-stats.json
-rw-rw-r--   1 centreon-broker centreon-broker        8 12 mai   16:40 central-broker-master.unprocessed
-rw-rw-r--   1 centreon-broker centreon-broker  5543945 10 mai   11:00 central-broker.memory.central-broker-perfdata
-rw-rw-r--   1 centreon-broker centreon-broker  5543945 10 mai   11:00 central-broker.memory.central-broker-sql
-rw-rw-r--   1 centreon-broker centreon-broker  5543945 10 mai   11:00 central-broker.memory.centreon-broker-rrd
prw-rw-r--   1 centreon-broker centreon-broker        0 14 mai   09:17 central-rrd-master-stats.json
-rw-rw-r--   1 centreon-broker centreon-broker        8 12 mai   16:40 central-rrd-master.unprocessed
drwxrwxr-x   2 centreon-broker centreon-broker       28 10 juil.  2019 .config

(I already moved some of those files to a backup location so the filesystem would not fill up during yesterday’s holiday, hence the index not starting from 1.)
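To track how fast this backlog grows, the retention files can be totalled per queue. The following is a minimal sketch, assuming the naming pattern visible in the listing above (a queue name containing `.queue.` followed by an optional numeric suffix); `queue_backlog` is a hypothetical helper, not part of Centreon.

```python
import os
import re
from collections import defaultdict

# Pattern inferred from the listing above, e.g.
# "central-broker-master.queue.central-broker-master-sql14".
# The trailing number is treated as the file index.
QUEUE_FILE = re.compile(r"^(?P<queue>.+\.queue\..+?)(?P<index>\d*)$")


def queue_backlog(directory):
    """Return {queue name: (file count, total bytes)} for retention files."""
    totals = defaultdict(lambda: [0, 0])
    for entry in os.scandir(directory):
        match = QUEUE_FILE.match(entry.name)
        # Skip directories, FIFOs and anything that is not a queue file.
        if match and entry.is_file(follow_symlinks=False):
            totals[match["queue"]][0] += 1
            totals[match["queue"]][1] += entry.stat().st_size
    return {name: tuple(value) for name, value in totals.items()}


if __name__ == "__main__":
    for name, (count, size) in sorted(queue_backlog("/var/lib/centreon-broker").items()):
        print(f"{name}: {count} files, {size / 1e6:.1f} MB")
```

Running it twice a few minutes apart shows whether the SQL and perfdata queues are still growing or have started to drain.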

I activated full debug on the broker. This is the only error message I have, in /var/log/centreon-broker/central-broker-master.log :

[1620977034] error:   feeder: error occured while processing client 'central-broker-master-input-4': Attempt to read data from peer 127.0.0.1:43110 on a closing socket

The suffix increases each time I order another restart:

[1620977163] error:   feeder: error occured while processing client 'central-broker-master-input-5': Attempt to read data from peer 127.0.0.1:44098 on a closing socket
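The bracketed values in these log lines are Unix epoch timestamps, which makes them hard to line up with the `ls` output. A small sketch for converting them, assuming every line starts with a 10-digit epoch in square brackets (`humanize` is a hypothetical helper name):

```python
import re
from datetime import datetime, timezone

# A leading "[1620977034]" style stamp, as seen in the broker logs above.
STAMP = re.compile(r"^\[(\d{10})\]")


def humanize(line):
    """Replace a leading [epoch] stamp with an ISO 8601 UTC timestamp."""
    def repl(match):
        moment = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
        return "[" + moment.isoformat() + "]"
    return STAMP.sub(repl, line)


if __name__ == "__main__":
    import sys
    for raw in sys.stdin:
        print(humanize(raw.rstrip("\n")))
```

For example, `[1620977034]` becomes `[2021-05-14T07:23:54+00:00]`, which is 09:23:54 in the CEST local time used by the directory listing.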

In central-module-master.log I have this message:

[1620977235] config:  stats: cannot stat() '/var/lib/centreon-engine/central-module-master-stats.json': No such file or directory

But the file exists :

# ls -la /var/lib/centreon-engine/central-module-master-stats.json
prw-rw-r-- 1 centreon-engine centreon-engine 0 14 mai   09:27 /var/lib/centreon-engine/central-module-master-stats.json
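The leading `p` in the permission string means the path is a named pipe (FIFO), not a regular file. The sketch below only confirms what kind of object sits at a path; it does not explain why the module logged the `stat()` message. `describe` is a hypothetical helper name.

```python
import os
import stat


def describe(path):
    """Classify a path as 'missing', 'fifo', 'regular file' or 'other'."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if stat.S_ISFIFO(mode):
        return "fifo"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"


if __name__ == "__main__":
    print(describe("/var/lib/centreon-engine/central-module-master-stats.json"))
```

Running it as the same user as the engine shows whether that user sees the path the same way as root does.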

I also had this message :

/var/log/centreon-broker/central-broker-master.log-20210513:[1620830471] error:   conflict_manager: error in the main loop: storage: insertion of metric 'tcp.response.time.seconds' of index 4899198 failed: Duplicate entry '333127' for key 'PRIMARY'

But it does not seem to be related. That is what support told me, and I can’t see it anymore today. It was probably caused by some of the data I moved to free up space on /var/lib/centreon-broker/.

I’ll keep trying to troubleshoot this issue and check the broker settings once more, but I’m out of ideas. I have already checked the configuration multiple times.

# rpm -qa |egrep 'centreon-(broker|web|engine|gorgone)'
centreon-engine-daemon-20.04.11-2.el7.centos.x86_64
centreon-broker-cbd-20.04.13-4.el7.centos.x86_64
centreon-broker-core-20.04.13-4.el7.centos.x86_64
centreon-web-20.04.13-1.el7.centos.noarch
centreon-broker-20.04.13-4.el7.centos.x86_64
centreon-broker-cbmod-20.04.13-4.el7.centos.x86_64
centreon-gorgone-centreon-config-20.04.10-1.el7.centos.noarch
centreon-base-config-centreon-engine-20.04.13-1.el7.centos.noarch
centreon-broker-storage-20.04.13-4.el7.centos.x86_64
centreon-poller-centreon-engine-20.04.13-1.el7.centos.noarch
centreon-engine-20.04.11-2.el7.centos.x86_64
centreon-gorgone-20.04.10-1.el7.centos.noarch
centreon-engine-extcommands-20.04.11-2.el7.centos.x86_64

The other platform was upgraded from the latest 19.04 to Centreon 20.04.12 (some weeks ago), but it has since been updated to these same versions (20.04.13) and I experience no problem on that one.

StefThomas commented 3 years ago

More information: in fact, while Broker is at the same version on both platforms, Centreon Web is not. I have 20.04.12 on the platform that works, and 20.04.13 on the platform that doesn’t.

I’ll try to downgrade, if I can.

StefThomas commented 3 years ago

The downgrade didn’t fix the issue. Also, I upgraded to centreon-web 20.04.13 on the working platform and I see no bug on that one.

The error messages I have are in centreon-broker-master.log and centreon-rrd-master.log and are the following (they repeat a lot, so I pasted only a few here):

[1620979447] error:   SQL: service group 73 does not exist - insertion before insertion of members
[1620979448] error:   mysql_connection: could not store host group membership:  Duplicate entry '5240-243' for key 'host_id'
[1620979518] error:   RRD: ignored update error in file '/var/lib/centreon/status/5879818.rrd': /var/lib/centreon/status/5879818.rrd: illegal attempt to update using time 1620825350 when last update time is 1620825350 (minimum one second step)

Any idea?

StefThomas commented 3 years ago

After some more tests, I can tell that it keeps updating until the following error message appears in /var/log/centreon-broker/central-broker-master.log:

[1620991591] error:   mysql_connection: Duplicate entry '333148' for key 'PRIMARY'
[1620991591] error:   conflict_manager: error in the main loop: storage: insertion of metric 'count' of index 39066258 failed: Duplicate entry '333148' for key 'PRIMARY'
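To see which metrics and primary-key values are involved across a whole log file, the duplicate-entry lines can be extracted with a regular expression. This is a minimal sketch based only on the message format shown above; `duplicate_metric_errors` is a hypothetical helper name.

```python
import re

# Matches the conflict_manager message format quoted above.
DUPLICATE = re.compile(
    r"insertion of metric '(?P<metric>[^']*)' of index (?P<index>\d+) failed: "
    r"Duplicate entry '(?P<entry>[^']*)' for key '(?P<key>[^']*)'"
)


def duplicate_metric_errors(lines):
    """Return (metric name, index id, duplicate value, key name) tuples."""
    found = []
    for line in lines:
        match = DUPLICATE.search(line)
        if match:
            found.append(
                (match["metric"], int(match["index"]), match["entry"], match["key"])
            )
    return found


if __name__ == "__main__":
    with open("/var/log/centreon-broker/central-broker-master.log") as log:
        for item in duplicate_metric_errors(log):
            print(item)
```

If the duplicate values keep increasing from one restart to the next (333127 earlier, 333148 here), that is worth mentioning to support.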

Then nothing more is written to this file. The cbd and cbwd processes are still running:

● cbd.service - Centreon Broker watchdog
   Loaded: loaded (/etc/systemd/system/cbd.service; enabled; vendor preset: disabled)
   Active: active (running) since ven. 2021-05-14 13:26:01 CEST; 23min ago
  Process: 22318 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
 Main PID: 32494 (cbwd)
   CGroup: /system.slice/cbd.service
           ├─32494 /usr/sbin/cbwd /etc/centreon-broker/watchdog.json
           ├─32495 /usr/sbin/cbd /etc/centreon-broker/central-broker.json
           └─32496 /usr/sbin/cbd /etc/centreon-broker/central-rrd.json

mai 14 13:26:01 SR082501CTI3700.hm.dm.ad systemd[1]: Started Centreon Broker watchdog.
mai 14 13:26:01 SR082501CTI3700.hm.dm.ad cbwd[32494]: [1620991561] info:    file: module for Centreon Broker 20.04.13
mai 14 13:26:01 SR082501CTI3700.hm.dm.ad cbwd[32494]: [1620991561] config:  log applier: applying 1 logging objects
mai 14 13:26:01 SR082501CTI3700.hm.dm.ad cbwd[32494]: [1620991561] info:    file: module for Centreon Broker 20.04.13
mai 14 13:26:01 SR082501CTI3700.hm.dm.ad cbwd[32494]: [1620991561] config:  log applier: applying 1 logging objects

I get this error every time I restart the cbd service.

I manually forced some checks, and I can see that some data was updated until the error above occurred:

image

Sims24 commented 3 years ago

@StefThomas I sent you a Webex link through the support portal. Please join it so we can troubleshoot this together.

Simon

joschi99 commented 3 years ago

We have the same issue on the latest Broker 20.10.5 and Centreon Web 20.10.8. On our system, the problem occurs when you add a host to a host group. This triggers the problem on the central broker and blocks all other events. Is there a hotfix available?

joschi99 commented 3 years ago

This only happens with host groups created after the update.

joschi99 commented 3 years ago

I have done some tests. The broker has problems updating or inserting into the table hosts_hostgroups:

  1. Created a new host without a host group.
  2. Exported to the poller.
  3. The host is present in table hosts.
  4. Created a new host group.
  5. Added the host to the host group.
  6. Exported to the poller.
  7. The host is present in table hosts and the host group is present in table hostgroups, but no relation between host and host group was inserted into table hosts_hostgroups.

joschi99 commented 3 years ago

I have done some other tests; the problem is also present on the latest Centreon 21.04:

[2021-07-02 08:29:37.850] [core] [debug] accept ('central-broker-master-input') failed.
[2021-07-02 08:29:40.850] [core] [debug] accept ('central-broker-master-input') failed.
[2021-07-02 08:29:43.088] [sql] [error] mysql_connection: could not store host group membership:  Lock wait timeout exceeded; try restarting transaction
[2021-07-02 08:29:43.851] [core] [debug] accept ('central-broker-master-input') failed.
[2021-07-02 08:29:46.851] [core] [debug] accept ('central-broker-master-input') failed.

The problem is not limited to assigning hosts to a newly created host group. It also happens with existing host groups, though not every time; it seems sporadic. But when it happens, the broker is blocked.

joschi99 commented 3 years ago

In this central-broker-master debug log, you can see the issue starting at [2021-07-02 08:39:35]:

central-broker-master.zip

joschi99 commented 3 years ago

Any news about this problem? It also occurs with the latest updates.

omercier commented 2 years ago

Hi all, we have recently released quite a lot of fixes for broker connection stability issues (see release notes). Do you still face such errors after updating to the latest 20.10/21.04? (The fixes will be released in 20.04 as well quite soon.) Regards,