centreon / centreon-archived

Centreon is a network, system and application monitoring tool. Centreon is the only AIOps Platform Providing Holistic Visibility to Complex IT Workflows from Cloud to Edge.
https://www.centreon.com
GNU General Public License v2.0

All hosts and services Status in the top counter and the monitoring page flapping to UNKNOWN then back to normal every few minutes #12048

Closed ericMoulot closed 1 year ago

ericMoulot commented 1 year ago

BUG REPORT INFORMATION

Prerequisites

Versions

$ rpm -qa | grep centreon | egrep -v "(plugin|pack)" | sort

centreon-auto-discovery-server-21.10.3-2.el8.noarch
centreon-base-config-centreon-engine-21.10.10-1.el8.noarch
centreon-broker-21.10.3-3.el8.x86_64
centreon-broker-cbd-21.10.3-3.el8.x86_64
centreon-broker-cbmod-21.10.3-3.el8.x86_64
centreon-broker-core-21.10.3-3.el8.x86_64
centreon-broker-storage-21.10.3-3.el8.x86_64
centreon-clib-21.10.3-3.el8.x86_64
centreon-common-21.10.10-1.el8.noarch
centreon-connector-21.10.3-3.el8.x86_64
centreon-connector-perl-21.10.3-3.el8.x86_64
centreon-connector-ssh-21.10.3-3.el8.x86_64
centreon-engine-21.10.3-3.el8.x86_64
centreon-engine-daemon-21.10.3-3.el8.x86_64
centreon-engine-extcommands-21.10.3-3.el8.x86_64
centreon-gorgone-21.10.3-1.el8.noarch
centreon-gorgone-centreon-config-21.10.3-1.el8.noarch
centreon-license-manager-21.10.0-1.el8.noarch
centreon-license-manager-common-21.10.0-1.el8.noarch
centreon-perl-libs-21.10.10-1.el8.noarch
centreon-poller-centreon-engine-21.10.10-1.el8.noarch
centreon-pp-manager-21.10.0-2.el8.noarch
centreon-release-21.10-5.el8.noarch
centreon-trap-21.10.10-1.el8.noarch
centreon-web-21.10.10-1.el8.noarch
centreon-widget-engine-status-21.10.0-2.el8.noarch
centreon-widget-global-health-21.10.1-1.el8.noarch
centreon-widget-graph-monitoring-21.10.0-2.el8.noarch
centreon-widget-grid-map-21.10.0-2.el8.noarch
centreon-widget-hostgroup-monitoring-21.10.0-2.el8.noarch
centreon-widget-host-monitoring-21.10.0-2.el8.noarch
centreon-widget-httploader-21.10.0-2.el8.noarch
centreon-widget-live-top10-cpu-usage-21.10.0-2.el8.noarch
centreon-widget-live-top10-memory-usage-21.10.0-2.el8.noarch
centreon-widget-servicegroup-monitoring-21.10.0-2.el8.noarch
centreon-widget-service-monitoring-21.10.1-1.el8.noarch
centreon-widget-tactical-overview-21.10.0-2.el8.noarch

Operating System

RedHat 8

Browser used

Version: Firefox 106.0.1, Chrome 107.0.5304.63

Additional environment details (AWS, VirtualBox, physical, etc.): Virtual Machine ESXi

Description

I did a fresh install of Centreon 21.10.8, with a Central and Database server.

Soon after adding hosts and services to be monitored, I noticed that the statuses of hosts and services in the top counter go to UNKNOWN for a few seconds (about 5s to 10s) before coming back to normal. The same happens in the monitoring views. This happens at random every few minutes.

However, the real status of the servers is unchanged, and checks run on the command line from the Central poller are OK.

[Screenshot: UNKNOWN_7]

What I have tried:

I also noticed there is a JavaScript file called vendor.2d6b7428.js that makes a large number of status requests (one every 2s) to the API right after the first status requests initiated by the web page itself. I found it on the server at /usr/share/centreon/www/static/vendor.2d6b7428.js, and it is referenced in the header of the Centreon web page by this statement:

<script defer="defer" src="./static/vendor.2d6b7428.js"></script>

The flapping behavior persists.

Steps to Reproduce

  1. I logged in Centreon
  2. I reached the Monitoring View
  3. I observed for a few minutes to an hour.

Describe the received result

In the top counter, hosts and services go to UNKNOWN for a few seconds (about 5s to 10s) before coming back to normal. The same happens in the monitoring views. This happens at random every few minutes.

Describe the expected result

Host and service statuses should not flap between UNKNOWN and their actual status.

Logs

PHP error logs

For versions using PHP 7.2 or 7.3 on CentOS 8, or PHP 8

tail -f /var/log/php-fpm/centreon-error.log
[27-Oct-2022 19:22:53 Africa/Dakar] WARNING: Warning: Trying to access array offset on value of type null {"exception":"[object] (ErrorException(code: 0): Warning: Trying to access array offset on value of type null at /usr/share/centreon/src/Centreon/Infrastructure/Contact/ContactRepositoryRDB.php:252)"}

[27-Oct-2022 19:24:06 Africa/Dakar] WARNING: Warning: Undefined array key 50702 {"exception":"[object] (ErrorException(code: 0): Warning: Undefined array key 50702 at /usr/share/centreon/src/Centreon/Infrastructure/Contact/ContactRepositoryRDB.php:252)"}

[27-Oct-2022 19:24:06 Africa/Dakar] WARNING: Warning: Trying to access array offset on value of type null {"exception":"[object] (ErrorException(code: 0): Warning: Trying to access array offset on value of type null at /usr/share/centreon/src/Centreon/Infrastructure/Contact/ContactRepositoryRDB.php:252)"}

[27-Oct-2022 19:25:19 Africa/Dakar] WARNING: Warning: Undefined array key 50702 {"exception":"[object] (ErrorException(code: 0): Warning: Undefined array key 50702 at /usr/share/centreon/src/Centreon/Infrastructure/Contact/ContactRepositoryRDB.php:252)"}

[27-Oct-2022 19:25:19 Africa/Dakar] WARNING: Warning: Trying to access array offset on value of type null {"exception":"[object] (ErrorException(code: 0): Warning: Trying to access array offset on value of type null at /usr/share/centreon/src/Centreon/Infrastructure/Contact/ContactRepositoryRDB.php:252)"}

centreon-engine logs (if needed)

tail -f /var/log/centreon-engine/centengine.log
[1666928282] [469920] Processing object config file '/etc/centreon-engine/meta_timeperiod.cfg'
[1666928282] [469920] Processing object config file '/etc/centreon-engine/meta_host.cfg'
[1666928282] [469920] Processing object config file '/etc/centreon-engine/meta_services.cfg'
[1666928282] [469920] Reading resource file '/etc/centreon-engine/resource.cfg'
[1666928282] [469920] Warning: Notifier 'medanalytics02' has no notification time period defined!
[1666928282] [469920] Configuration reloaded, main loop continuing.
[1666928282] [469920] Reload configuration finished.
[1666961218] [469920] SERVICE ALERT: edennet-esxi1;Ping;CRITICAL;SOFT;1;CRITICAL - 10.187.171.84 lost 66% > 50%
[1666961508] [469920] SERVICE ALERT: edennet-esxi1;Ping;OK;SOFT;2;OK - 10.187.171.84 rta 0,788ms lost 0%
[1666961808] [469920] SERVICE ALERT: edennet-esxi1;Ping;OK;HARD;1;OK - 10.187.171.84 rta 0,778ms lost 0%

centreon-broker logs (if needed)

tail -f /var/log/centreon-broker/central-broker-master.log
[2022-10-24T12:40:13.200+00:00] [core] [info] main: configuration update requested
[2022-10-24T12:40:13.201+00:00] [core] [info] /var/log/centreon-broker//Central-broker.log : log started
[2022-10-24T12:40:13.201+00:00] [core] [info] modules: attempt to load '/usr/share/centreon/lib/centreon-broker/50-tcp.so' which is already loaded
[2022-10-24T12:40:13.201+00:00] [core] [info] modules: attempt to load '/usr/share/centreon/lib/centreon-broker/80-sql.so' which is already loaded
[2022-10-24T12:40:13.201+00:00] [core] [info] modules: attempt to load '/usr/share/centreon/lib/centreon-broker/20-storage.so' which is already loaded
[2022-10-24T12:40:13.201+00:00] [core] [info] modules: attempt to load '/usr/share/centreon/lib/centreon-broker/15-stats.so' which is already loaded
[2022-10-24T12:40:13.201+00:00] [core] [info] multiplexing: engine started
[2022-10-24T12:40:13.604+00:00] [sql] [error] SQL: host group 135 does not exist - insertion before insertion of members
[2022-10-24T12:40:13.611+00:00] [sql] [error] SQL: host group 70 does not exist - insertion before insertion of members
[2022-10-24T12:40:13.613+00:00] [sql] [error] SQL: host group 69 does not exist - insertion before insertion of members

centreon gorgone logs for Centreon >= 20.4 (if needed)

tail -f /var/log/centreon-gorgone/gorgoned.log
Accept: */*
X-AUTH-TOKEN: mRra76IpP34pyiDPlkgP+Pg0cec8QDt9LRTx3J0Xgx32cRO1A8HQsIvaF/J3yiDB
Content-Type: application/json; charset=utf-8
Accept-Type: application/json; charset=utf-8

2022-10-28 17:30:30 - DEBUG - => Recv header: HTTP/1.1 200 OK
2022-10-28 17:30:30 - DEBUG - => Recv header: Date: Fri, 28 Oct 2022 17:30:30 GMT
2022-10-28 17:30:30 - DEBUG - => Recv header: Server: Apache
2022-10-28 17:30:30 - DEBUG - => Recv header: Cache-Control: no-cache, private
2022-10-28 17:30:30 - DEBUG - => Recv header: Api-Version: 21.10
2022-10-28 17:30:30 - DEBUG - => Recv header: X-Frame-Options: sameorigin
2022-10-28 17:30:30 - DEBUG - => Recv header: Transfer-Encoding: chunked
2022-10-28 17:30:30 - DEBUG - => Recv header: Content-Type: application/json
2022-10-28 17:30:30 - DEBUG - => Recv header:
2022-10-28 17:30:30 - DEBUG - => Recv data: 604
{"web":{"version":"21.10.8","major":"21","minor":"10","fix":"8"},"modules":{"centreon-clapi":{"version":"1.5.0","major":"1","minor":"5","fix":"0"},"ndo-management":{"version":"1.1","major":"1","minor":"1","fix":"0"},"centreon-pp-manager":{"version":"21.10.0","major":"21","minor":"10","fix":"0"},"SAM":{"version":"1.5.1","major":"1","minor":"5","fix":"1"},"centreon-open-tickets":{"version":"19.10.0","major":"19","minor":"10","fix":"0"},"centreon-license-manager":{"version":"21.10.0","major":"21","minor":"10","fix":"0"}},"widgets":{"Host Monitoring":{"version":"21.10.0","major":"21","minor":"10","fix":"0"},"Service Monitoring":{"version":"21.10.1","major":"21","minor":"10","fix":"1"},"Graph Monitoring":{"version":"21.10.0","major":"21","minor":"10","fix":"0"},"Servicegroup Monitoring":{"version":"21.10.0","major":"21","minor":"10","fix":"0"},"Live Top 10 Memory Usage":{"version":"21.10.0","major":"21","minor":"10","fix":"0"},"Live Top 10 CPU Usage":{"version":"21.10.0","major":"21","minor":"10","fix":"0"},"HTTP Loader":{"version":"21.10.0","major":"21","minor":"10","fix":"0"},"Hostgroup Monitoring":{"version":"21.10.0","major":"21","minor":"10","fix":"0"},"Grid-map":{"version":"21.10.0","major":"21","minor":"10","fix":"0"},"Tactical Overview":{"version":"21.10.0","major":"21","minor":"10","fix":"0"},"Global Health":{"version":"21.10.1","major":"21","minor":"10","fix":"1"},"Engine-status":{"version":"21.10.0","major":"21","minor":"10","fix":"0"},"Open Tickets":{"version":"19.10.0","major":"19","minor":"10","fix":"0"}}}
0

2022-10-28 17:30:30 - DEBUG - == Info: Connection #7 to host 127.0.0.1 left intact

Additional relevant information (e.g. frequency, ...)

The flapping to UNKNOWN lasts about 5s to 10s each time and occurs at random every few minutes, both in the top counter and in the monitoring views.

ericMoulot commented 1 year ago

A solution was found in issue #5609: set the Instance timeout parameter (Configuration > Pollers > Broker configuration > Output > Instance timeout, or /etc/centreon-broker/central-broker.json) back to its default value. My team had previously set it to 20 seconds, which caused a race condition between the freshness verification task and the refresh interval for resource statuses, resulting in the flapping statuses.
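
For anyone who wants to check their own broker file for the same problem, here is a minimal sketch in Python. It walks a parsed broker JSON configuration and reports every instance timeout below a chosen threshold. The key name `instance_timeout`, the 300-second threshold, and the sample structure at the bottom are assumptions made for illustration. They are not taken from this issue, so compare them with your own /etc/centreon-broker/central-broker.json and with the default shown in the web UI.

```python
from __future__ import annotations

from typing import Any, Iterator

# Assumption: the broker JSON stores the setting under a key with this name.
TIMEOUT_KEY = "instance_timeout"
# Assumption: illustrative threshold in seconds, not a documented default.
MIN_SECONDS = 300


def find_timeouts(node: Any, path: str = "") -> Iterator[tuple[str, int]]:
    """Yield (json_path, seconds) for every TIMEOUT_KEY found in the tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}/{key}"
            if key == TIMEOUT_KEY:
                try:
                    yield child, int(value)
                except (TypeError, ValueError):
                    pass  # ignore values that are not plain integers
            else:
                yield from find_timeouts(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_timeouts(item, f"{path}/{index}")


def low_timeouts(config: Any, minimum: int = MIN_SECONDS) -> list[tuple[str, int]]:
    """Return the timeouts that are strictly below `minimum` seconds."""
    return [(where, secs) for where, secs in find_timeouts(config) if secs < minimum]


# Hypothetical sample shaped like a broker config; names are illustrative only.
sample = {
    "centreonBroker": {
        "output": [
            {"name": "example-sql-output", "type": "sql", "instance_timeout": "20"},
            {"name": "example-other-output", "type": "storage"},
        ]
    }
}

for where, secs in low_timeouts(sample):
    print(f"{where}: {secs}s is below {MIN_SECONDS}s")
```

To run it against a real file, load the file with `json.load` and pass the result to `low_timeouts`. The walk is recursive on purpose, so it does not depend on where the key sits in the file.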

Additional information for future readers: