centreon / centreon-archived

Centreon is a network, system and application monitoring tool. Centreon is the only AIOps Platform Providing Holistic Visibility to Complex IT Workflows from Cloud to Edge.
https://www.centreon.com
GNU General Public License v2.0

Communication broken from Central to Distant Poller #8897

Open wackou72 opened 4 years ago

wackou72 commented 4 years ago

Versions

Centreon 20.04.4

For the RPM based systems

centreon-widget-graph-monitoring-20.04.0-5.el7.centos.noarch
centreon-connector-perl-20.04.0-2.el7.centos.x86_64
centreon-license-manager-common-20.04.2-1.el7.centos.noarch
centreon-perl-libs-20.04.4-6.el7.centos.noarch
centreon-20.04.4-6.el7.centos.noarch
centreon-plugin-Applications-Monitoring-Centreon-Map4-Jmx-20200803-072952.el7.centos.noarch
centreon-plugin-Applications-Protocol-Dns-20200803-072952.el7.centos.noarch
centreon-widget-servicegroup-monitoring-20.04.0-5.el7.centos.noarch
centreon-widget-global-health-20.04.0-5.el7.centos.noarch
centreon-trap-20.04.4-6.el7.centos.noarch
centreon-plugin-Applications-Databases-Mysql-20200803-072952.el7.centos.noarch
centreon-widget-host-monitoring-20.04.4-4.el7.centos.noarch
centreon-broker-storage-20.04.7-1.el7.centos.x86_64
centreon-widget-tactical-overview-20.04.0-5.el7.centos.noarch
centreon-widget-engine-status-20.04.1-1.el7.centos.noarch
centreon-widget-live-top10-memory-usage-20.04.0-5.el7.centos.noarch
centreon-connector-20.04.0-2.el7.centos.x86_64
centreon-pp-manager-20.04.1-1.el7.centos.noarch
centreon-nrpe-plugin-2.15-4.el7.centos.x86_64
centreon-gorgone-20.04.3-1.el7.centos.noarch
centreon-engine-extcommands-20.04.4-1.el7.centos.x86_64
centreon-plugin-Applications-Monitoring-Centreon-Database-20200803-072952.el7.centos.noarch
centreon-plugin-Virtualization-Vmware2-Connector-Plugin-20200803-072952.el7.centos.noarch
centreon-auto-discovery-server-20.04.4-3.el7.centos.noarch
centreon-plugin-Hardware-Ups-Standard-Rfc1628-Snmp-20200803-072952.el7.centos.noarch
centreon-plugin-Operatingsystems-Linux-Snmp-20200803-072952.el7.centos.noarch
centreon-broker-20.04.7-1.el7.centos.x86_64
centreon-broker-cbmod-20.04.7-1.el7.centos.x86_64
centreon-plugins-base-1.18-2.el7.centos.noarch
centreon-release-20.04-1.el7.centos.noarch
centreon-widget-hostgroup-monitoring-20.04.0-5.el7.centos.noarch
centreon-connector-ssh-20.04.0-2.el7.centos.x86_64
centreon-license-manager-20.04.2-1.el7.centos.noarch
centreon-gorgone-centreon-config-20.04.3-1.el7.centos.noarch
centreon-web-20.04.4-6.el7.centos.noarch
centreon-engine-daemon-20.04.4-1.el7.centos.x86_64
centreon-base-config-centreon-engine-20.04.4-6.el7.centos.noarch
centreon-widget-service-monitoring-20.04.3-3.el7.centos.noarch
centreon-plugin-Hardware-Printers-Generic-Snmp-20200803-072952.el7.centos.noarch
centreon-plugin-Applications-Monitoring-Centreon-Central-20200803-072952.el7.centos.noarch
centreon-plugin-Applications-Monitoring-Centreon-Poller-20200803-072952.el7.centos.noarch
centreon-plugin-Operatingsystems-Windows-Snmp-20200803-072952.el7.centos.noarch
centreon-broker-cbd-20.04.7-1.el7.centos.x86_64
centreon-plugin-Virtualization-VMWare-daemon-3.1.2-20200602093832.el7.centos.noarch
centreon-widget-httploader-20.04.0-5.el7.centos.noarch
centreon-widget-live-top10-cpu-usage-20.04.0-5.el7.centos.noarch
centreon-poller-centreon-engine-20.04.4-6.el7.centos.noarch
centreon-plugin-Applications-Protocol-Http-20200803-072952.el7.centos.noarch
centreon-plugin-Network-Cisco-Standard-Snmp-20200803-072952.el7.centos.noarch
centreon-broker-core-20.04.7-1.el7.centos.x86_64
centreon-clib-20.04.0-7.el7.centos.x86_64
centreon-widget-grid-map-20.04.0-5.el7.centos.noarch
centreon-common-20.04.4-6.el7.centos.noarch
centreon-engine-20.04.4-1.el7.centos.x86_64
centreon-database-20.04.4-6.el7.centos.noarch
centreon-plugin-Applications-Protocol-Ldap-20200803-072952.el7.centos.noarch
centreon-plugin-Applications-Protocol-Ftp-20200803-072952.el7.centos.noarch

Operating System

CentOS 7.8.2003

Browser used

Description

After a certain amount of time (2 to 3 days), communication between the Central and the Distant Poller breaks. The Central still receives host and service information, but scheduled downtimes and applying a new configuration no longer work. Running systemctl restart cbd centengine gorgoned brings everything back to normal. Please note that I've updated to 20.04.4 and switched to the ZMQ protocol. After a restart, communication works again for a certain amount of time.

Steps to Reproduce

Configuration Pollers --> Export Configurations

Describe the received result

No configuration applied

Describe the expected result

New configuration should be applied

Logs

Let me know which logs are needed.

Additional relevant information (e.g. frequency, ...)

I looked at several log files and didn't see any error or defect. This looks like https://github.com/centreon/centreon/issues/8799

lpinsivy commented 4 years ago

Hi @wackou72 ,

Do you find errors in /var/log/centreon-gorgone/gorgoned.log?

wackou72 commented 4 years ago

Hello @lpinsivy, I didn't see any errors, only INFO messages. But I noticed something: there is a ping sent regularly, and a pong comes back from the other servers (I assume). The log since Monday morning shows this, and yesterday this was the last one:

2020-08-11 12:36:39 - INFO - [proxy] Send pings
2020-08-11 12:36:40 - INFO - [proxy] Received setlogs for '4'
2020-08-11 12:36:40 - INFO - [proxy] Pong received from '4'
2020-08-11 12:36:40 - INFO - [proxy] Received setlogs for '5'
2020-08-11 12:36:40 - INFO - [proxy] Pong received from '5'
2020-08-11 12:36:40 - INFO - [proxy] Received setlogs for '3'
2020-08-11 12:36:40 - INFO - [proxy] Pong received from '3'
2020-08-11 12:36:40 - INFO - [proxy] Received setlogs for '2'
2020-08-11 12:36:40 - INFO - [proxy] Pong received from '2'

Now I see only:

2020-08-11 16:16:14 - INFO - [proxy] Send pings
2020-08-11 16:17:34 - INFO - [proxy] Send pings
2020-08-11 16:18:54 - INFO - [proxy] Send pings

Without any pong answer.
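The ping/pong pattern described here can be checked automatically. Below is a minimal Python sketch, not part of Gorgone, that scans gorgoned.log lines and reports the last pong seen per node and how many ping rounds have gone out since then. The line layout is assumed from the excerpts quoted in this thread, and the function name summarize_pings is hypothetical.

```python
import re
from datetime import datetime

# Line layout assumed from the log excerpts quoted in this thread:
# "YYYY-MM-DD HH:MM:SS - LEVEL - [module] message"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (\w+) - \[(\w+)\] (.*)$"
)
PONG_RE = re.compile(r"^Pong received from '(\d+)'$")


def summarize_pings(lines):
    """Return (last pong time per node, ping rounds sent after the last pong)."""
    last_pong = {}
    unanswered = []
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip anything that does not look like a log line
        stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        module, message = m.group(3), m.group(4)
        if module != "proxy":
            continue
        if message == "Send pings":
            unanswered.append(stamp)
            continue
        pong = PONG_RE.match(message)
        if pong:
            last_pong[pong.group(1)] = stamp
            unanswered.clear()  # any pong means the link was still alive


    return last_pong, unanswered


# Sample taken from the excerpt quoted in this thread (shortened).
SAMPLE = """\
2020-08-11 12:36:39 - INFO - [proxy] Send pings
2020-08-11 12:36:40 - INFO - [proxy] Received setlogs for '4'
2020-08-11 12:36:40 - INFO - [proxy] Pong received from '4'
2020-08-11 12:36:40 - INFO - [proxy] Pong received from '5'
2020-08-11 16:16:14 - INFO - [proxy] Send pings
2020-08-11 16:17:34 - INFO - [proxy] Send pings
2020-08-11 16:18:54 - INFO - [proxy] Send pings
""".splitlines()

last_pong, unanswered = summarize_pings(SAMPLE)
for node, stamp in sorted(last_pong.items()):
    print(f"node {node}: last pong at {stamp}")
print(f"{len(unanswered)} ping round(s) sent since the last pong")
```

To run it against a real file, pass an open file handle for /var/log/centreon-gorgone/gorgoned.log instead of SAMPLE. A growing count of unanswered ping rounds would match the symptom reported here.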

I did not restart the Central services this morning (systemctl restart cbd centengine gorgoned). I also know that something went wrong because my scheduled downtimes on the distant poller are not working. Please note that host and service data is still coming through: (screenshot: 2020-08-12 08_45_29-Centreon - IT Network Monitoring)

lpinsivy commented 4 years ago

So you have an issue with Centreon Gorgone communication (from the central to the pollers), but everything is OK when Centreon Engine forwards collected data to the database using Centreon Broker.

Can you check the status of 'gorgoned' service on your pollers and the associated /var/log/centreon-gorgone/gorgoned.log?

wackou72 commented 4 years ago

The gorgoned service is up and running on all my servers. (screenshot: 2020-08-13 16_15_20-root@BRSPOCENTREON01P_~)

Here is the content of /var/log/centreon-gorgone/gorgoned.log on my remote servers. (screenshot: 2020-08-13 16_16_42-) Strangely, the log stops at exactly the same time on all of them (all 4 remote servers are in different time zones).

2020-08-11 14:44:17 - INFO - [action] Copy processing - Received chunk for '/etc/centreon-engine//'
2020-08-11 14:44:17 - INFO - [action] Copy processing - Copy to '/etc/centreon-engine//' finished successfully
2020-08-11 14:44:17 - INFO - [action] Copy processing - Received chunk for '/etc/centreon-broker/'
2020-08-11 14:44:17 - INFO - [action] Copy processing - Copy to '/etc/centreon-broker/' finished successfully
2020-08-11 14:44:45 - INFO - [action] Copy processing - Received chunk for '/etc/centreon-engine//'
2020-08-11 14:44:45 - INFO - [action] Copy processing - Copy to '/etc/centreon-engine//' finished successfully
2020-08-11 14:44:45 - INFO - [action] Copy processing - Received chunk for '/etc/centreon-broker/'
2020-08-11 14:44:45 - INFO - [action] Copy processing - Copy to '/etc/centreon-broker/' finished successfully
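When pollers sit in different time zones, comparing the moment each log stops is easier after converting to UTC. The Python sketch below shows one way to do that. It assumes gorgoned.log timestamps are written in each server's local time. The poller names, UTC offsets, and all timestamps except the first are hypothetical and only illustrate the conversion.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_stamp, utc_offset_hours):
    """Interpret a log timestamp as local time at a fixed offset and return UTC."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(local_stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return local.astimezone(timezone.utc)


# Hypothetical inputs: last log line per poller and that poller's UTC offset.
# Only the first timestamp appears in this thread; the offsets are made up.
LAST_LINES = {
    "poller-a": ("2020-08-11 14:44:45", 2),
    "poller-b": ("2020-08-11 08:44:45", -4),
    "poller-c": ("2020-08-11 20:44:47", 8),
}

utc_times = {name: to_utc(stamp, off) for name, (stamp, off) in LAST_LINES.items()}
spread = max(utc_times.values()) - min(utc_times.values())

for name, stamp in sorted(utc_times.items()):
    print(f"{name}: last log line at {stamp.isoformat()}")
print(f"spread between first and last stop: {spread}")
```

A spread of a few seconds would suggest that the pollers stopped receiving work at the same moment, which points at the Central side rather than at each poller individually.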

lpinsivy commented 4 years ago

Hi @wackou72, can you update to the latest version of Gorgone (20.04.4) on the Central server and restart the gorgoned process?

Regards,

wackou72 commented 4 years ago

Hi @lpinsivy, I upgraded Gorgone to 20.04.4 and restarted all my servers. (screenshot: 2020-08-18 09_44_45-Centreon - IT Network Monitoring) I will let you know if I encounter the issue again and will post the content of /var/log/centreon-gorgone/gorgoned.log.

wackou72 commented 4 years ago

Unfortunately, that didn't solve the issue. Here is the last message:

2020-08-19 00:55:00 - INFO - [proxy] Send pings

From this point on, I can't apply a new configuration, restart the remote pollers, or force an immediate check, and scheduled downtimes don't work until I run systemctl restart cbd centengine gorgoned. Once that is run, the log and the pings work again:

2020-08-19 13:07:02 - INFO - [proxy] Create module 'proxy' child process for pool id '1'
2020-08-19 13:07:02 - INFO - [proxy] Create module 'proxy' child process for pool id '2'
2020-08-19 13:07:02 - INFO - [proxy] Create module 'proxy' child process for pool id '3'
2020-08-19 13:07:02 - INFO - [proxy] Create module 'proxy' child process for pool id '4'
2020-08-19 13:07:02 - INFO - [proxy] Create module 'proxy' child process for pool id '5'
2020-08-19 13:07:02 - INFO - [core] Setcoreid changed 1
2020-08-19 13:07:02 - INFO - [proxy] Node '2' is registered
2020-08-19 13:07:02 - INFO - [proxy] Node '3' is registered
2020-08-19 13:07:02 - INFO - [proxy] Node '4' is registered
2020-08-19 13:07:02 - INFO - [proxy] Node '5' is registered
2020-08-19 13:07:03 - INFO - [zmqclient] Client connected successfully to 'tcp://1.1.1.1:5556'
2020-08-19 13:07:04 - INFO - [zmqclient] Client connected successfully to 'tcp://2.2.2.2:5556'
2020-08-19 13:07:04 - INFO - [proxy] Pong received from '2'
2020-08-19 13:07:05 - INFO - [zmqclient] Client connected successfully to 'tcp://3.3.3.3:5556'
2020-08-19 13:07:05 - INFO - [zmqclient] Client connected successfully to 'tcp://4.4.4.4:5556'
2020-08-19 13:07:05 - INFO - [proxy] Pong received from '4'
2020-08-19 13:07:05 - INFO - [proxy] Pong received from '3'
2020-08-19 13:07:05 - INFO - [proxy] Pong received from '5'
2020-08-19 13:07:20 - INFO - [proxy] Send pings
2020-08-19 13:07:20 - INFO - [proxy] Pong received from '4'
2020-08-19 13:07:20 - INFO - [proxy] Pong received from '5'
2020-08-19 13:07:21 - INFO - [proxy] Pong received from '3'
2020-08-19 13:07:21 - INFO - [proxy] Pong received from '2'

Let me know if you need the full log of the Central and/or the remote pollers, and whether I need to enable something to get all the debugging messages.

lpinsivy commented 4 years ago

When you try to export the configuration or re-schedule a command, what shows up in gorgoned.log?

Regards,

wackou72 commented 4 years ago

Here is the output: (screenshots: 2020-08-24 15_18_44-Window, 2020-08-24 15_18_16-Window)

cgagnaire commented 4 years ago

Hi @wackou72, Thanks for the info. Can you provide us with the full Gorgone log from the Central, from the last restart to the time it starts failing? Can you do that at debug level? You can activate debug from 'Administration > Parameters > Debug'. You'll need to restart gorgoned to apply it.

wackou72 commented 4 years ago

Hello @cgagnaire, should I enable everything? (screenshot)

cgagnaire commented 4 years ago

Hi @wackou72, No, only the Centreon Gorgone debug.

wackou72 commented 4 years ago

Hi @cgagnaire, OK, I set the debug mode as requested and restarted everything. I will let you know once the defect occurs again.

wackou72 commented 4 years ago

Hi @cgagnaire, where can I send you the file? It contains server names etc. and I want this to stay private. Regards

cgagnaire commented 4 years ago

Hi @wackou72, Send it to the email address of my account.

Midorip commented 4 years ago

Hello

I have the exact same issue right now with the same behavior: it works for some time (one hour to one day), and then I have to restart my master process to make external commands and acknowledgements work again.

I already put Gorgone in debug mode. I can give you some additional input in this screenshot, taken when it stopped working: (screenshot)

Do you need other logs ?

Regards,

cgagnaire commented 4 years ago

Hi @Midorip, Can you try the latest version of Gorgone from the unstable repository:

yum update centreon-gorgone\* --enablerepo=centreon-unstable*

You can roll back to the latest stable version with a downgrade command in case of problems:

yum downgrade centreon-gorgone\*

wackou72 commented 4 years ago

Hello @cgagnaire and @lpinsivy, has this issue been solved? I managed to solve it with the package from the unstable repo. When will it be synced to the stable branch? Regards.