Cacti / spine

Spine C Based Poller for Cacti
GNU Lesser General Public License v2.1

Cacti poller time overshooting to 300s or more, apparently because of spine error #182

Closed gj00354347 closed 3 years ago

gj00354347 commented 3 years ago

Dear Contributors,

We have a multi-poller environment on Cacti 1.2.14 with one master poller and 9 remote pollers. For the last few months we have been observing the poller time overshoot on all remote pollers, with polling time reaching 300s or more, and when we check the logs we find a fatal error related to spine. Because we have noticed a correlation between this error in the logs and the poller time overshooting, I think this spine issue is related to the overshoot.

Please find below the error that we see:

2021-01-15 00:12:25 - SPINE: Poller[9] PID[23274] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine thread)
2021-01-15 00:12:25 - SPINE: Poller[9] PID[23293] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine thread)
2021-01-15 00:17:18 - SPINE: Poller[9] PID[26604] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine thread)
2021-01-15 00:17:21 - SPINE: Poller[9] PID[26479] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine thread)
2021-01-15 00:17:23 - SPINE: Poller[9] PID[26685] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine thread)
2021-01-15 00:17:23 - SPINE: Poller[9] PID[26484] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine thread)
2021-01-15 00:17:23 - SPINE: Poller[9] PID[26555] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine thread)
2021-01-15 00:17:24 - SPINE: Poller[9] PID[26638] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine thread)
2021-01-15 00:17:24 - SPINE: Poller[9] PID[26514] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine thread)
2021-01-15 00:17:24 - SPINE: Poller[9] PID[26509] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine thread)
2021-01-15 00:22:18 - SPINE: Poller[9] PID[29516] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine thread)
2021-01-15 00:22:19 - SPINE: Poller[9] PID[29670] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine thread)
2021-01-15 00:22:19 - SPINE: Poller[9] PID[29475] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine thread)
2021-01-15 00:22:19 - SPINE: Poller[9] PID[29557] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine thread)
2021-01-15 00:22:19 - SPINE: Poller[9] PID[29431] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine th

Please note that I have grepped only the error lines. The poller statistics from the same period are:

2021-01-15 00:12:01 - SYSTEM STATS: Time:297.4902 Method:spine Processes:10 Threads:10 Hosts:219 HostsPerProcess:22 DataSources:43835 RRDsProcessed:0
2021-01-15 00:17:01 - SYSTEM STATS: Time:297.1938 Method:spine Processes:10 Threads:10 Hosts:219 HostsPerProcess:22 DataSources:43835 RRDsProcessed:0
2021-01-15 00:22:03 - SYSTEM STATS: Time:297.0725 Method:spine Processes:10 Threads:10 Hosts:219 HostsPerProcess:22 DataSources:43835 RRDsProcessed:0
2021-01-15 00:27:01 - SYSTEM STATS: Time:297.5614 Method:spine Processes:10 Threads:10 Hosts:219 HostsPerProcess:22 DataSources:43835 RRDsProcessed:0
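
For anyone who wants to check how close each run gets to the polling interval, the Time field can be pulled out of the SYSTEM STATS lines with a few lines of Python. This is only a rough sketch and not part of Cacti: the function names, the 300-second interval default and the 0.95 threshold are assumptions chosen for illustration.

```python
import re

# Matches "key:value" pairs such as "Time:297.4902" or "Method:spine".
FIELD_RE = re.compile(r"(\w+):(\S+)")


def parse_system_stats(line):
    """Return the key/value fields of a 'SYSTEM STATS:' log line as a dict."""
    _, _, payload = line.partition("SYSTEM STATS:")
    return dict(FIELD_RE.findall(payload))


def near_overrun(line, interval=300.0, threshold=0.95):
    """True when the poller run used at least `threshold` of the interval.

    Both defaults are assumptions for this sketch, not Cacti settings.
    """
    stats = parse_system_stats(line)
    return float(stats["Time"]) >= interval * threshold


sample = ("2021-01-15 00:12:01 - SYSTEM STATS: Time:297.4902 Method:spine "
          "Processes:10 Threads:10 Hosts:219 HostsPerProcess:22 "
          "DataSources:43835 RRDsProcessed:0")

stats = parse_system_stats(sample)
print(stats["Time"], stats["Hosts"], near_overrun(sample))
```

Running this over the whole log and comparing the flagged timestamps with the FATAL lines would show whether the crashes and the long runs really fall in the same polling cycles.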

Cacti version - 1.2.14
spine version - 1.2.14
PHP - PHP 5.4.16
MySQL - mysql Ver 15.1 Distrib 5.5.65-MariaDB
OS - Red Hat Enterprise Linux Server release 7.8 (Maipo)

Can you please help us? Load, memory consumption and I/O stats all seem normal, and the only clue we have is the spine-related error.

I see in the changelog of Cacti 1.2.16 that a similar bug was fixed, but the error code there is 11, whereas in my case it is 6, which says "No such device or address". Please find below the bug fixed in the 1.2.16 changelog:

issue#3948: Spine 1.2.15 - Spine Encountered An Unhandled Exception Signal Number: '6' [11, Resource temporarily unavailable] (Spine thread)
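
A note on reading these messages: the quoted number after "Signal Number:" is the signal, while the bracketed pair looks like an errno value with its text. On Linux, signal 6 is SIGABRT, errno 6 is ENXIO ("No such device or address") and errno 11 is EAGAIN ("Resource temporarily unavailable"), so both messages report the same signal with a different errno. That the bracket holds errno at the time of the crash is an assumption based on the message texts, not something confirmed from the spine source. A minimal Python sketch to decode a line (the `decode_fatal` helper and its regex are hypothetical, written for this message format only):

```python
import errno
import re
import signal

# Pattern for the spine FATAL message format quoted in this thread.
FATAL_RE = re.compile(r"Signal Number: '(\d+)' \[(\d+), ([^\]]+)\]")


def decode_fatal(line):
    """Return (signal_name, errno_name, errno_text), or None if no match."""
    m = FATAL_RE.search(line)
    if m is None:
        return None
    signum, err, text = int(m.group(1)), int(m.group(2)), m.group(3)
    sig_name = signal.Signals(signum).name            # platform signal table
    err_name = errno.errorcode.get(err, "unknown")    # platform errno table
    return sig_name, err_name, text


mine = ("FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' "
        "[6, No such device or address] (Spine thread)")
fixed = ("Spine Encountered An Unhandled Exception Signal Number: '6' "
         "[11, Resource temporarily unavailable] (Spine thread)")

print(decode_fatal(mine))
print(decode_fatal(fixed))
```

The names are looked up in the tables of the machine running the script, so it should be run on the same kind of system as the poller.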

Let me know if something is required from my side.

Best Regards, Gopal gopal.jee1729@gmail.com

netniV commented 3 years ago

I have moved this over to the spine repo, but can you run spine in the readonly verbose mode to see what device is causing the above failure?

gj00354347 commented 3 years ago

I do not always see this error, but I see it frequently. I just ran spine on the main poller as well as on one of the remote pollers, and I attach the output of the read-only run with -V=5 (i.e. debug), although it does not give me any clue.

gj00354347 commented 3 years ago

PFA the remote poller spine run in verbose=5 and read-only mode.

gj00354347 commented 3 years ago

PFA the remote poller spine output, run in verbose=5 and read-only mode.

gj00354347 commented 3 years ago

attachment

gj00354347 commented 3 years ago

I am not able to upload the output file. I drag and drop it, but when I click on the comment box the attachment is not there. Is there any trick I need to use to upload files?

gj00354347 commented 3 years ago

Here I paste the output:

DEBUG: The log_destination variable is 4 (STDOUT)
DEBUG: The path_php variable is /bin/php
DEBUG: The availability_method variable is 2
DEBUG: The ping_recovery_count variable is 3
DEBUG: The ping_failure_count variable is 2
DEBUG: The ping_method variable is 2
DEBUG: The ping_retries variable is 3
DEBUG: The ping_timeout variable is 500
DEBUG: The snmp_retries variable is 3
DEBUG: The log_perror variable is 1
DEBUG: The log_pwarn variable is 1
DEBUG: The boost_redirect variable is 1
DEBUG: The boost_rrd_update_enable variable is 1
DEBUG: The log_pstats variable is 1
DEBUG: The threads variable is 10
DEBUG: The polling interval is 300 seconds
DEBUG: The number of concurrent processes is 5
DEBUG: The script timeout is 60
DEBUG: The selective_device_debug variable is 2237
DEBUG: The spine_log_level variable is 1
DEBUG: The number of php script servers to run is 1
DEBUG: StartDevice='-1', EndDevice='-1', TotalPHPScripts='0
DEBUG: The PHP Script Server is Not Required
DEBUG: The Maximum SNMP OID Get Size is 10
Selective Debug Devices 2237
Version 1.2.14 starting
DEBUG: MySQL is Thread Safe!
DEBUG: Capability CAP_NET_RAW is set.
DEBUG: Spine has cap_net_raw capability.
DEBUG: Spine has got ICMP
SPINE: Initializing Net-SNMP API
DEBUG: Issues with SNMP Header Version information, assuming old version of Net-SNMP.
SPINE: Initializing PHP Script Server(s)
NOTE: Spine will support multithread device polling.
DEBUG: Initial Value of Active Threads is 0
DEBUG: Valid Thread to be Created
DEBUG: In Poller, About to Start Polling of Device for Device ID 0
SPINE: Active Threads is 1, Pending is 1
Device[0] HT[1] Total Time: 0.0013 Seconds
Device[0] HT[1] DEBUG: HOST COMPLETE: About to Exit Device Polling Thread Function
DEBUG: The Value of Active Threads is 0 for Device ID 0
POLLER: Active Threads is 0, Pending is 0
SPINE: The Final Value of Threads is 0
DEBUG: Thread Cleanup Complete
DEBUG: PHP Script Server Pipes Closed
DEBUG: Allocated Variable Memory Freed
DEBUG: MYSQL Free & Close Completed
DEBUG: Net-SNMP Close Completed
Time: 0.1144 s, Threads: 10, Devices: 1

gj00354347 commented 3 years ago

We also encountered an error related to MySQL in a separate error log:

Error in `/opt/SP/cacti/spine/spine': double free or corruption (out): 0x00007fa48000d820
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x7fa49ccdc299]
/usr/lib64/mysql/libmysqlclient.so.18(mysql_close+0x242)[0x7fa49e262e62]
/opt/SP/cacti/spine/spine[0x40d58d]
/opt/SP/cacti/spine/spine[0x40f92c]
/lib64/libpthread.so.0(+0x7ea5)[0x7fa49d2a2ea5]
/lib64/libc.so.6(clone+0x6d)[0x7fa49cd598dd]

gj00354347 commented 3 years ago

I also see some segmentation fault errors reported by spine:

cacti.log-20210101.gz:2021-01-01 21:12:28 - SPINE: Poller[4] PID[55212] FATAL: Spine Encountered a Segmentation Fault [0, Success] (Spine thread)
cacti.log-20210101.gz:2021-01-01 21:12:28 - SPINE: Poller[4] PID[54982] FATAL: Spine Encountered a Segmentation Fault [0, Success] (Spine thread)
cacti.log-20210101.gz:2021-01-01 21:12:29 - SPINE: Poller[4] PID[54974] FATAL: Spine Encountered An Unhandled Exception Signal Number: '6' [6, No such device or address] (Spine thread)
cacti.log-20210101.gz:2021-01-01 21:12:29 - SPINE
cacti.log-20210114.gz:2021-01-14 23:22:27 - SPINE: Poller[4] PID[59234] FATAL: Spine Encountered a Segmentation Fault [0, Success] (Spine thread)

bmfmancini commented 3 years ago

Hey, check out this similar ticket: https://github.com/Cacti/spine/issues/174 and try updating to 1.2.16. I know this has come up a few times.

gj00354347 commented 3 years ago

@bmfmancini: Thanks for the help. We are in the process of updating Cacti and spine to the latest version, i.e. 1.2.16. I also see in the comments of ticket #174 that the reporter disabled some slow query log on the server ("Problem solved changing some slow query log by PHP scripts. Thanks a lot! :)"), although we have not enabled any such logs. Let's see how it behaves after updating spine to 1.2.16.

I will come back after updating spine to 1.2.16.