Cacti / spine

Spine C Based Poller for Cacti
GNU Lesser General Public License v2.1
81 stars 45 forks source link

Spine 1.1.38 : results swap #125

Closed sysres-dev closed 4 years ago

sysres-dev commented 4 years ago

Describe the bug Spine seems to be swapping results of tests for a same device in a single polling, it ends up on graphs with erratic data, or empty graph if data cant fit in.

To Reproduce Not sure how to reproduce, this problems happens with 3 graphs templates on our side for the poller i checked. Since our monitoring system is quite big (35k devices), it may also exists somewhere else.

Expected behavior Spine shouldn't swap results.

Screenshots image

Server (please complete the following information):

Compiling (please complete the following information):

Additional context Add any other context about the problem here including logs and command line output

Here is the spine command i played to identify the problem and its output, i added an HERE to lines that shows the problem. /usr/local/spine/bin/spine --mode online --poller 12 --hostlist=65140 --mibs -C /usr/local/spine/etc/spine.conf -R -S

Device[65140] TH[3] DS[1061789] SS[0] SERVER: /var/www/cacti/scripts/ss_host_cpu.php ss_host_cpu 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'usage' '3', output: U
WARNING: Invalid Response, Device[65140] TH[2] DS[1061997] SCRIPT: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'used' '4', output:
Device[65140] TH[2] DS[1061997] SS[1] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'used' '4', output: U
WARNING: Invalid Response, Device[65140] TH[3] DS[1061790] SCRIPT: /var/www/cacti/scripts/ss_host_cpu.php ss_host_cpu 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'usage' '4', output:
Device[65140] TH[3] DS[1061790] SS[2] SERVER: /var/www/cacti/scripts/ss_host_cpu.php ss_host_cpu 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'usage' '4', output: U
WARNING: Invalid Response, Device[65140] TH[2] DS[1061997] SCRIPT: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'total' '4', output:
Device[65140] TH[2] DS[1061997] SS[3] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'total' '4', output: U
Device[65140] TH[2] DS[1061998] SS[0] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'used' '5', output: 5
Device[65140] TH[2] DS[1061998] SS[1] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'total' '5', output: 151920312320
Device[65140] TH[2] DS[1061999] SS[2] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'used' '6', output: 10
Device[65140] TH[2] DS[1061999] SS[3] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'total' '6', output: 429493518336
Device[65140] TH[2] DS[1062000] SS[4] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'used' '7', output: 21
Device[65140] TH[2] DS[1062000] SS[3] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'total' '7', output: 214745153536
Device[65140] TH[2] DS[1062001] SS[4] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'used' '10', output: 148907360256
Device[65140] TH[2] DS[1062001] SS[0] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'total' '10', output: 22
Device[65140] TH[2] DS[1062002] SS[1] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'used' '9', output: 15
Device[65140] TH[2] DS[1062002] SS[2] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'total' '9', output: 14
HERE ==> Device[65140] TH[2] DS[1062011] SS[3] SERVER: /var/www/cacti/scripts/ss_fping.php ss_fping 'custom-c904.custom.domain' '20' 'ICMP' '', output: 214745153536
Device[65140] TH[1] DS[1061994] SS[2] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'used' '1', output: 68716396544
HERE ==> Device[65140] TH[1] DS[1061994] SS[3] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'total' '1', output: min:0.1140 avg:0.2532 max:0.7900 dev:0.1770 loss:0.0000
Device[65140] TH[1] DS[1061995] SS[4] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'used' '2', output: 15
Device[65140] TH[1] DS[1061995] SS[0] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'total' '2', output: 68
Device[65140] TH[1] DS[1061996] SS[1] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'used' '3', output: 20
Device[65140] TH[1] DS[1061996] SS[2] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'total' '3', output: 62324166656

Dont hesitate to ask further informations if needed !

sysres-dev commented 4 years ago

Any update ? I can help and provide more informations if you need.

netniV commented 4 years ago

You would need to narrow down more specifics on this and examples of how the data is definitely swapped, eg, which two fields appear to be swapped, and the log output showing the data being updated into the wrong one to help narrow down how that could be occurring If that problem is because of spine, it will be present in the current version as I can't see a change that addresses this kind of behaviour.

sysres-dev commented 4 years ago

In these two log lines, results are swapped :

The output of this line (it is a disk check) belongs to the test run at the second line (a ping check) :

Device[65140] TH[1] DS[1061994] SS[3] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'total' '1', output: min:0.1140 avg:0.2532 max:0.7900 dev:0.1770 loss:0.0000

However, when i manually run the command for ss_host_disk.php in script server for this host , i dont have 214745153536 as output, so this result may is swapped with something else as well.

Device[65140] TH[2] DS[1062011] SS[3] SERVER: /var/www/cacti/scripts/ss_fping.php ss_fping 'custom-c904.custom.domain' '20' 'ICMP' '', output: 214745153536

I just reran the test, and i observed the following :

Device[65140] TH[1] DS[1061994] SS[3] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'total' '1', output: min:0.1120 avg:0.1291 max:0.1480 dev:0.0108 loss:0.0000 Device[65140] TH[2] DS[1062011] SS[3] SERVER: /var/www/cacti/scripts/ss_fping.php ss_fping 'custom-c904.custom.domain' '20' 'ICMP' '', output: 214745153536 Device[65140] TH[2] DS[1062000] SS[3] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'total' '7', output: 214745153536

Here are the same tests run manually from php script server :

/opt/rh/rh-php70/root/bin/php -q /var/www/cacti/script_server.php

PHP Script Server has Started - Parent is cmd /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'total' '7' 214745153536 /var/www/cacti/scripts/ss_fping.php ss_fping 'custom-c904.custom.domain' '20' 'ICMP' '' min:0.1110 avg:0.1478 max:0.3590 dev:0.0552 loss:0.0000 /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:customcommunity::::::' 'get' 'total' '1' 214759895040

For this host and these tests :

So, from what i have observed so far :

If you need more specific tests i'm ready to help.

netniV commented 4 years ago

Thanks, quite a bit of information for me to review. I'll have time to do so probably at the weekend now.

sysres-dev commented 4 years ago

I'm not sure how to verify it, but since all affected scripts are php scripts, it could be somehow related to the script_server.php, or the way it is called and handled by spine.

cigamit commented 4 years ago

It's likely the snmp timeouts from the previous response. Those should be redirected to standard error, which should be redirected to the cacti standard error log by default. That should be confirmed be a Dev. With the script server, you have to be careful with your script server scripts not outputting to standard output more than a single line, and they need to output that single line at one time. Otherwise the Script Server output will get out of alignment with the poller.

sysres-dev commented 4 years ago

I tried different tests to get more insight about it, i first disabled all data sources, except one which is working properly from running script server manually, i got this output with spine :

/usr/local/spine/bin/spine --mode online --poller 12 --hostlist=65140 --mibs -C /usr/local/spine/etc/spine.conf -R -S -V 5

SPINE: Using spine config file [/usr/local/spine/etc/spine.conf]
2019/12/24 17:31:04 - SPINE: Poller[12] DEBUG: The path_php_server variable is /var/www/cacti/script_server.php
2019/12/24 17:31:04 - SPINE: Poller[12] DEBUG: The path_cactilog variable is /var/log/cacti/cacti.log
DEBUG: The log_destination variable is 4 (STDOUT)
DEBUG: The path_php variable is /opt/rh/rh-php70/root/bin/php
DEBUG: The availability_method variable is 2
DEBUG: The ping_recovery_count variable is 3
DEBUG: The ping_failure_count variable is 2
DEBUG: The ping_method variable is 1
DEBUG: The ping_retries variable is 1
DEBUG: The ping_timeout variable is 600
DEBUG: The snmp_retries variable is 1
DEBUG: The log_perror variable is 1
DEBUG: The log_pwarn variable is 0
DEBUG: The boost_redirect variable is 1
DEBUG: The boost_rrd_update_enable variable is 1
DEBUG: The log_pstats variable is 0
DEBUG: The threads variable is 40
DEBUG: The polling interval is 300 seconds
DEBUG: The number of concurrent processes is 50
DEBUG: The script timeout is 20
DEBUG: The selective_device_debug variable is 46480,65178,46420,65140
DEBUG: The spine_log_level variable is 2
DEBUG: The number of php script servers to run is 5
DEBUG: Device List to be polled='65140', TotalPHPScripts='1
DEBUG: The PHP Script Server is Required
DEBUG: The Maximum SNMP OID Get Size is 20
Selective Debug Devices 46480,65178,46420,65140
Version 1.1.38 starting
DEBUG: MySQL is Thread Safe!
DEBUG: Spine is running asroot.
SPINE: Initializing Net-SNMP API
DEBUG: Issues with SNMP Header Version information, assuming old version of Net-SNMP.
SPINE: Initializing PHP Script Server(s)
DEBUG: SS[0] PHP Script Server Routine Starting
DEBUG: SS[0] PHP Script Server About to FORK Child Process
DEBUG: SS[0] PHP Script Server Child FORK Success
ERROR: SS[0] Script Server did not start properly return message was:
DEBUG: SS[1] PHP Script Server Routine Starting
DEBUG: SS[1] PHP Script Server About to FORK Child Process
DEBUG: SS[1] PHP Script Server Child FORK Success
ERROR: SS[1] Script Server did not start properly return message was:
DEBUG: SS[2] PHP Script Server Routine Starting
DEBUG: SS[2] PHP Script Server About to FORK Child Process
DEBUG: SS[2] PHP Script Server Child FORK Success
ERROR: SS[2] Script Server did not start properly return message was:
DEBUG: SS[3] PHP Script Server Routine Starting
DEBUG: SS[3] PHP Script Server About to FORK Child Process
DEBUG: SS[3] PHP Script Server Child FORK Success
ERROR: SS[3] Script Server did not start properly return message was:
DEBUG: SS[4] PHP Script Server Routine Starting
DEBUG: SS[4] PHP Script Server About to FORK Child Process
DEBUG: SS[4] PHP Script Server Child FORK Success
ERROR: SS[4] Script Server did not start properly return message was:
NOTE: Spine will support multithread device polling.
DEBUG: Initial Value of Active Threads is 0
DEBUG: Valid Thread to be Created
DEBUG: In Poller, About to Start Polling of Device for Device ID 65140
SPINE: Active Threads is 1, Pending is 1
DEBUG: Valid Thread to be Created
SPINE: Active Threads is 2, Pending is 2
DEBUG: In Poller, About to Start Polling of Device for Device ID 65140
DEBUG: Valid Thread to be Created
DEBUG: In Poller, About to Start Polling of Device for Device ID 65140
SPINE: Active Threads is 3, Pending is 3
DEBUG: Valid Thread to be Created
DEBUG: In Poller, About to Start Polling of Device for Device ID 65140
SPINE: Active Threads is 4, Pending is 4
DEBUG: Valid Thread to be Created
DEBUG: In Poller, About to Start Polling of Device for Device ID 65140
SPINE: Active Threads is 5, Pending is 5
DEBUG: Valid Thread to be Created
DEBUG: In Poller, About to Start Polling of Device for Device ID 65140
SPINE: Active Threads is 6, Pending is 6
Device[65140] DEBUG: Entering SNMP Ping
Device[65140] DEBUG: Entering SNMP Ping
Device[65140] SNMP Result: Device responded to SNMP
Updating Full System Information Table
Device[65140] DEBUG: Entering SNMP Ping
Device[65140] DEBUG: Entering SNMP Ping
Device[65140] DEBUG: Entering SNMP Ping
Device[65140] DEBUG: Entering SNMP Ping
Device[65140] TH[2] RECACHE: Processing 8 items in the auto reindex cache for 'custom-c904.custom.domain
Device[65140] TH[2] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[2] Recache DataQuery[8] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[2] Recache DataQuery[9] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[2] Recache DataQuery[17] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[2] Recache DataQuery[18] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[2] Recache DataQuery[65] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[2] Recache DataQuery[73] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[2] Recache DataQuery[75] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[3] RECACHE: Processing 8 items in the auto reindex cache for 'custom-c904.custom.domain
Device[65140] TH[3] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[3] Recache DataQuery[8] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[3] Recache DataQuery[9] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[3] Recache DataQuery[17] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[3] Recache DataQuery[18] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[3] Recache DataQuery[65] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[3] Recache DataQuery[73] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[3] Recache DataQuery[75] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[4] RECACHE: Processing 8 items in the auto reindex cache for 'custom-c904.custom.domain
Device[65140] TH[1] RECACHE: Processing 8 items in the auto reindex cache for 'custom-c904.custom.domain
Device[65140] TH[4] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[4] Recache DataQuery[8] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[4] Recache DataQuery[9] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[4] Recache DataQuery[17] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[4] Recache DataQuery[18] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[4] Recache DataQuery[65] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[4] Recache DataQuery[73] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[4] Recache DataQuery[75] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[6] RECACHE: Processing 8 items in the auto reindex cache for 'custom-c904.custom.domain
Device[65140] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[1] Recache DataQuery[8] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[1] Recache DataQuery[9] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[1] Recache DataQuery[17] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[1] Recache DataQuery[18] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[1] Recache DataQuery[65] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[1] Recache DataQuery[73] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[1] Recache DataQuery[75] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[6] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[6] Recache DataQuery[8] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[6] Recache DataQuery[9] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[6] Recache DataQuery[17] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[6] Recache DataQuery[18] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[6] Recache DataQuery[65] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[6] Recache DataQuery[73] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[6] Recache DataQuery[75] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[5] RECACHE: Processing 8 items in the auto reindex cache for 'custom-c904.custom.domain
Device[65140] TH[5] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[5] Recache DataQuery[8] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[5] Recache DataQuery[9] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[5] Recache DataQuery[17] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[5] Recache DataQuery[18] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[5] Recache DataQuery[65] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[5] Recache DataQuery[73] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[5] Recache DataQuery[75] OID: .1.3.6.1.2.1.1.3.0, output: 417536055
Device[65140] TH[2] Total Time: 0.016 Seconds
Device[65140] TH[3] Total Time: 0.016 Seconds
Device[65140] TH[1] NOTE: There are '1' Polling Items for this Device
Device[65140] TH[2] DEBUG: HOST COMPLETE: About to Exit Device Polling Thread Function
DEBUG: The Value of Active Threads is 5 for Device ID 65140
POLLR: Active Threads is 5, Pending is 5
Device[65140] TH[4] Total Time: 0.021 Seconds
Device[65140] TH[3] DEBUG: HOST COMPLETE: About to Exit Device Polling Thread Function
DEBUG: The Value of Active Threads is 4 for Device ID 65140
POLLR: Active Threads is 4, Pending is 4
Device[65140] TH[5] Total Time: 0.021 Seconds
Device[65140] TH[6] Total Time: 0.022 Seconds
Device[65140] TH[5] DEBUG: HOST COMPLETE: About to Exit Device Polling Thread Function
DEBUG: The Value of Active Threads is 3 for Device ID 65140
POLLR: Active Threads is 3, Pending is 3
Device[65140] TH[4] DEBUG: HOST COMPLETE: About to Exit Device Polling Thread Function
DEBUG: The Value of Active Threads is 2 for Device ID 65140
POLLR: Active Threads is 2, Pending is 2
Device[65140] TH[6] DEBUG: HOST COMPLETE: About to Exit Device Polling Thread Function
DEBUG: The Value of Active Threads is 1 for Device ID 65140
POLLR: Active Threads is 1, Pending is 1
WARNING: Invalid Response, Device[65140] TH[1] DS[1062011] SCRIPT: /var/www/cacti/scripts/ss_fping.php ss_fping 'custom-c904.custom.domain' '20' 'ICMP' '', output:
Device[65140] TH[1] DS[1062011] SS[0] SERVER: /var/www/cacti/scripts/ss_fping.php ss_fping 'custom-c904.custom.domain' '20' 'ICMP' '', output: U
Device[65140] TH[1] Total Time: 0.042 Seconds
Device[65140] TH[1] DEBUG: HOST COMPLETE: About to Exit Device Polling Thread Function
DEBUG: The Value of Active Threads is 0 for Device ID 65140
POLLR: Active Threads is 0, Pending is 0
SPINE: The Final Value of Threads is 0
DEBUG: Thread Cleanup Complete
DEBUG: SS[0] Script Server Shutdown Started
DEBUG: SS[1] Script Server Shutdown Started
DEBUG: SS[2] Script Server Shutdown Started
DEBUG: SS[3] Script Server Shutdown Started
DEBUG: SS[4] Script Server Shutdown Started
DEBUG: PHP Script Server Pipes Closed
DEBUG: Allocated Variable Memory Freed
DEBUG: MYSQL Free & Close Completed
DEBUG: Net-SNMP Close Completed
Time: 1.1891 s, Threads: 40, Devices: 1

So, from what i see here, the behavior is slightly different :

I then increased the number of data sources one by one until i start to get output other than 'U', only enabling thoses based on php scripts using the script_server. I started to have data showing up like in the initial posting after 4 data sources (7 items in poller cache), here is the corresponding spine output :

/usr/local/spine/bin/spine --mode online --poller 12 --hostlist=65140 --mibs -C /usr/local/spine/etc/spine.conf -R -S -V 5

SPINE: Using spine config file [/usr/local/spine/etc/spine.conf]
2019/12/24 17:53:01 - SPINE: Poller[12] DEBUG: The path_php_server variable is /var/www/cacti/script_server.php
2019/12/24 17:53:01 - SPINE: Poller[12] DEBUG: The path_cactilog variable is /var/log/cacti/cacti.log
DEBUG: The log_destination variable is 4 (STDOUT)
DEBUG: The path_php variable is /opt/rh/rh-php70/root/bin/php
DEBUG: The availability_method variable is 2
DEBUG: The ping_recovery_count variable is 3
DEBUG: The ping_failure_count variable is 2
DEBUG: The ping_method variable is 1
DEBUG: The ping_retries variable is 1
DEBUG: The ping_timeout variable is 600
DEBUG: The snmp_retries variable is 1
DEBUG: The log_perror variable is 1
DEBUG: The log_pwarn variable is 0
DEBUG: The boost_redirect variable is 1
DEBUG: The boost_rrd_update_enable variable is 1
DEBUG: The log_pstats variable is 0
DEBUG: The threads variable is 40
DEBUG: The polling interval is 300 seconds
DEBUG: The number of concurrent processes is 50
DEBUG: The script timeout is 20
DEBUG: The selective_device_debug variable is 46480,65178,46420,65140
DEBUG: The spine_log_level variable is 2
DEBUG: The number of php script servers to run is 5
DEBUG: Device List to be polled='65140', TotalPHPScripts='1
DEBUG: The PHP Script Server is Required
DEBUG: The Maximum SNMP OID Get Size is 20
Selective Debug Devices 46480,65178,46420,65140
Version 1.1.38 starting
DEBUG: MySQL is Thread Safe!
DEBUG: Spine is running asroot.
SPINE: Initializing Net-SNMP API
DEBUG: Issues with SNMP Header Version information, assuming old version of Net-SNMP.
SPINE: Initializing PHP Script Server(s)
DEBUG: SS[0] PHP Script Server Routine Starting
DEBUG: SS[0] PHP Script Server About to FORK Child Process
DEBUG: SS[0] PHP Script Server Child FORK Success
ERROR: SS[0] Script Server did not start properly return message was:
DEBUG: SS[1] PHP Script Server Routine Starting
DEBUG: SS[1] PHP Script Server About to FORK Child Process
DEBUG: SS[1] PHP Script Server Child FORK Success
ERROR: SS[1] Script Server did not start properly return message was:
DEBUG: SS[2] PHP Script Server Routine Starting
DEBUG: SS[2] PHP Script Server About to FORK Child Process
DEBUG: SS[2] PHP Script Server Child FORK Success
ERROR: SS[2] Script Server did not start properly return message was:
DEBUG: SS[3] PHP Script Server Routine Starting
DEBUG: SS[3] PHP Script Server About to FORK Child Process
DEBUG: SS[3] PHP Script Server Child FORK Success
ERROR: SS[3] Script Server did not start properly return message was:
DEBUG: SS[4] PHP Script Server Routine Starting
DEBUG: SS[4] PHP Script Server About to FORK Child Process
DEBUG: SS[4] PHP Script Server Child FORK Success
ERROR: SS[4] Script Server did not start properly return message was:
NOTE: Spine will support multithread device polling.
DEBUG: Initial Value of Active Threads is 0
DEBUG: Valid Thread to be Created
DEBUG: In Poller, About to Start Polling of Device for Device ID 65140
SPINE: Active Threads is 1, Pending is 1
DEBUG: Valid Thread to be Created
SPINE: Active Threads is 2, Pending is 2
DEBUG: In Poller, About to Start Polling of Device for Device ID 65140
DEBUG: Valid Thread to be Created
DEBUG: In Poller, About to Start Polling of Device for Device ID 65140
SPINE: Active Threads is 3, Pending is 3
DEBUG: Valid Thread to be Created
DEBUG: In Poller, About to Start Polling of Device for Device ID 65140
SPINE: Active Threads is 4, Pending is 4
DEBUG: Valid Thread to be Created
SPINE: Active Threads is 5, Pending is 5
DEBUG: In Poller, About to Start Polling of Device for Device ID 65140
DEBUG: Valid Thread to be Created
DEBUG: In Poller, About to Start Polling of Device for Device ID 65140
SPINE: Active Threads is 6, Pending is 6
Device[65140] DEBUG: Entering SNMP Ping
Device[65140] DEBUG: Entering SNMP Ping
Device[65140] SNMP Result: Device responded to SNMP
Updating Full System Information Table
Device[65140] DEBUG: Entering SNMP Ping
Device[65140] DEBUG: Entering SNMP Ping
Device[65140] DEBUG: Entering SNMP Ping
Device[65140] DEBUG: Entering SNMP Ping
Device[65140] TH[5] RECACHE: Processing 8 items in the auto reindex cache for 'custom-c904.custom.domain
Device[65140] TH[2] RECACHE: Processing 8 items in the auto reindex cache for 'custom-c904.custom.domain
Device[65140] TH[5] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[5] Recache DataQuery[8] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[5] Recache DataQuery[9] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[5] Recache DataQuery[17] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[5] Recache DataQuery[18] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[5] Recache DataQuery[65] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[5] Recache DataQuery[73] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[5] Recache DataQuery[75] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[2] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[2] Recache DataQuery[8] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[2] Recache DataQuery[9] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[2] Recache DataQuery[17] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[2] Recache DataQuery[18] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[3] RECACHE: Processing 8 items in the auto reindex cache for 'custom-c904.custom.domain
Device[65140] TH[2] Recache DataQuery[65] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[2] Recache DataQuery[73] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[2] Recache DataQuery[75] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[3] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[3] Recache DataQuery[8] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[3] Recache DataQuery[9] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[3] Recache DataQuery[17] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[3] Recache DataQuery[18] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[3] Recache DataQuery[65] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[3] Recache DataQuery[73] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[3] Recache DataQuery[75] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[1] RECACHE: Processing 8 items in the auto reindex cache for 'custom-c904.custom.domain
Device[65140] TH[4] RECACHE: Processing 8 items in the auto reindex cache for 'custom-c904.custom.domain
Device[65140] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[1] Recache DataQuery[8] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[1] Recache DataQuery[9] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[1] Recache DataQuery[17] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[1] Recache DataQuery[18] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[1] Recache DataQuery[65] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[1] Recache DataQuery[73] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[1] Recache DataQuery[75] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[4] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[4] Recache DataQuery[8] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[4] Recache DataQuery[9] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[4] Recache DataQuery[17] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[4] Recache DataQuery[18] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[4] Recache DataQuery[65] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[4] Recache DataQuery[73] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[4] Recache DataQuery[75] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[6] RECACHE: Processing 8 items in the auto reindex cache for 'custom-c904.custom.domain
Device[65140] TH[6] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[6] Recache DataQuery[8] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[6] Recache DataQuery[9] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[6] Recache DataQuery[17] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[6] Recache DataQuery[18] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[6] Recache DataQuery[65] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[6] Recache DataQuery[73] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[6] Recache DataQuery[75] OID: .1.3.6.1.2.1.1.3.0, output: 417667753
Device[65140] TH[5] Total Time: 0.013 Seconds
Device[65140] TH[2] NOTE: There are '2' Polling Items for this Device
Device[65140] TH[3] NOTE: There are '2' Polling Items for this Device
Device[65140] TH[1] NOTE: There are '2' Polling Items for this Device
Device[65140] TH[4] NOTE: There are '1' Polling Items for this Device
Device[65140] TH[5] DEBUG: HOST COMPLETE: About to Exit Device Polling Thread Function
DEBUG: The Value of Active Threads is 5 for Device ID 65140
POLLR: Active Threads is 5, Pending is 5
Device[65140] TH[6] Total Time: 0.018 Seconds
Device[65140] TH[6] DEBUG: HOST COMPLETE: About to Exit Device Polling Thread Function
DEBUG: The Value of Active Threads is 4 for Device ID 65140
POLLR: Active Threads is 4, Pending is 4
WARNING: Invalid Response, Device[65140] TH[2] DS[1061996] SCRIPT: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:mycommunity::::::' 'get' 'used' '3', output:
Device[65140] TH[2] DS[1061996] SS[0] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:mycommunity::::::' 'get' 'used' '3', output: U
WARNING: Invalid Response, Device[65140] TH[3] DS[1062000] SCRIPT: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:mycommunity::::::' 'get' 'used' '7', output:
Device[65140] TH[3] DS[1062000] SS[1] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:mycommunity::::::' 'get' 'used' '7', output: U
Device[65140] TH[3] DS[1062000] SS[0] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:mycommunity::::::' 'get' 'total' '7', output: 88438829056
Device[65140] TH[3] Total Time: 0.068 Seconds
Device[65140] TH[3] DEBUG: HOST COMPLETE: About to Exit Device Polling Thread Function
DEBUG: The Value of Active Threads is 3 for Device ID 65140
POLLR: Active Threads is 3, Pending is 3
WARNING: Invalid Response, Device[65140] TH[1] DS[1061994] SCRIPT: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:mycommunity::::::' 'get' 'used' '1', output:
Device[65140] TH[1] DS[1061994] SS[2] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:mycommunity::::::' 'get' 'used' '1', output: U
Device[65140] TH[1] DS[1061994] SS[1] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:mycommunity::::::' 'get' 'total' '1', output: 148907360256
Device[65140] TH[1] Total Time: 0.081 Seconds
Device[65140] TH[1] DEBUG: HOST COMPLETE: About to Exit Device Polling Thread Function
DEBUG: The Value of Active Threads is 2 for Device ID 65140
POLLR: Active Threads is 2, Pending is 2
WARNING: Invalid Response, Device[65140] TH[4] DS[1062011] SCRIPT: /var/www/cacti/scripts/ss_fping.php ss_fping 'custom-c904.custom.domain' '20' 'ICMP' '', output:
Device[65140] TH[4] DS[1062011] SS[3] SERVER: /var/www/cacti/scripts/ss_fping.php ss_fping 'custom-c904.custom.domain' '20' 'ICMP' '', output: U
Device[65140] TH[4] Total Time:  0.13 Seconds
Device[65140] TH[4] DEBUG: HOST COMPLETE: About to Exit Device Polling Thread Function
DEBUG: The Value of Active Threads is 1 for Device ID 65140
POLLR: Active Threads is 1, Pending is 1
WARNING: Invalid Response, Device[65140] TH[2] DS[1061996] SCRIPT: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:mycommunity::::::' 'get' 'total' '3', output:
Device[65140] TH[2] DS[1061996] SS[4] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'custom-c904.custom.domain' '65140' '2:161:10000:1:20:mycommunity::::::' 'get' 'total' '3', output: U
Device[65140] TH[2] Total Time:  0.14 Seconds
Device[65140] TH[2] DEBUG: HOST COMPLETE: About to Exit Device Polling Thread Function
DEBUG: The Value of Active Threads is 0 for Device ID 65140
POLLR: Active Threads is 0, Pending is 0
SPINE: The Final Value of Threads is 0
DEBUG: Thread Cleanup Complete
DEBUG: SS[0] Script Server Shutdown Started
DEBUG: SS[1] Script Server Shutdown Started
DEBUG: SS[2] Script Server Shutdown Started
DEBUG: SS[3] Script Server Shutdown Started
DEBUG: SS[4] Script Server Shutdown Started
DEBUG: PHP Script Server Pipes Closed
DEBUG: Allocated Variable Memory Freed
DEBUG: MYSQL Free & Close Completed
DEBUG: Net-SNMP Close Completed
Time: 1.1936 s, Threads: 40, Devices: 1

So, from what i can see here, if a php script running in script server have its output empty for some reason, can it get out of alignment with the poller ? Also, important note, every tests in those logs run properly when launched manually from script_server.php.

cigamit commented 4 years ago

Edit your php.ini file, and set the error_log = /tmp/php_errors.log and then watch it. The script server MUST start properly if things are to work well. So, get that issue resolved right away.

sysres-dev commented 4 years ago

I changed my current php configuration regarding errors for this one :

/etc/opt/rh/rh-php70/php.ini:error_reporting = E_ALL & ~E_DEPRECATED & ~E_STRICT
/etc/opt/rh/rh-php70/php.ini:display_errors = On
/etc/opt/rh/rh-php70/php.ini:display_startup_errors = On
/etc/opt/rh/rh-php70/php.ini:log_errors = On
/etc/opt/rh/rh-php70/php.ini:log_errors_max_len = 1024
/etc/opt/rh/rh-php70/php.ini:ignore_repeated_errors = Off
/etc/opt/rh/rh-php70/php.ini:track_errors = On
/etc/opt/rh/rh-php70/php.ini:html_errors = On
/etc/opt/rh/rh-php70/php-fpm.conf:error_log = /var/opt/rh/rh-php70/log/php-fpm/error.log
/etc/opt/rh/rh-php70/php-fpm.d/www.conf:php_admin_value[error_log] = /var/opt/rh/rh-php70/log/php-fpm/www-error.log
/etc/opt/rh/rh-php70/php-fpm.d/www.conf:php_admin_flag[log_errors] = on

The www-error.log remains empty after running spine manually like before, and the error_log dont displays errors : /var/opt/rh/rh-php70/log/php-fpm/error.log

[26-Dec-2019 10:20:03] NOTICE: Terminating ...
[26-Dec-2019 10:20:03] NOTICE: exiting, bye-bye!
[26-Dec-2019 10:20:03] NOTICE: fpm is running, pid 26433
[26-Dec-2019 10:20:03] NOTICE: ready to handle connections
[26-Dec-2019 10:20:03] NOTICE: systemd monitor interval set to 10000ms

If you confirm the issue is not spine related, do you want me to close it and reopen to cacti ?

sysres-dev commented 4 years ago

A complete reinstallation of cacti, and recompilation/reinstallation of spine from 1.1.38 sources didn't solved the problem on a new fresh poller and device ids.

I still have nothing in php_errors/www-error logs.

cigamit commented 4 years ago

Let's try a few debug tests. Here they are:

1) For the device set host threads = 1, save the host and see if the problem continues

2) Set php script servers to 1 for the Data Collector, and see if the problem continues

3) Disable ss_fping.php data source and see if the problem resolves itself.

cigamit commented 4 years ago

Lastly, are you using the HMIB plugin?

sysres-dev commented 4 years ago

I applied steps after steps :

  1. It did not corrected the issue, but i observed a behavior change, some graphs went broken, some others started working, pretty much like if indexes order was affected, but on overall it's better.
  2. I cant apply it on master without breaking our production environment, however i tested it by manually running spine with php_servers DB variable set to 1, and the result is better.
  3. No change observed.

HMIB plugin is not used.

I retried multiples times, i removed datasources not based on php script server, and then data sources based on php script server other than ss_host_disk.php, i also did the same with data sources based on ss_host_cpu.php, and here is what i noticed :

For example, in this case (5 php pollers = 5 WARNING) :

DEBUG: The number of php script servers to run is 5
[...]
Device[65695] TH[1] NOTE: There are '18' Polling Items for this Device
WARNING: Invalid Response, Device[65695] TH[1] DS[1074627] SCRIPT: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'hostname' '65695' '2:161:2000:1:20:community::::::' 'get' 'used' '1', output:
Device[65695] TH[1] DS[1074627] SS[0] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'hostname' '65695' '2:161:2000:1:20:community::::::' 'get' 'used' '1', output: U
WARNING: Invalid Response, Device[65695] TH[1] DS[1074627] SCRIPT: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'hostname' '65695' '2:161:2000:1:20:community::::::' 'get' 'total' '1', output:
Device[65695] TH[1] DS[1074627] SS[1] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'hostname' '65695' '2:161:2000:1:20:community::::::' 'get' 'total' '1', output: U
WARNING: Invalid Response, Device[65695] TH[1] DS[1074628] SCRIPT: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'hostname' '65695' '2:161:2000:1:20:community::::::' 'get' 'used' '2', output:
Device[65695] TH[1] DS[1074628] SS[2] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'hostname' '65695' '2:161:2000:1:20:community::::::' 'get' 'used' '2', output: U
WARNING: Invalid Response, Device[65695] TH[1] DS[1074628] SCRIPT: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'hostname' '65695' '2:161:2000:1:20:community::::::' 'get' 'total' '2', output:
Device[65695] TH[1] DS[1074628] SS[3] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'hostname' '65695' '2:161:2000:1:20:community::::::' 'get' 'total' '2', output: U
WARNING: Invalid Response, Device[65695] TH[1] DS[1074629] SCRIPT: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'hostname' '65695' '2:161:2000:1:20:community::::::' 'get' 'used' '3', output:
Device[65695] TH[1] DS[1074629] SS[4] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'hostname' '65695' '2:161:2000:1:20:community::::::' 'get' 'used' '3', output: U

And in this case (1 php poller = 1 WARNING) :

DEBUG: The number of php script servers to run is 1
[...]
Device[65695] TH[1] NOTE: There are '18' Polling Items for this Device
WARNING: Invalid Response, Device[65695] TH[1] DS[1074627] SCRIPT: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'hostname' '65695' '2:161:2000:1:20:community::::::' 'get' 'used' '1', output:
Device[65695] TH[1] DS[1074627] SS[0] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'hostname' '65695' '2:161:2000:1:20:community::::::' 'get' 'used' '1', output: U
Device[65695] TH[1] DS[1074627] SS[0] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'hostname' '65695' '2:161:2000:1:20:community::::::' 'get' 'total' '1', output: 62415646720
Device[65695] TH[1] DS[1074628] SS[0] SERVER: /var/www/cacti/scripts/ss_host_disk.php ss_host_disk 'hostname' '65695' '2:161:2000:1:20:community::::::' 'get' 'used' '2', output: 214759895040

I also noticed when there is more php data sources to poll, this tend to 'shift' results.

Here are full logs : cacti_all_steps_storage.txt cacti_steps_1_3_storage.txt cacti_all_steps_cpu_storage.txt

cigamit commented 4 years ago

One of the nasty things with Windows devices is that when you have a removable devices like a USB stick, Windows screws up the indexing every time you insert or removed that device. This is why I always recommend the HMIB plugin if you have Windows devices. It's just because of it's defective SNMP agent. I'm not sure this is related, but it's a suspicion.

sysres-dev commented 4 years ago

From what i understood and observed earlier, this issue is affecting us on everything using php script server, it doesn't comes from the data windows displays in snmp, since they are working fine manually.

I can somehow reproduce the problem with the cmd.php (which is easier to edit to debug for me), thought it's the only output i get :

2020/01/16 14:41:28 - POLLER: Poller[13] WARNING: Invalid Response(s), Errors[1] Device[65695] Thread[1] DS[1074627]

I modified a bit of its output to trace what's going on in there, here are the parts of the code i changed for debugging :

705                         case POLLER_ACTION_SCRIPT_PHP: /* script (php script server) */
706                                 $output = trim(str_replace("\n", '', exec_poll_php($item['arg1'], $using_proc_function, $pipes, $cactiphp)));
707                                 cacti_log("output[$output] ds[$data_source]","");
708                                 if (!is_numeric($output)) {
709                                 cacti_log("is_not_numeric[$output]","");
710                                         if (prepare_validate_result($output) == false) {
711                                                 if (strlen($output) > 20) {
712                                                         $strout = 20;
713                                                 } else {
714                                                         $strout = strlen($output);
715                                                 }
716
717                                                 $error_ds[$data_source] = $data_source;
**[...]**
454                                         cacti_log('WARNING: Invalid Response(s) (1), Errors[' . sizeof($error_ds) . '] Device[' . $last_host . '] Thread[1] DS[' . implode(', ', $error_ds) . ']', false, 'POLLER');
[...]
769                 cacti_log('WARNING: Invalid Response(s) (2), Errors[' . sizeof($error_ds) . '] Device[' . $last_host . '] Thread[1] DS[' . implode(', ', $error_ds) . ']', false, 'POLLER');

And here are the logs i get after running cmd.php manually :

2020/01/16 14:53:27 - CMDPHP output[PHP Script Server has Started - Parent is cmd] ds[1074627]                                                                                                                                       
2020/01/16 14:53:27 - CMDPHP is_not_numeric[PHP Script Server has Started - Parent is cmd]
2020/01/16 14:53:27 - CMDPHP output[62453669888] ds[1074627]
2020/01/16 14:53:27 - CMDPHP output[214759895040] ds[1074628]
2020/01/16 14:53:27 - CMDPHP output[183188824064] ds[1074628]
2020/01/16 14:53:27 - CMDPHP output[384808513536] ds[1074629]
2020/01/16 14:53:27 - CMDPHP output[88438829056] ds[1074629]
2020/01/16 14:53:27 - CMDPHP output[1199111663616] ds[1074630]
2020/01/16 14:53:27 - CMDPHP output[153237323776] ds[1074630]
2020/01/16 14:53:27 - CMDPHP output[429493518336] ds[1074631]
2020/01/16 14:53:27 - CMDPHP output[312297848832] ds[1074631]
2020/01/16 14:53:27 - CMDPHP output[429493518336] ds[1074632]
2020/01/16 14:53:27 - CMDPHP output[53145567232] ds[1074632]
2020/01/16 14:53:27 - CMDPHP output[214745153536] ds[1074633]
2020/01/16 14:53:27 - CMDPHP output[148907360256] ds[1074633]
2020/01/16 14:53:27 - CMDPHP output[214745153536] ds[1074634]
2020/01/16 14:53:27 - CMDPHP output[31986352128] ds[1074634]
2020/01/16 14:53:27 - CMDPHP output[34359148544] ds[1074635]
2020/01/16 14:53:27 - CMDPHP output[34916925440] ds[1074635]
2020/01/16 14:53:27 - POLLER: Poller[13] WARNING: Invalid Response(s) (2), Errors[1] Device[65695] Thread[1] DS[1074627]

From what i read here, data source 1074627 (thought cacti get a correct reading in the next iteration of the loop) gets broken because of script_server.php disclaimer message, and is then set as in error in $error_ds (717), which leads to this invalid response log.

Is this an expected behaviour ?

I dont know how close is the cmd.php to spine in its inner mechanisms, i modified it to try to give you more insight about this issue.

I believe the php script server shouldn't return "PHP Script Server has Started - Parent is cmd" when called if possible, or the output validating regarding this should be more strict.

netniV commented 4 years ago

That should really only be seen in a debug mode I thought say, but seems to be related to running cacti on windows for sure. Likely it was added in to diagnose something and has been a left over though in theory, script server should ignore all output that doesn't conform IMHO, @cigamit or @TheWitness may have thoughts on that as they have both been running stuff a lot longer than I have.

sysres-dev commented 4 years ago

Hi,

Commenting the lines that output "PHP Script Server has Started - [...]" in the script_server.php solves my problem. Now every graph is as it should be!

TheWitness commented 4 years ago

That is a red herring. I'm going to close this one.