Issues with polling loop may skip some datasources

leonardocuei commented 4 years ago

(have posted on forums.cacti.net, but also posting here to might get some light)

Hello Guys ... i'm running cacti 1.2.14 on a CentOS 8 box and, with cmd.php as the poller, everything works just fine. I'm doing some basic monitoring of a few linux boxes, nothing fancy, interfaces and some Host MIB graphs. MySQL is Community Version, 8.0.21. Cacti boost is disabled. Single local poller, no remote pollers at all.

I have also compiled spine (1.2.14, matching version to cacti) and, when trying to use it, i'm getting some crazy results ... it's kind of working, some graphs are working just fine on spine, but most of script queries/scripts data sources stop working completly. I have already searched forums and Google and tried several things:

1) MySQL max connections is appropriate 2) logs and RRD files seems to have appropriate permission because, as I told, some graphs just keep working fine. Logs are being written by spine successfully as well 3) have tried SUID the spine binary, as found on spine error message when running from cli without SUID, but that didn't changed anything

From the cacti.log files, I can see that when using cmd.php, the RRDsProcessed number matches exactly the number of "Data Sources" I have. The actual "DataSources" number is higher because I believe some RRDs have more than one information, just like traffic ones have in and out data. However, 148 matches exactly my data source count from the Web UI, which is basically the number of RRD files. Everything OK.

As soon as I change to spine, RRDsProcessed drops from 148 to 80, and I start getting some "poller output table not empty" warnings, however I simply cannot find what's causing that. I'm running log level on MEDIUM, have tried HIGH, but couldn't find anything that could help.

[root@firewall log]# tail -f cacti.log | egrep -i '(SYSTEM STATS: Time|empty)'
06/Oct/2020 11:06:24 - SYSTEM STATS: Time:82.7348 Method:cmd.php Processes:6 Threads:0 Hosts:14 HostsPerProcess:3 DataSources:206 RRDsProcessed:148
06/Oct/2020 11:11:25 - SYSTEM STATS: Time:82.7319 Method:cmd.php Processes:6 Threads:0 Hosts:14 HostsPerProcess:3 DataSources:206 RRDsProcessed:148
(changed poller to spine, all the rest is kept unchanged)
06/Oct/2020 11:15:30 - SYSTEM STATS: Time:28.7387 Method:spine Processes:6 Threads:16 Hosts:14 HostsPerProcess:3 DataSources:206 RRDsProcessed:80
06/Oct/2020 11:20:02 - POLLER: Poller[1] WARNING: Poller Output Table not Empty.  Issues: 8, DS[3110, 3113, 3114, 3117, 3122, 3125, 3156, 3159]
06/Oct/2020 11:20:31 - SYSTEM STATS: Time:28.6560 Method:spine Processes:6 Threads:16 Hosts:14 HostsPerProcess:3 DataSources:206 RRDsProcessed:81
06/Oct/2020 11:25:01 - POLLER: Poller[1] WARNING: Poller Output Table not Empty.  Issues: 10, DS[3106, 3109, 3110, 3113, 3114, 3117, 3122, 3125, 3156, 3159]
06/Oct/2020 11:30:02 - POLLER: Poller[1] WARNING: Poller Output Table not Empty.  Issues: 9, DS[3110, 3113, 3114, 3117, 3122, 3125, 3129, 3156, 3159]
06/Oct/2020 11:30:30 - SYSTEM STATS: Time:28.6543 Method:spine Processes:6 Threads:16 Hosts:14 HostsPerProcess:3 DataSources:206 RRDsProcessed:82

poller_output MySQL table is basically populated with traffic_out values, that might indicate something ... table is getting flushed between runs however, I can see that because 'time' values are always from last run and are not getting accumulated there

mysql> select * from poller_output;
+---------------+-------------+---------------------+-----------+
| local_data_id | rrd_name    | time                | output    |
+---------------+-------------+---------------------+-----------+
|          3110 | traffic_out | 2020-10-06 11:25:01 | 187044682 |
|          3113 | traffic_out | 2020-10-06 11:25:01 | 5442512   |
|          3114 | traffic_out | 2020-10-06 11:25:01 | 536826197 |
|          3117 | traffic_out | 2020-10-06 11:25:01 | 10807572  |
|          3122 | traffic_out | 2020-10-06 11:25:01 | 238920335 |
|          3125 | traffic_out | 2020-10-06 11:25:01 | 5464841   |
|          3129 | traffic_out | 2020-10-06 11:25:04 | 1519394   |
|          3156 | traffic_out | 2020-10-06 11:25:01 | 246043252 |
|          3159 | traffic_out | 2020-10-06 11:25:01 | 5407670   |
+---------------+-------------+---------------------+-----------+
9 rows in set (0.00 sec)

Other data sources, which are simply script ones, that works flawlessly with cmd.php, simply stop being called at all. They simply vanished from the logs. Example:

06/Oct/2020 11:05:27 - POLLER: Poller[1] Device[88] DS[2954] CMD: /var/www/html/admin/cacti/scripts/ping_latency.pl '177.x.x.x' '50', output: roundtrip:8 packetloss:0
06/Oct/2020 11:10:28 - POLLER: Poller[1] Device[88] DS[2954] CMD: /var/www/html/admin/cacti/scripts/ping_latency.pl '177.x.x.x' '50', output: roundtrip:11 packetloss:0
(changed to spine at 11:12, and data source 2954 is gone from the logs)

and of course, as soon I change back to cmd.php, RRDsProcessed gets back to its normal:

06/Oct/2020 11:36:23 - SYSTEM STATS: Time:80.7232 Method:cmd.php Processes:6 Threads:0 Hosts:14 HostsPerProcess:3 DataSources:206 RRDsProcessed:148

I have already used cacti/spine on older versions and never had similar problems, I'm really out of ideas here and would love some ideas from you guys :) THanks!

TheWitness commented 4 years ago

You've got a large number of threads and processes for just 14 hosts. Lower the number a bit to like 1 process 4 threads and see if the same thing happens.

leonardocuei commented 4 years ago

You've got a large number of threads and processes for just 14 hosts. Lower the number a bit to like 1 process 4 threads and see if the same thing happens.

Got it ... got some new hosts configured, but i'm on 25 now, for sure a low number of devices. Lowered Data Colletor/Default Data Collector (the only one) processes to 1 and threads to 4

Settings, Poller, changed poller type to spine, let's see ...

08/Oct/2020 21:51:54 - SYSTEM STATS: Time:113.0664 Method:cmd.php Processes:8 Threads:0 Hosts:25 HostsPerProcess:4 DataSources:452 RRDsProcessed:245 ...... 08/Oct/2020 22:05:16 - SYSTEM STATS: Time:14.3004 Method:spine Processes:1 Threads:4 Hosts:25 HostsPerProcess:25 DataSources:452 RRDsProcessed:138 08/Oct/2020 22:10:13 - SYSTEM STATS: Time:11.3119 Method:spine Processes:1 Threads:4 Hosts:25 HostsPerProcess:25 DataSources:452 RRDsProcessed:143

and RRDsProcessed falls from 245 to a little more than half of that even with processes/threads lowered as requested :( My scripted data sources stopped being pulled completly, there's not a trace of them on the logs anymore.

Found it interesting that RRDsProcessed got from 138 to 143 between runs. No device went up or down, absolutely nothing changed between those runs. Will let it run on spine for some cycles, maybe it will get back on tracks by itself ... and no, seems not. While the RRDsProcessed numbers always changes between runs, it's stuck there close to 140-150. And I'm now getting "Poller Output Table not Empty" messages among runs. Checked and all data sources referenced are the Traffic Interface one, a very basic data source. The exact data sources changes among runs, but always interface traffic data sources listed on "output table not empty"

08/Oct/2020 22:15:16 - SYSTEM STATS: Time:14.3059 Method:spine Processes:1 Threads:4 Hosts:25 HostsPerProcess:25 DataSources:452 RRDsProcessed:146 08/Oct/2020 22:20:01 - POLLER: Poller[1] WARNING: Poller Output Table not Empty. Issues: 13, DS[3110, 3128, 3148, 3150, 3152, 3246, 3320, 3327, 3335, 3343, 3346, 3347, 3354] 08/Oct/2020 22:20:17 - SYSTEM STATS: Time:15.3162 Method:spine Processes:1 Threads:4 Hosts:25 HostsPerProcess:25 DataSources:452 RRDsProcessed:142 08/Oct/2020 22:25:03 - POLLER: Poller[1] WARNING: Poller Output Table not Empty. Issues: 13, DS[3124, 3148, 3150, 3152, 3154, 3270, 3276, 3320, 3327, 3335, 3343, 3346, 3347] 08/Oct/2020 22:25:20 - SYSTEM STATS: Time:17.3080 Method:spine Processes:1 Threads:4 Hosts:25 HostsPerProcess:25 DataSources:452 RRDsProcessed:138 08/Oct/2020 22:30:01 - POLLER: Poller[1] WARNING: Poller Output Table not Empty. Issues: 9, DS[3134, 3148, 3154, 3320, 3327, 3335, 3343, 3346, 3347] 08/Oct/2020 22:30:10 - SYSTEM STATS: Time:9.2903 Method:spine Processes:1 Threads:4 Hosts:25 HostsPerProcess:25 DataSources:452 RRDsProcessed:141 08/Oct/2020 22:35:01 - POLLER: Poller[1] WARNING: Poller Output Table not Empty. Issues: 11, DS[3148, 3150, 3152, 3154, 3246, 3320, 3327, 3335, 3343, 3346, 3347] 08/Oct/2020 22:35:12 - SYSTEM STATS: Time:10.3022 Method:spine Processes:1 Threads:4 Hosts:25 HostsPerProcess:25 DataSources:452 RRDsProcessed:146

get back to cmd.php and RRDsProcessed gets back to expected immediately

08/Oct/2020 22:41:47 - SYSTEM STATS: Time:105.1662 Method:cmd.php Processes:8 Threads:0 Hosts:25 HostsPerProcess:4 DataSources:452 RRDsProcessed:245

so changing Processes/Threads to a more sane number didn't changed the behavior

netniV commented 4 years ago

If you up the logging levels, are you having any timeouts communicating with the devices?

leonardocuei commented 4 years ago

If you up the logging levels, are you having any timeouts communicating with the devices?

I'm running logging on MEDIUM and timeouts are not getting logged. Devices are not going down when spine is used and some graphs will kept being generated with no problems. It doesn't seem to be related to timeouts/down devices or failure on that up/down detection subsystems.

If you believe that might help, it's pretty easy for me to have a run on DEBUG logging on cmd.php and another one with spine.

TheWitness commented 4 years ago

Run this SQL command

SELECT rrd_step, COUNT(*) 
FROM data_template_data 
GROUP BY rrd_step;

There may be a template issue where inbound has one collection rate and outbound has another. Post what you get from that query first.

leonardocuei commented 4 years ago

Run this SQL command
SELECT rrd_step, COUNT(*) 
FROM data_template_data 
GROUP BY rrd_step;
There may be a template issue where inbound has one collection rate and outbound has another. Post what you get from that query first.

seems everything is 5 minutes ... i have never used a polling rate different from the standard 5 minutes since MRTG era :)


mysql> SELECT rrd_step, COUNT(*) FROM data_template_data GROUP BY rrd_step;
+----------+----------+
| rrd_step | COUNT(*) |
+----------+----------+
|      300 |      292 |
+----------+----------+
1 row in set (0.00 sec)

mysql>

TheWitness commented 4 years ago

Can you increase your snmp timeout?

TheWitness commented 4 years ago

Set max OIDS to 1 too.

leonardocuei commented 4 years ago

Can you increase your snmp timeout?

Sure! SNMP Timeout was changed in all devices from 2000 (milisseconds, i believe, i was using the default, so 2 seconds) to 5000 ms. Max OIDs was reduced to 1 as requested.

Data Collector was configured back to 1 process and 4 threads (as requested earlier) after changing Poller Type to spine.

Results are ... seems nothing changed, RRDsProcessed drops substantially as soon as spine runs for the 1st rime. First line is my cmd.php line for comparisons. Scripted data sources stop running immediately with no noticeable error on the logs (just like all the other tries). "Poller Output Table not Empty" appears again ...


10/Oct/2020 16:46:37 - SYSTEM STATS: Time:95.0332 Method:cmd.php Processes:8 Threads:0 Hosts:20 HostsPerProcess:3 DataSources:445 RRDsProcessed:237

10/Oct/2020 16:50:37 - SYSTEM STATS: Time:35.3381 Method:spine Processes:1 Threads:4 Hosts:20 HostsPerProcess:20 DataSources:445 RRDsProcessed:161
10/Oct/2020 16:55:01 - POLLER: Poller[1] WARNING: Poller Output Table not Empty.  Issues: 16, DS[3103, 3107, 3117, 3119, 3135, 3149, 3220, 3246, 3320, 3327, 3335, 3343, 3346, 3347, 3348, 3354]
10/Oct/2020 16:55:31 - SYSTEM STATS: Time:30.3252 Method:spine Processes:1 Threads:4 Hosts:20 HostsPerProcess:20 DataSources:445 RRDsProcessed:144
10/Oct/2020 17:00:02 - POLLER: Poller[1] WARNING: Poller Output Table not Empty.  Issues: 14, DS[3123, 3131, 3149, 3153, 3157, 3246, 3248, 3250, 3320, 3327, 3335, 3343, 3346, 3347]
10/Oct/2020 17:00:30 - SYSTEM STATS: Time:28.3686 Method:spine Processes:1 Threads:4 Hosts:20 HostsPerProcess:20 DataSources:445 RRDsProcessed:145
10/Oct/2020 17:05:01 - POLLER: Poller[1] WARNING: Poller Output Table not Empty.  Issues: 13, DS[3107, 3114, 3117, 3123, 3131, 3149, 3153, 3320, 3327, 3335, 3343, 3346, 3347]
10/Oct/2020 17:05:29 - SYSTEM STATS: Time:28.3464 Method:spine Processes:1 Threads:4 Hosts:20 HostsPerProcess:20 DataSources:445 RRDsProcessed:139
10/Oct/2020 17:10:01 - POLLER: Poller[1] WARNING: Poller Output Table not Empty.  Issues: 11, DS[3119, 3123, 3135, 3149, 3153, 3320, 3327, 3335, 3343, 3346, 3347]
10/Oct/2020 17:10:30 - SYSTEM STATS: Time:28.3322 Method:spine Processes:1 Threads:4 Hosts:20 HostsPerProcess:20 DataSources:445 RRDsProcessed:139
10/Oct/2020 17:15:01 - POLLER: Poller[1] WARNING: Poller Output Table not Empty.  Issues: 13, DS[3103, 3107, 3119, 3123, 3131, 3149, 3153, 3320, 3327, 3335, 3343, 3346, 3347]
10/Oct/2020 17:15:32 - SYSTEM STATS: Time:30.3295 Method:spine Processes:1 Threads:4 Hosts:20 HostsPerProcess:20 DataSources:445 RRDsProcessed:147

leonardocuei commented 4 years ago

I believe next step is choose a single device, disable all but that single device, and have a full run on cmd.php and another one with spine with DEBUG or DEVEL logging enabled. How about that? If you believe that can help, i'll be glad to provide them.

netniV commented 4 years ago

You are getting a different number of data sources being completed over different runs. That suggests that either some are working or some are not. I would suggest upping your logging level to see what is being polled successfully and how long it is taking to do them. To get a clear test, you may want to stop the automatic poller and run it manually.

leonardocuei commented 4 years ago

You are getting a different number of data sources being completed over different runs. That suggests that either some are working or some are not. I would suggest upping your logging level to see what is being polled successfully and how long it is taking to do them. To get a clear test, you may want to stop the automatic poller and run it manually.

Yes, I had noticed the "RRDsProcessed" number keeps changing while spine is selected. Funny is with cmd.php everything is fine and stable, which makes me believe there's not any big problem on any particular data source. I have already tried disabling all hosts, enabling a single one, changing to spine, changing log level to DEBUG, and there's absolutely nothing on the logs. Data sources are not being processed, not being logged, seems they are really being fully skipped somehow. Even with DEBUG logging, I couldn't find any reasonable information, the data source IDs that are not being pulled for example being logged. They are simply not being logged. They are not failing, they are not being pulled at all, not logged at all.

I'll try, this night, redo the 'one single host enabled' and having a full run on cmd.php and other on spine.php and attach them. To my best knowledge of being a sysadmin for almost 20 years, I wasn't able to figure out anything useful from the logs even on DEBUG level.

Will attach the logs as soon as I get them. Thanks!

leonardocuei commented 4 years ago

You are getting a different number of data sources being completed over different runs. That suggests that either some are working or some are not. I would suggest upping your logging level to see what is being polled successfully and how long it is taking to do them. To get a clear test, you may want to stop the automatic poller and run it manually.

do you think running on DEVEL logging would be better than DEBUG, or DEBUG is usually enough?

leonardocuei commented 4 years ago

cacti.log-debug.cmdphp.log cacti.log-debug.spine-run1.log cacti.log-debug.spine-run2.log cacti.log-debug.spine-run3.log

I have disabled all but 2 hosts, each one with exactly 7 data sources. Using cmd.php, as expected, I have "RRDsProcessed:14" after each run. Absolutely no problem on that.

1) i'm still running with Max OID per requests set to 1 as requested 2) got log level to DEBUG 3) changed poller to spine (1.2.14, matches cacti version) 4) adjusted number of processes to 1 and threads to 4 (as requested) 5) let cacti run (via cron) for the hosts and, as usual, I got different numbers in RRDsProcessed while spine was used.

I took the logs of each run (cmd.php and some spine runs) in different files. Main problem to spine seems to be the script and script queries data sources, despite absolutely no errors are being logged.

There's a specific data source (script/command one) that is absolutely not running while spine is selected. It's a ping command, which usually runs with 50 packets and 0,5s interval, so it can take up to 30 seconds to run. I have on settings, Poller, the option "Script and Script Queries Timeout" set to 45, so it shouldn't be a problem.

While watching closely the logs in DEBUG level, I noticed that when spine is being used, some data source IDs are apparently being queried more than it's supposed to. I believe by different threads. Example data source 3131, a "script query interface" DS. I should get two SNMP values: traffic_in and traffic_out. With cmd.php, I get exactly the two expected results:

[root@firewall log]# grep 3131 cacti.log-debug.cmdphp.log
13/Oct/2020 20:55:27 - POLLER: Poller[1] Device[86] DS[3131] SNMP: v2: 189.xx.xx.xx, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.3, output: 293477245884
13/Oct/2020 20:55:27 - POLLER: Poller[1] Device[86] DS[3131] SNMP: v2: 189.xx.xx.xx, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.3, output: 60285028371
13/Oct/2020 20:55:28 - POLLER: Poller[1] CACTI2RRD: /usr/bin/rrdtool update /var/www/html/admin/cacti/rra/jbj_unit1_traffic_in_3131.rrd --skip-past-updates --template traffic_in:traffic_out 1602633302:293477245884:60285028371
[root@firewall log]#

with spine, it seems I'm getting an extra 3131 line from different HTs (threads?)

[root@firewall log]# grep 3131 cacti.log-debug.spine-run1.log
2020/10/13 21:05:02 - SPINE: Poller[1] PID[2658962] Device[86] HT[2] DS[3131] SNMP: v2: 189.xx.xx.xx, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.3, value: 293485144584
2020/10/13 21:05:02 - SPINE: Poller[1] PID[2658962] Device[86] HT[2] DS[3131] SNMP: v2: 189.xx.xx.xx, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.3, value: 60286864160
13/Oct/2020 21:05:02 - POLLER: Poller[1] CACTI2RRD: /usr/bin/rrdtool update /var/www/html/admin/cacti/rra/jbj_unit1_traffic_in_3131.rrd --skip-past-updates --template traffic_in:traffic_out 1602633902:293485144584:60286864160
2020/10/13 21:05:02 - SPINE: Poller[1] PID[2658962] Device[86] HT[1] DS[3131] SNMP: v2: 189.xx.xx.xx, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.3, value: 293485144584
[root@firewall log]#

Maybe spine is getting lost with some index, running some DSs more than once, and leaving some DSs behind because it believes to had run everything needed? Would that make any sense?

About my script/command data sources, as said, they are really not being polled at all while using spine as poller. Here's log (from DEBUG mode) using cmd.php

[root@firewall log]# grep latency cacti.log-debug.cmdphp.log
13/Oct/2020 20:55:27 - POLLER: Poller[1] Device[86] DS[2952] CMD: /var/www/html/admin/cacti/scripts/ping_latency.pl '189.xx.xx.xx' '50', output: roundtrip:1 packetloss:0
13/Oct/2020 20:55:27 - POLLER: Poller[1] Device[88] DS[2954] CMD: /var/www/html/admin/cacti/scripts/ping_latency.pl '177.xx.xx.xx' '50', output: roundtrip:4 packetloss:0
[root@firewall log]#

I have 3 separate log from spine runs, with debug logging, and "latency" (the script name) simply do not appear on them. The data source ID, neither as well.

[root@firewall log]# grep latency cacti.log-debug.spine-run?.log
[root@firewall log]# grep 2952 cacti.log-debug.spine-run?.log
[root@firewall log]# grep 2954 cacti.log-debug.spine-run?.log
[root@firewall log]#

I'm really lost ... by everything that I could try here, it's not about spine not being able to get the data source values, by some permission failure or anything like that. While spine is used, there are some DSs that are not being queried at all. No errors, no warning, nothing. They are simply gone from the run (and logs).

I'll attach logs from a full DEBUG run, 2 hosts only, with cmd.php and 3 different runs with spine, all with DEBUG logging. I really can't see where I'm failing, if that's the case, in some setup or anything like that. And I thank you very much for your attention on this one :)

TheWitness commented 3 years ago

I've got little time to work on this at the moment. The latest spine includes some timing information if you turn logging to level 5.

./spine -R -V 5 -S -H hostid

Do that for the host id's that have issues. You should be able to ferret it out.

leonardocuei commented 3 years ago

I've got little time to work on this at the moment. The latest spine includes some timing information if you turn logging to level 5.
./spine -R -V 5 -S -H hostid
Do that for the host id's that have issues. You should be able to ferret it out.

will give it a try and keep you informed. I'm kind of busy this week, but will try it on the next days. Thanks for the feedback!

leonardocuei commented 3 years ago

I've got little time to work on this at the moment. The latest spine includes some timing information if you turn logging to level 5.
./spine -R -V 5 -S -H hostid
Do that for the host id's that have issues. You should be able to ferret it out.

upgraded cacti and spine to 1.2.16 and nothing changed from the previous reports. As soon as I change from cmd.php to spine, number of RRDsProcessed dropped:

16/Feb/2021 22:11:36 - SYSTEM STATS: Time:94.9639 Method:cmd.php Processes:8 Threads:1 Hosts:18 HostsPerProcess:3 DataSources:433 RRDsProcessed:229
16/Feb/2021 22:16:22 - SYSTEM STATS: Time:79.9228 Method:cmd.php Processes:8 Threads:1 Hosts:18 HostsPerProcess:3 DataSources:433 RRDsProcessed:235
16/Feb/2021 22:20:38 - SYSTEM STATS: Time:36.8496 Method:spine Processes:8 Threads:1 Hosts:18 HostsPerProcess:3 DataSources:433 RRDsProcessed:147
(forget to make adjustments on processes, threads, max oids, that were done after this first spine run but nothing really changed)
16/Feb/2021 22:25:28 - SYSTEM STATS: Time:26.3303 Method:spine Processes:1 Threads:4 Hosts:18 HostsPerProcess:18 DataSources:433 RRDsProcessed:146

Tried running spine to a single host, as requested, and it runs fine, but is clearly skipping, that I can see, one data source: (please note there's no sign of data source 2951 on 22:20 and 22:25 timestamps, in which spine was selected)

16/Feb/2021 22:06:34 - POLLER: Poller[1] Device[31] DS[2951] TT[25058.37] CMD: /var/www/html/admin/cacti/scripts/ping_latency.pl '10.100.10.1' '50', output: roundtrip:71 packetloss:4
16/Feb/2021 22:11:34 - POLLER: Poller[1] Device[31] DS[2951] TT[25056.42] CMD: /var/www/html/admin/cacti/scripts/ping_latency.pl '10.100.10.1' '50', output: roundtrip:72 packetloss:2
16/Feb/2021 22:16:20 - POLLER: Poller[1] Device[31] DS[2951] TT[25036.06] CMD: /var/www/html/admin/cacti/scripts/ping_latency.pl '10.100.10.1' '50', output: roundtrip:73 packetloss:6
16/Feb/2021 22:31:20 - POLLER: Poller[1] Device[31] DS[2951] TT[25057] CMD: /var/www/html/admin/cacti/scripts/ping_latency.pl '10.100.10.1' '50', output: roundtrip:69 packetloss:4

Attached is the spine run for a single host, log was set to DEBUG. There's nothing special or interesting on the logs, nothing caught my attention. The only information that caught my eyes is the very last one:

Time: 6.3254 s, Threads: 4, Devices: 2 2 devices??? No, no ... only one. Maybe the device counter is getting crazy and, because of that, some data sources are being ignored?? .

debug-run.txt

leonardocuei commented 3 years ago

is there some limitation/restriction on spine running perl CMD commands? It seems to be running a SINGLE data source with perl script on each run, while on cmd.php it would be 11 data sources with perl scripts ... on the debug-log attached, the ping_latency.pl script is not called at all (yes it's there, confirmed here). From the attached logs, data source 2951 is simply gone, it's never even called.

My PHP scripts, on the other hand, works just fine when spine is used.

[root@firewall log]# grep ping_latency.pl cacti.log | grep spine
2021/02/16 22:20:27 - SPINE: Poller[1] PID[146800] Device[157] HT[1] DS[3416] TT[24648.14] SCRIPT: /var/www/html/admin/cacti/scripts/ping_latency.pl '10.100.17.1' '50', output: roundtrip:52 packetloss:0
2021/02/16 22:25:27 - SPINE: Poller[1] PID[147027] Device[157] HT[1] DS[3416] TT[24618.23] SCRIPT: /var/www/html/admin/cacti/scripts/ping_latency.pl '10.100.17.1' '50', output: roundtrip:51 packetloss:0
[root@firewall log]#

.

[root@firewall log]# grep ping_latency.pl cacti.log  | grep "16/Feb/2021 22:3[01]" | wc -l
11
[root@firewall log]# grep ping_latency.pl cacti.log  | grep "16/Feb/2021 22:3[56]" | wc -l
11
[root@firewall log]# grep ping_latency.pl cacti.log  | grep "16/Feb/2021 22:4[01]" | wc -l
11
[root@firewall log]#

TheWitness commented 3 years ago

with spine, you have to output all content in one go. Also, note that there are two compile options for spine, one is nifty popen and the other is traditional popen. The popen initializes an environment, while nifty does not. So if you script requires and environment, then you should use the popen option, but it will also limit scalability as well.

leonardocuei commented 3 years ago

with spine, you have to output all content in one go. Also, note that there are two compile options for spine, one is nifty popen and the other is traditional popen. The popen initializes an environment, while nifty does not. So if you script requires and environment, then you should use the popen option, but it will also limit scalability as well.

tried recompiling spine with "--enable-popen" and manually run for a single host, see the logging changes (from NITFY POPEN to POPEN) but data source keeps being ignored. Checked my script and it outputs all information on a single 'print' line, I believe that's 'all in one go'.

The fact that manual spine run shows nothing (no errors, no anything) about those data sources, any idea on how to debug those?

TheWitness commented 3 years ago

See if it's missing from the poller cache.

TheWitness commented 3 years ago

Also, run spine as follows

./spine -V 3 -R -S -H id

Replacing id with the device id.

leonardocuei commented 3 years ago

think I found something ... and it seems to be MySQL query optimization related

watching SELECT queries from the cacti debug output, I got (queries are the same, just changing the LIMIT at the end, which uses the adjusted number of threads for the host)

DEVDBG: SQL:SELECT action, hostname, snmp_community, snmp_version, snmp_username, snmp_password, rrd_name, rrd_path, arg1, arg2, arg3, local_data_id, rrd_num, snmp_port, snmp_timeout, snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context, snmp_engine_id  FROM poller_item WHERE host_id=31 AND rrd_next_step <=0 AND poller_id=1 ORDER by snmp_port LIMIT 0, 3
DEVDBG: SQL:SELECT snmp_port, count(snmp_port) FROM poller_item WHERE host_id=31 AND rrd_next_step <=0 AND poller_id=1 GROUP BY snmp_port LIMIT 3, 3
DEVDBG: SQL:SELECT action, hostname, snmp_community, snmp_version, snmp_username, snmp_password, rrd_name, rrd_path, arg1, arg2, arg3, local_data_id, rrd_num, snmp_port, snmp_timeout, snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context, snmp_engine_id  FROM poller_item WHERE host_id=31 AND rrd_next_step <=0 AND poller_id=1 ORDER by snmp_port LIMIT 3, 3
DEVDBG: SQL:SELECT snmp_port, count(snmp_port) FROM poller_item WHERE host_id=31 AND rrd_next_step <=0 AND poller_id=1 GROUP BY snmp_port LIMIT 6, 3
DEVDBG: SQL:SELECT action, hostname, snmp_community, snmp_version, snmp_username, snmp_password, rrd_name, rrd_path, arg1, arg2, arg3, local_data_id, rrd_num, snmp_port, snmp_timeout, snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context, snmp_engine_id  FROM poller_item WHERE host_id=31 AND rrd_next_step <=0 AND poller_id=1 ORDER by snmp_port LIMIT 6, 3
DEVDBG: SQL:SELECT snmp_port, count(snmp_port) FROM poller_item WHERE host_id=31 AND rrd_next_step <=0 AND poller_id=1 GROUP BY snmp_port LIMIT 9, 3
DEVDBG: SQL:SELECT action, hostname, snmp_community, snmp_version, snmp_username, snmp_password, rrd_name, rrd_path, arg1, arg2, arg3, local_data_id, rrd_num, snmp_port, snmp_timeout, snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context, snmp_engine_id  FROM poller_item WHERE host_id=31 AND rrd_next_step <=0 AND poller_id=1 ORDER by snmp_port LIMIT 9, 3

focusing on the "SELECT action,hostname..." query ... if I run that without the LIMIT clause, I correctly get the 9 datasources for that host: (including the one i'm missing)

However, if I run the 4 queries (LIMIT 0,3, LIMIT 3,3, LIMIT 6,3 and LIMIT 9,3), that data source i'm missing is never returned and some data sources are repeated on the separated runs.

From MySQL "Limit Query Optimization" page (https://dev.mysql.com/doc/refman/8.0/en/limit-optimization.html), I got some good information about this situation, in which a query using LIMIT might return different results than the regular query (without LIMIT).

If multiple rows have identical values in the ORDER BY columns, the server is free to return those rows in any order, and may do so differently depending on the overall execution plan. In other words, the sort order of those rows is nondeterministic with respect to the nonordered columns. [ ..... ] If it is important to ensure the same row order with and without LIMIT, include additional columns in the ORDER BY clause to make the order deterministic.

That all being said, i edited poller.c and added another sort criteria, to make order unique (as requested by MySQL documentation). I used local_data_id for that, that is surely unique. Tried running the queries manually with the new 'order by' criterias and now all data sources are returned with the LIMIT being used, as expected, including the one I was missing.

[root@firewall cacti-spine-1.2.16]# diff -Naur poller.c~ poller.c
--- poller.c~   2020-11-30 15:21:59.000000000 -0300
+++ poller.c    2021-02-17 09:53:53.318527589 -0300
@@ -260,7 +260,7 @@
                        " FROM poller_item"
                        " WHERE host_id = %i"
                        " AND deleted = ''"
-                       " ORDER BY snmp_port %s", host_id, limits);
+                       " ORDER BY local_data_id,snmp_port %s", host_id, limits);

                /* host structure for uptime checks */
                snprintf(query2, BIG_BUFSIZE,
@@ -292,7 +292,7 @@
                                "snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context, snmp_engine_id "
                        " FROM poller_item"
                        " WHERE host_id=%i AND rrd_next_step <=0"
-                       " ORDER by snmp_port %s", host_id, limits);
+                       " ORDER by local_data_id,snmp_port %s", host_id, limits);

                /* query to setup the next polling interval in cacti */
                snprintf(query6, BUFSIZE,
@@ -332,7 +332,7 @@
                                "snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context, snmp_engine_id "
                        " FROM poller_item"
                        " WHERE host_id=%i AND poller_id=%i"
-                       " ORDER BY snmp_port %s", host_id, set.poller_id, limits);
+                       " ORDER BY local_data_id,snmp_port %s", host_id, set.poller_id, limits);

                /* host structure for uptime checks */
                snprintf(query2, BIG_BUFSIZE,
@@ -364,7 +364,7 @@
                                "snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context, snmp_engine_id "
                        " FROM poller_item"
                        " WHERE host_id=%i AND rrd_next_step <=0 AND poller_id=%i"
-                       " ORDER by snmp_port %s", host_id, set.poller_id, limits);
+                       " ORDER by local_data_id,snmp_port %s", host_id, set.poller_id, limits);

                /* query to setup the next polling interval in cacti */
                snprintf(query6, BUFSIZE,

Recompiled (without using "--enable-popen", so using the default nifty popen), ran spine manually ... and now ALL data sources are there!!!! :) So I think I found the problem. It was never spine related, but instead it was MySQL query optimization related and, given the exact query spine was using, the exact return order (as spine was expecting) might not happen. Adding a new sort criteria to ensure that seems to solve the problem.

If you believe that new sort criteria (or a different one if needed) to make the results deterministic won't impact the overal performance, please consider adding that change to the spine code. At least here, it fully solved this problem that really MIGHT happen given MySQL documentation.

As an extra information, I'm running MySQL 8.0, don't know if those optimizations changed among MySQL versions.

TheWitness commented 3 years ago

Please do a pull request againt 1.2.x please.

leonardocuei commented 3 years ago

I don't know how to do that ... will try it.

Thanks for all your attention!

leonardocuei commented 3 years ago

Please do a pull request againt 1.2.x please.

it turns out simply adding the "local_data_id" wasn't enough. Some things were still being skipped, for example, a traffic_in on a traffic graph was successfully queried but the traffic_out wasn't even queried. Seems local_data_id, alone, isn't enough to make order unique, as a traffic graphic, for example, will be in two different entries with the same local_data_id, one for traffic_in and the other for traffic_out. That new problem was being present by lots of: (which are on my initial reports as well)

POLLER: Poller[1] WARNING: Poller Output Table not Empty. Issues: 6, DS[3732, 3745, 3758, 3771, 3832, 3845]

That being said, I changed again the 'ORDER BY' clausule, keeping snmp_port as the first one, as requested. And now, finally, things are 100% OK.

I tried to create the pull request but, honestly, I don't know how to do that. Would you mind creating it, with the attached patch, if it's OK to you? Thanks!

--- poller.c.orig       2020-11-30 15:21:59.000000000 -0300
+++ poller.c    2021-02-18 00:18:15.491722901 -0300
@@ -260,7 +260,7 @@
                        " FROM poller_item"
                        " WHERE host_id = %i"
                        " AND deleted = ''"
-                       " ORDER BY snmp_port %s", host_id, limits);
+                       " ORDER BY snmp_port, arg1, rrd_path %s", host_id, limits);

                /* host structure for uptime checks */
                snprintf(query2, BIG_BUFSIZE,
@@ -292,7 +292,7 @@
                                "snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context, snmp_engine_id "
                        " FROM poller_item"
                        " WHERE host_id=%i AND rrd_next_step <=0"
-                       " ORDER by snmp_port %s", host_id, limits);
+                       " ORDER BY snmp_port, arg1, rrd_path %s", host_id, limits);

                /* query to setup the next polling interval in cacti */
                snprintf(query6, BUFSIZE,
@@ -332,7 +332,7 @@
                                "snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context, snmp_engine_id "
                        " FROM poller_item"
                        " WHERE host_id=%i AND poller_id=%i"
-                       " ORDER BY snmp_port %s", host_id, set.poller_id, limits);
+                       " ORDER BY snmp_port, arg1, rrd_path %s", host_id, set.poller_id, limits);

                /* host structure for uptime checks */
                snprintf(query2, BIG_BUFSIZE,
@@ -364,7 +364,7 @@
                                "snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context, snmp_engine_id "
                        " FROM poller_item"
                        " WHERE host_id=%i AND rrd_next_step <=0 AND poller_id=%i"
-                       " ORDER by snmp_port %s", host_id, set.poller_id, limits);
+                       " ORDER BY snmp_port, arg1, rrd_path %s", host_id, set.poller_id, limits);

                /* query to setup the next polling interval in cacti */
                snprintf(query6, BUFSIZE,

TheWitness commented 3 years ago

Ordering by arg1 ain't gonna happen. host_id, snmp_port, local_data_id is fine. Somehow, I think you have another problem, maybe its a cacti bug. What is your poller frequency, what the frequency on the profile in use? Did you change either?

leonardocuei commented 3 years ago

Ordering by arg1 ain't gonna happen. host_id, snmp_port, local_data_id is fine. Somehow, I think you have another problem, maybe its a cacti bug. What is your poller frequency, what the frequency on the profile in use? Did you change either?

5 minutes poller, nothing fancy, completly standard in that way. host_id, snmp_port and local_data_id, I believe, will still keep the chance of MySQL returning data out of the expected order, as those would be the same for a traffic graphic for example, with one entry for traffic_in and another one for traffic_out. host_id, snmp_port and local_data_id would be the same for both traffic_xxx, which wouldn't guarantee their unique order.

This order return on the MySQL queries clearly changed at some point on the MySQL code. I have another 'low-to-medium' size cacti instance running on MySQL 5.7 and never had to worry about this, regular 'ORDER BY' on spine code worked just fine. It seems very clear to me that, at some point after 5.7, this has changed drastically. (i'm having these problems on MySQL 8.0)

If arg1 can't be used, then OK. But the proposed ones, as explained, i don't believe would be enough to really keep their uniqueness and, thus, guarantee MySQL would return everything on the expected order.

leonardocuei commented 3 years ago

Would rrd_name be a good sort criteria? That, with host_id and local_data_id, would make entries unique.

TheWitness commented 3 years ago

I don't think you fully understand, the order is not that important outside of snmp port and host_id. That parent poller knows that traffic_in and traffic_out are a pair, and as long as they are both polled, everything works. You did not answer the second part of my question BTW.

leonardocuei commented 3 years ago

I don't think you fully understand, the order is not that important outside of snmp port and host_id. That parent poller knows that traffic_in and traffic_out are a pair, and as long as they are both polled, everything works. You did not answer the second part of my question BTW.

(sorry for the long post, I know i'm already taking too much of your time on this situation)

sorry for missing the second question ... do you mean the 'data source profiles'? If yes, it's 5 minutes (with 10 minutes heartbeat), the default one, never changed that. Poller is also being called from a cron task every 5 minutes.

About the order, I can see spine knows they are a pair. Thing is, and I found on MySQL 8.0 and it's even described on the MySQL docs, if the 'ORDER BY' on the query do not guarantee uniqueness among rows, LIMIT might return items out of the expected order. That's exactly what I was facing here. As the standard ORDER BY criterias wouldn't be enough to guarantee records would be unique, some items (from the poller_item table) were never being returned to the spine poller and, thus, never polled. I have no doubt there's absolutely (and never had) no spine bug involved, all the problems were caused by how MySQL return itens on the presence of LIMIT clauses and specific ORDER BY criterias that do not guarantee uniqueness of rows (as the MySQL documentation I posted earlier). So while not a spine bug, the MySQL queries, given some MySQL circunstances, might result on data being skipped because spine never received those.

Let me post some real queries, on my database, running MySQL 8.0. I'll use a simplified query, for ease of understanding only, ordered only by snmp_port, without LIMIT clause:

mysql> SELECT hostname, rrd_name, arg1, local_data_id, snmp_port  FROM poller_item WHERE host_id=29 AND rrd_next_step <=0 AND poller_id=1 ORDER BY snmp_port;
+------------+-------------------+---------------------------------------------------------------------+---------------+-----------+
| hostname   | rrd_name          | arg1                                                                | local_data_id | snmp_port |
+------------+-------------------+---------------------------------------------------------------------+---------------+-----------+
| 10.100.6.1 |                   | /var/www/html/admin/cacti/scripts/ping_latency.pl '10.100.6.1' '50' |          2948 |       161 |
| 10.100.6.1 | host_uptime_linux | .1.3.6.1.2.1.25.1.1.0                                               |          3020 |       161 |
| 10.100.6.1 | traffic_in        | .1.3.6.1.2.1.31.1.1.1.6.2                                           |          3114 |       161 |
| 10.100.6.1 | traffic_out       | .1.3.6.1.2.1.31.1.1.1.10.2                                          |          3114 |       161 |
| 10.100.6.1 | traffic_in        | .1.3.6.1.2.1.31.1.1.1.6.3                                           |          3115 |       161 |
| 10.100.6.1 | traffic_out       | .1.3.6.1.2.1.31.1.1.1.10.3                                          |          3115 |       161 |
| 10.100.6.1 | traffic_in        | .1.3.6.1.2.1.31.1.1.1.6.71                                          |          3117 |       161 |
| 10.100.6.1 | traffic_out       | .1.3.6.1.2.1.31.1.1.1.10.71                                         |          3117 |       161 |
+------------+-------------------+---------------------------------------------------------------------+---------------+-----------+
8 rows in set (0.00 sec)

I'll now redo the very same query but now using LIMIT like spine does to split data among threads. I'll try to split the dataset in 3 pieces:

mysql> SELECT hostname, rrd_name, arg1, local_data_id, snmp_port  FROM poller_item WHERE host_id=29 AND rrd_next_step <=0 AND poller_id=1 ORDER BY snmp_port LIMIT 0,3;
+------------+-------------+-----------------------------+---------------+-----------+
| hostname   | rrd_name    | arg1                        | local_data_id | snmp_port |
+------------+-------------+-----------------------------+---------------+-----------+
| 10.100.6.1 | traffic_out | .1.3.6.1.2.1.31.1.1.1.10.71 |          3117 |       161 |
| 10.100.6.1 | traffic_in  | .1.3.6.1.2.1.31.1.1.1.6.3   |          3115 |       161 |
| 10.100.6.1 | traffic_in  | .1.3.6.1.2.1.31.1.1.1.6.2   |          3114 |       161 |
+------------+-------------+-----------------------------+---------------+-----------+
3 rows in set (0.00 sec)

mysql> SELECT hostname, rrd_name, arg1, local_data_id, snmp_port  FROM poller_item WHERE host_id=29 AND rrd_next_step <=0 AND poller_id=1 ORDER BY snmp_port LIMIT 3,3;
+------------+-------------+----------------------------+---------------+-----------+
| hostname   | rrd_name    | arg1                       | local_data_id | snmp_port |
+------------+-------------+----------------------------+---------------+-----------+
| 10.100.6.1 | traffic_out | .1.3.6.1.2.1.31.1.1.1.10.2 |          3114 |       161 |
| 10.100.6.1 | traffic_in  | .1.3.6.1.2.1.31.1.1.1.6.3  |          3115 |       161 |
| 10.100.6.1 | traffic_out | .1.3.6.1.2.1.31.1.1.1.10.3 |          3115 |       161 |
+------------+-------------+----------------------------+---------------+-----------+
3 rows in set (0.00 sec)

mysql> SELECT hostname, rrd_name, arg1, local_data_id, snmp_port  FROM poller_item WHERE host_id=29 AND rrd_next_step <=0 AND poller_id=1 ORDER BY snmp_port LIMIT 6,3;
+------------+-------------+-----------------------------+---------------+-----------+
| hostname   | rrd_name    | arg1                        | local_data_id | snmp_port |
+------------+-------------+-----------------------------+---------------+-----------+
| 10.100.6.1 | traffic_in  | .1.3.6.1.2.1.31.1.1.1.6.71  |          3117 |       161 |
| 10.100.6.1 | traffic_out | .1.3.6.1.2.1.31.1.1.1.10.71 |          3117 |       161 |
+------------+-------------+-----------------------------+---------------+-----------+
2 rows in set (0.00 sec)

Please note that splitting the dataset with MySQL LIMIT, the local_data_id 3117 is now returned 3 times, while it was supposed to appear 2 times only. local_data_id = 2948 was NEVER returned. And this is the root of all problems I have reported. The way MySQL is returning the data, some data was never seen by the spine poller and, thus, never polled, resulting on some datasources being simply skipped (this 2948, for example, would be plain skipped on this run).

There's no spine bug, it never had, sorry for saying that initially. But given how MySQL 8.0 is returning data and the exact queries spine is using, some information MIGHT never reaches spine, and that's documented. I was able to "fix" that on my setup by tweaking ORDER BY criteria in a way that rows were really unique and, that way, LIMIT wouldn't miss data.

Challenge now would be find a good ORDER BY set that wouldn't kill performance, as it was the case with arg1 as I suggested. host_id, local_data_id and snmp_port, unfortunelly, doesn't seem to be good enough to guarantee uniqueness of returned rows.

TheWitness commented 3 years ago

Limit should start at 1 and not 0. That's odd. Are you assuming that or did you observe that?

leonardocuei commented 3 years ago

Limit should start at 1 and not 0. That's odd. Are you assuming that or did you observe that?

I observed that, there was always one LIMIT clause starting with zero. Will compile a fresh 1.2.16 spine and post full logs in minutes.

TheWitness commented 3 years ago

Well, that is wrong. Interesting.

leonardocuei commented 3 years ago

compiled a fresh binary from 1.2.16 spine source, called it (spine-vanilla), did a manual run and grepped for LIMIT on the output. Please note the original 'ORDER BY' clause, showing this is not the local tweaked version i'm running. Last queries are 'LIMIT 0,something' and not '1,something'.

(full log attached) spine-1.2.16-original-loglevel10.txt

[cacti@firewall cacti]$ ./spine-vanilla --version
SPINE 1.2.16  Copyright 2004-2020 by The Cacti Group
[cacti@firewall cacti]$ ./spine-vanilla -R -V 10 -S -H 160  > /dev/shm/spine-1.2.16-original-loglevel10.txt

[cacti@firewall cacti]$ grep -i limit /dev/shm/spine-1.2.16-original-loglevel10.txt
DEVDBG: SQL:SELECT action FROM poller_item WHERE action=2 AND host_id IN(160) LIMIT 1
DEVDBG: SQL:SELECT snmp_port, count(snmp_port) FROM poller_item WHERE host_id=160 AND rrd_next_step <=0 AND poller_id=1 GROUP BY snmp_port LIMIT 14, 14
DEVDBG: SQL:SELECT action, hostname, snmp_community, snmp_version, snmp_username, snmp_password, rrd_name, rrd_path, arg1, arg2, arg3, local_data_id, rrd_num, snmp_port, snmp_timeout, snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context, snmp_engine_id  FROM poller_item WHERE host_id=160 AND rrd_next_step <=0 AND poller_id=1 ORDER by snmp_port LIMIT 14, 14
DEVDBG: SQL:SELECT snmp_port, count(snmp_port) FROM poller_item WHERE host_id=160 AND rrd_next_step <=0 AND poller_id=1 GROUP BY snmp_port LIMIT 28, 14
DEVDBG: SQL:SELECT action, hostname, snmp_community, snmp_version, snmp_username, snmp_password, rrd_name, rrd_path, arg1, arg2, arg3, local_data_id, rrd_num, snmp_port, snmp_timeout, snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context, snmp_engine_id  FROM poller_item WHERE host_id=160 AND rrd_next_step <=0 AND poller_id=1 ORDER by snmp_port LIMIT 28, 14
DEVDBG: SQL:SELECT snmp_port, count(snmp_port) FROM poller_item WHERE host_id=160 AND rrd_next_step <=0 AND poller_id=1 GROUP BY snmp_port LIMIT 42, 14
DEVDBG: SQL:SELECT action, hostname, snmp_community, snmp_version, snmp_username, snmp_password, rrd_name, rrd_path, arg1, arg2, arg3, local_data_id, rrd_num, snmp_port, snmp_timeout, snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context, snmp_engine_id  FROM poller_item WHERE host_id=160 AND rrd_next_step <=0 AND poller_id=1 ORDER by snmp_port LIMIT 42, 14
DEVDBG: SQL:SELECT snmp_port, count(snmp_port) FROM poller_item WHERE host_id=160 AND rrd_next_step <=0 AND poller_id=1 GROUP BY snmp_port LIMIT 0, 14
DEVDBG: SQL:SELECT action, hostname, snmp_community, snmp_version, snmp_username, snmp_password, rrd_name, rrd_path, arg1, arg2, arg3, local_data_id, rrd_num, snmp_port, snmp_timeout, snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context, snmp_engine_id  FROM poller_item WHERE host_id=160 AND rrd_next_step <=0 AND poller_id=1 ORDER by snmp_port LIMIT 0, 14
[cacti@firewall cacti]$

leonardocuei commented 3 years ago

Well, that is wrong. Interesting.

from the MySQL documentation, it seems that LIMIT 0 is correct if you're looking for the first row, and not 1. That being said, LIMIT 0 appears to be correct.

https://dev.mysql.com/doc/refman/8.0/en/select.html

The LIMIT clause can be used to constrain the number of rows returned by the SELECT statement. ..... With two arguments, the first argument specifies the offset of the first row to return, and the second specifies the maximum number of rows to return. The offset of the initial row is 0 (not 1):

TheWitness commented 3 years ago

Yea, I was smoking dope on the limit issue. Too little coffee, too little time. I would like you to revert to the latest 1.2.x branch, and then for the device, switch to Device Threads 1, and let me know if the issue goes away. I think the bug might in in lib/poller.php in cacti. This will confirm it. It might also be 'masked' by using boost.

Once you setup that test, it'll let me know if the problem is actually there.

leonardocuei commented 3 years ago

will try the requested changes and get the new data asap. boost is not enabled, just to let you know.

TheWitness commented 3 years ago

Yea, I think if boost was enabled, you may have never seen this one.

TheWitness commented 3 years ago

What we might have to do is ensure that no data source spans a device thread. Once your testing comes back, I can give it more though. Some data sources have multiple items that should be inserted in one transaction, but if they are split across threads, they can get inserted at different times.

The lib/poller.php might be a bug, but when it comes to graphing, even using boost, an update might be lost.

leonardocuei commented 3 years ago

doing the tests here, tried to compile 1.2.x spine tree, but get compile errors on sql.c. Had to revert a recent change to get it to compile:

[root@firewall spine-1.2.x]# diff -Naur sql.c.orig sql.c
--- sql.c.orig  2021-02-22 09:59:10.000000000 -0300
+++ sql.c       2021-02-23 20:00:21.971053791 -0300
@@ -214,7 +214,7 @@
        int     options_error;
        int     success;
        int     error;
-       my_bool reconnect;
+       bool reconnect;
        MYSQL   *connect_error;
        char    *hostname = NULL;
        char    *socket = NULL;
[root@firewall spine-1.2.x]#

look on 1.2.16 code, and it's 'bool reconnect', which always compiled fine. Couldn't get it to compile with 'my_bool' Will do the tests requested with this 1.2.x compiled after changing from my_bool to bool.

leonardocuei commented 3 years ago

compiled spine 1.2.x (with the change described below), changed ALL devices "Device Threads" to 1, and let it run for 80 minutes. Got a single 'poller output not empty' with a single DS. Data Collector is unchanged, 1 process and 5 threads. Besides from that, everything looks smooth:

23/Feb/2021 20:21:20 - SYSTEM STATS: Time:78.4064 Method:spine Processes:1 Threads:5 Hosts:29 HostsPerProcess:29 DataSources:834 RRDsProcessed:440
23/Feb/2021 20:26:24 - SYSTEM STATS: Time:82.4059 Method:spine Processes:1 Threads:5 Hosts:29 HostsPerProcess:29 DataSources:834 RRDsProcessed:430
23/Feb/2021 20:30:01 - POLLER: Poller[1] WARNING: Poller Output Table not Empty.  Issues: 1, DS[3152]
23/Feb/2021 20:31:41 - SYSTEM STATS: Time:99.3930 Method:spine Processes:1 Threads:5 Hosts:29 HostsPerProcess:29 DataSources:834 RRDsProcessed:440
23/Feb/2021 20:36:20 - SYSTEM STATS: Time:78.3952 Method:spine Processes:1 Threads:5 Hosts:29 HostsPerProcess:29 DataSources:834 RRDsProcessed:440
23/Feb/2021 20:41:21 - SYSTEM STATS: Time:79.4086 Method:spine Processes:1 Threads:5 Hosts:29 HostsPerProcess:29 DataSources:834 RRDsProcessed:440

get "Device Threads" back to 4 for all devices and, immediatelly, problems described since the beginning appears again: RRDsProcessed dropped from 440 to 250-260-270, Poller output not empty started appear frequently, with several DSs.

23/Feb/2021 21:36:19 - SYSTEM STATS: Time:77.3784 Method:spine Processes:1 Threads:5 Hosts:29 HostsPerProcess:29 DataSources:834 RRDsProcessed:434
23/Feb/2021 21:40:57 - SYSTEM STATS: Time:56.3750 Method:spine Processes:1 Threads:5 Hosts:29 HostsPerProcess:29 DataSources:834 RRDsProcessed:256
23/Feb/2021 21:45:01 - POLLER: Poller[1] WARNING: Poller Output Table not Empty.  Issues: 12, DS[3107, 3111, 3146, 3149, 3153, 3157, 4006, 4011, 4090, 4099, 4107, 4143]
23/Feb/2021 21:45:47 - SYSTEM STATS: Time:46.3511 Method:spine Processes:1 Threads:5 Hosts:29 HostsPerProcess:29 DataSources:834 RRDsProcessed:268
23/Feb/2021 21:50:02 - POLLER: Poller[1] WARNING: Poller Output Table not Empty.  Issues: 8, DS[3146, 3149, 3157, 3932, 4090, 4099, 4107, 4116]
23/Feb/2021 21:50:44 - SYSTEM STATS: Time:42.3552 Method:spine Processes:1 Threads:5 Hosts:29 HostsPerProcess:29 DataSources:834 RRDsProcessed:253
23/Feb/2021 21:55:01 - POLLER: Poller[1] WARNING: Poller Output Table not Empty.  Issues: 9, DS[3111, 3146, 3149, 3153, 3157, 3932, 4090, 4099, 4107]
23/Feb/2021 21:55:48 - SYSTEM STATS: Time:46.3716 Method:spine Processes:1 Threads:5 Hosts:29 HostsPerProcess:29 DataSources:834 RRDsProcessed:261
23/Feb/2021 22:00:01 - POLLER: Poller[1] WARNING: Poller Output Table not Empty.  Issues: 10, DS[3131, 3146, 3149, 3157, 3880, 3885, 4090, 4099, 4107, 4116]
23/Feb/2021 22:00:44 - SYSTEM STATS: Time:42.3625 Method:spine Processes:1 Threads:5 Hosts:29 HostsPerProcess:29 DataSources:834 RRDsProcessed:255
23/Feb/2021 22:05:01 - POLLER: Poller[1] WARNING: Poller Output Table not Empty.  Issues: 10, DS[3107, 3111, 3131, 3146, 3149, 4090, 4099, 4107, 4136, 4143]

So it really seems getting "Device Threads" to 1 pretty much get away with the problem I was experiencing.

Tried reverting spine to the stock 1.2.16 (instead of git 1.2.x), changed "Devices Threads" back to 1 for all devices. Pretty much same results from 1.2.x, everything seems fine with Device Threads set to 1.

Please let me know if I can provide you any other information, I'll be glad to grab them here.

TheWitness commented 3 years ago

Well, increase your spine threads. Ensure that you always have a multiple of the max device threads allocated. So, and only if you need to, set device threads to 4, make sure spine get 4 or 8 or 12 threads. Upto 20 is fine.

TheWitness commented 3 years ago

But this highlights that it's the poller.php that is the problem.

leonardocuei commented 3 years ago

But this highlights that it's the poller.php that is the problem.

Are there any further tests you believe could be done to help tracking that? If yes, i'll be glad to run them.

leonardocuei commented 3 years ago

for testing purposes only, do you think it would be a very bad idea adding a new column to the poller_item table, an integer auto_increment one, having a key on that one, and using this new column to "fix" the "ORDER BY" clauses on poller.c? Having an integer and indexed new column, for sure, would solve that uniqueness problem on the MySQL queries and wouldn't hurt performance in a meaningful way, which using all those other column would for sure, as they would likely require full table scans.

do you think that would be a very bad test/approach?

TheWitness commented 3 years ago

There is already a dsnum column that is supposed to keep all the output in the poller cache until they are all present. So, if traffic_in comes from thread1 first, its not supposed to not be flushed until traffic_out shows us from thread2. It looks like that is broken ATM. So there are two possible solutions, find a way to distribute the threads without breaking that relationship, or fix lib/poller.php.

leonardocuei commented 3 years ago

There is already a dsnum column that is supposed to keep all the output in the poller cache until they are all present. So, if traffic_in comes from thread1 first, its not supposed to not be flushed until traffic_out shows us from thread2. It looks like that is broken ATM. So there are two possible solutions, find a way to distribute the threads without breaking that relationship, or fix lib/poller.php.

on the poller_item table? I don't have it column here. As this is an old install with lots of upgrades to be on the current 1.2.16 version, maybe some cacti upgrade missed that column. To confirm that, I download a fresh 1.2.16 cacti tgz, extracted it, look for dsnum on the cacti.sql database init file, and it's not there neither. Can't see this column anywhere ...

TheWitness commented 3 years ago

It's called rrd_num.

TheWitness commented 3 years ago

There have been some recent changes to the poller logic. You might want to consider updating to the 1.2.x branch. It's pretty stable at the moment. If the issue resolves itself, feel free to close.

Cacti / spine

Issues with polling loop may skip some datasources #172