
[1.2.6] Boost Issues #2962

Closed: eschoeller closed this issue 4 years ago

eschoeller commented 4 years ago

I have encountered similar issues to what other users reported in #2929. I have implemented the suggested code, but I'm still running much slower than I was in 1.2.4.

Here are the results from my most recent forced boost run:

SYSTEM BOOST DETAIL STATS: RRDUpdates:189016, TotalTime:446, get_records:2.34, results_cycle:438.68, rrd_filename_and_template:14.43, rrd_lastupdate:200.77, rrdupdate:210.99, delete:2.75

Here are some other settings (a sketch of pulling these values from the settings table follows the list):

| boost_last_run_time                      | 2019-09-21 22:44:32                      |
| boost_max_output_length                  | 1569128015:0                             |
| boost_next_run_time                      | 2019-09-21 23:14:32                      |
| boost_parallel                           | 4                                        |
| boost_peak_memory                        | 133654568                                |
| boost_png_cache_directory                | /cacti/cacti-1.2.6-prod/cache/boost/     |
| boost_png_cache_enable                   | on                                       |
| boost_poller_mem_limit                   | 2048                                     |
| boost_poller_status                      | complete - end time:2019-09-21 22:51:58  |
| boost_redirect                           | on                                       |
| boost_rrd_update_enable                  | on                                       |
| boost_rrd_update_interval                | 30                                       |
| boost_rrd_update_max_records             | 450000                                   |
| boost_rrd_update_max_records_per_select  | 100000                                   |
| boost_rrd_update_max_runtime             | 1200                                     |
| boost_rrd_update_string_length           | 65535                                    |
| boost_rrd_update_system_enable           | on                                       |
| path_boost_log                           | /var/log/cacti/boost.log                 |
| stats_boost                              | Time:445.9800 RRDUpdates:189016          |
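A minimal sketch of pulling a listing like the one above from Cacti's settings table, assuming the script is dropped into a Cacti install so that include/cli_check.php can bootstrap the database connection (this is only one way the listing could have been produced):

```php
<?php
// Hedged sketch: assumes it runs from Cacti's cli/ directory so that
// include/cli_check.php sets up the database connection.
include(dirname(__FILE__) . '/../include/cli_check.php');

// Pull every boost-related option from Cacti's settings table.
$rows = db_fetch_assoc("SELECT name, value
    FROM settings
    WHERE name LIKE 'boost%' OR name IN ('path_boost_log', 'stats_boost')
    ORDER BY name");

foreach ($rows as $row) {
    printf("| %-40s | %s\n", $row['name'], $row['value']);
}
```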

In 1.2.4 I had boost_rrd_update_string_length set to '2000'. Increasing this value did not improve the boost performance.

I have tried increasing boost_parallel to combat this problem, but no matter what setting I use, only one poller_boost.php process runs, and it only consumes about 44% of one CPU. MySQL isn't consuming many resources either.

Here are some entries from mysql-slow.log during a poller_boost run. I don't think they're that interesting:

# Time: 190921 22:47:54
# User@Host: apache[apache] @ localhost []  Id: 3904064
# Query_time: 4.946821  Lock_time: 0.000321 Rows_sent: 22828  Rows_examined: 1297004
SET timestamp=1569127674;
SELECT did.* FROM data_input_data AS did INNER JOIN data_template_data AS dtd ON did.data_template_data_id=dtd.id INNER JOIN data_local AS dl ON dl.id=dtd.local_data_id INNER JOIN host AS h ON h.id=dl.host_id WHERE h.poller_id = '4';
# Time: 190921 22:48:07
# User@Host: apache[apache] @ localhost []  Id: 3904062
# Query_time: 6.260344  Lock_time: 0.000318 Rows_sent: 21027  Rows_examined: 1297004
SET timestamp=1569127687;
SELECT did.* FROM data_input_data AS did INNER JOIN data_template_data AS dtd ON did.data_template_data_id=dtd.id INNER JOIN data_local AS dl ON dl.id=dtd.local_data_id INNER JOIN host AS h ON h.id=dl.host_id WHERE h.poller_id = '3';
# User@Host: apache[apache] @ localhost []  Id: 3878465
# Query_time: 1.477407  Lock_time: 0.000395 Rows_sent: 0  Rows_examined: 90030
SET timestamp=1569127687;
DELETE FROM poller_output_boost_arch_1569127472 WHERE local_data_id <= 48857;
# Time: 190921 22:51:58
# User@Host: apache[apache] @ localhost []  Id: 3878465
# Query_time: 1.273671  Lock_time: 0.000164 Rows_sent: 0  Rows_examined: 98124
SET timestamp=1569127918;
DELETE FROM poller_output_boost_arch_1569127472 WHERE local_data_id <= 71453;

I am also experiencing the same issue as #2923. I believe these logs should be directed into boost.log. I'm only getting status messages there, which doesn't make very effective use of the log file. Also, in 1.2.4 I would see these types of messages in boost.log:

OK u:1.32 s:1.33 r:37.37
ERROR: /cacti/cacti-1.2.4-prod/rra/XXX.rrd: expected 3 data source readings (got 2) from 1569110073
1569108212
OK u:1.32 s:1.33 r:37.37
OK u:1.32 s:1.33 r:37.37
1569110014
OK u:1.32 s:1.33 r:37.37
OK u:1.32 s:1.33 r:37.37

From what I see in issue #1257, the lack of these messages might be because I don't have debug enabled? Either way, these look like stderr-style messages that might be getting lost, similar to other discussions we've had on the subject in the past.

I'm going to keep poking around with this tonight; unfortunately, I'll be mostly unavailable tomorrow.

eschoeller commented 4 years ago

Yes, the STDOUT/STDERR from rrdtool, which was originally redirected into boost.log, is now lost regardless of which log verbosity options I select.
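For context on how that stream can be preserved: popen() only hands the caller a single stream, while proc_open() lets each of rrdtool's stdin/stdout/stderr be attached to an explicit pipe. A minimal sketch of capturing both output streams and appending them to boost.log (the binary path, RRD path and log destination are placeholders, not Cacti's actual implementation):

```php
<?php
// Sketch of capturing rrdtool's stderr via proc_open() so ERROR lines can be
// appended to boost.log instead of being lost. Paths are placeholders.
$descriptors = array(
    0 => array('pipe', 'r'),  // stdin  - commands are written here
    1 => array('pipe', 'w'),  // stdout - "OK u:... s:... r:..." lines
    2 => array('pipe', 'w'),  // stderr - "ERROR: ..." lines
);

$proc = proc_open('/usr/bin/rrdtool -', $descriptors, $pipes);

if (is_resource($proc)) {
    fwrite($pipes[0], "update /path/to/example.rrd 1569110073:1:2:3\n");
    fclose($pipes[0]);

    $stdout = stream_get_contents($pipes[1]);
    $stderr = stream_get_contents($pipes[2]);
    fclose($pipes[1]);
    fclose($pipes[2]);
    proc_close($proc);

    // Send both streams to the boost log so nothing disappears.
    file_put_contents('/var/log/cacti/boost.log', $stdout . $stderr, FILE_APPEND);
}
```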

eschoeller commented 4 years ago

I knocked boost_rrd_update_max_records down to 250k so it would run more often. Here are some stats (attached graph screenshots: cacti_graph_56748, cacti_graph_56749, cacti_graph_56750, cacti_graph_67991, cacti_graph_67992, cacti_graph_67993, cacti_graph_67994).

(Sadly, I did not have these graphs working while I was running 1.2.4.)

TheWitness commented 4 years ago

Nice graphs. There was another change in 1.2.6 around the use of proc_open() vs. popen(). That should not be an issue, but it should be checked too. @eschoeller, the rrdtool lastupdate call should no longer be required with RRDtool 1.5.x. Someone should look at this.
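For background on the lastupdate point: RRDtool 1.5 added a --skip-past-updates (-s) option to update, which discards samples at or before the file's last timestamp, so the separate lastupdate query that older code used as a guard becomes unnecessary. A hedged sketch of the two approaches (placeholder paths and a hard-coded version string, not Cacti's actual code):

```php
<?php
// Hedged sketch, not Cacti's actual code. With RRDtool >= 1.5 the update
// command can skip past-dated samples itself; older versions need the
// separate lastupdate guard.
$rrdtool_version = '1.7.0';                      // detect via `rrdtool --version` in practice
$rrd_file        = '/path/to/example.rrd';       // placeholder path
$sample          = '1569110073:1:2:3';           // timestamp:value:value:value

if (version_compare($rrdtool_version, '1.5', '>=')) {
    // Let rrdtool silently drop samples older than its last update.
    $cmd = "rrdtool update --skip-past-updates $rrd_file $sample";
} else {
    // Query the last update time and filter stale samples in PHP first.
    $last = (int) trim(shell_exec("rrdtool lastupdate $rrd_file | tail -1 | cut -d: -f1"));
    $cmd  = ((int) strtok($sample, ':') > $last)
        ? "rrdtool update $rrd_file $sample"
        : '';                                    // stale sample, nothing to send
}

if ($cmd != '') {
    shell_exec($cmd);
}
```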

cigamit commented 4 years ago

I've added some debug, and it does indeed look like the culprit is not the fix that was done for boost itself, but the proc_open() fix. Investigating now.
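A sketch of the kind of throwaway per-phase timing that isolates where the time goes (the sample data, paths and rrdtool_run() wrapper are placeholders, not the actual debug added to lib/boost.php):

```php
<?php
// Hedged sketch of per-phase timing debug; placeholders only.
function rrdtool_run($args) {
    return shell_exec('rrdtool ' . $args . ' 2>&1');
}

$updates = array(
    array('rrd_path' => '/path/to/example.rrd', 'sample' => '1569110073:1:2:3'),
);

$timers = array('rrd_lastupdate' => 0.0, 'rrdupdate' => 0.0);

foreach ($updates as $u) {
    $t = microtime(true);
    rrdtool_run('lastupdate ' . $u['rrd_path']);
    $timers['rrd_lastupdate'] += microtime(true) - $t;

    $t = microtime(true);
    rrdtool_run('update ' . $u['rrd_path'] . ' ' . $u['sample']);
    $timers['rrdupdate'] += microtime(true) - $t;
}

print_r($timers);   // compare against the SYSTEM BOOST DETAIL STATS breakdown
```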

cigamit commented 4 years ago

Update lib/boost.php, lib/dsstats.php, lib/rrd.php and poller_boost.php, and let me know the difference.

eschoeller commented 4 years ago

Applied the updates.

I do see these in cacti.log:

 2019/09/22 10:53:08 - BOOST WARNING: Stale Poller Data Found! Item Time:'1569171061', RRD Time:'1569171062' Ignoring Value!
2019/09/22 10:53:08 - BOOST WARNING: Stale Poller Data Found! Item Time:'1569171061', RRD Time:'1569171062' Ignoring Value!
2019/09/22 10:53:08 - BOOST WARNING: Stale Poller Data Found! Item Time:'1569171061', RRD Time:'1569171062' Ignoring Value!
2019/09/22 10:53:08 - BOOST WARNING: Stale Poller Data Found! Item Time:'1569171061', RRD Time:'1569171062' Ignoring Value!

I may just need to flush some tables. I'll keep monitoring, but this is a huge improvement. Thank you! I'm likely going to drop off the radar until tomorrow.

eschoeller commented 4 years ago

I'm actually getting a lot of those Stale Poller Data messages. I don't think flushing the output tables would help. I can say that after every poller run the poller_output table still has data in it. I can't easily cross-reference the output in the table with the Stale Poller Data messages. I think that log line should include the device ID and the data source ID for better troubleshooting.
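To make that suggestion concrete, a hedged sketch of a warning line carrying the data source and device IDs, using Cacti's cacti_log() helper (the variable names and values are placeholders, not the actual lib/boost.php code):

```php
<?php
// Hedged sketch of a more descriptive warning. In boost these values would
// come from the row currently being processed.
$local_data_id = 12345;                 // placeholder data source id
$host_id       = 678;                   // placeholder device id
$item_time     = '1569171061';
$rrd_time      = '1569171062';

// cacti_log() is Cacti's logging helper; 'BOOST' selects the BOOST facility.
cacti_log(sprintf(
    "WARNING: Stale Poller Data Found! Local Data ID:%s, Host ID:%s, " .
    "Item Time:'%s', RRD Time:'%s' Ignoring Value!",
    $local_data_id, $host_id, $item_time, $rrd_time),
    false, 'BOOST');
```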

cigamit commented 4 years ago

Thanks for the feedback. What RRDtool version are you on? I'll fix that 'stale' data bit. It's not an issue past RRDtool 1.5 as we ignore it.

eschoeller commented 4 years ago

I'm on rrdtool 1.7.0. I was thinking about upgrading soon. Does the message indicate data loss?

cigamit commented 4 years ago

No, there is no data loss. Just an old message that can be ignored now. I'll have it fixed shortly.

cigamit commented 4 years ago

Taking that back. Update lib/boost.php and poller_boost.php with the latest updates.

cigamit commented 4 years ago

Boost should be plenty fast now. Especially for large boost tables.

eschoeller commented 4 years ago

Applied. I'll let you know. My plans for today keep getting shifted back. But at some point I am certainly stepping away from the computer ;)

eschoeller commented 4 years ago

Dang, it's taking forever to run again. I might have grabbed the wrong files. Yep, I got the files from the develop branch. Whoops. Fixing...

eschoeller commented 4 years ago

OK, it's back to a 30-second runtime. The stale messages appear to be gone as well.