
Possible Memory Leak/Poor Memory Management #3217

Closed danfiscus closed 4 years ago

danfiscus commented 4 years ago

Describe the bug Most bulk changes in Cacti lead to PHP running out of memory very quickly. I tried to sync the devices to the host templates I have set up, and there are 68 devices using the template I am syncing to. When I run the sync, it has no problem reporting when a new graph did not need to be created, but as soon as graph creation is required, the first attempt to create a graph results in running out of available memory. I have tried increasing the memory available numerous times; currently 1G is allocated in php.ini. This is the console error I have seen each time (albeit with different values for memory size):

2020/01/24 14:02:42 - AUTOM8 NOTE: [PID: 1216] automation_execute_graph_template() Device[43] Graph Creation Skipped - Already Exists - Graph[2331] - DS[2426]
2020/01/24 14:02:50 - AUTOM8 NOTE: [PID: 1216] create_graph_node() Graph[5349] Tree Item - Added - id: (167)
2020/01/24 14:02:53 - ERROR PHP ERROR: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 4096 bytes) in file: /var/www/html/cacti/lib/database.php on line: 476
2020/01/24 14:02:53 - CMDPHP PHP ERROR Backtrace:  (CactiShutdownHandler())
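
(Side note for anyone diagnosing the same thing: it can help to confirm which memory_limit actually applies to the failing process, since the CLI and web SAPIs can read different php.ini files. A minimal standalone check, plain PHP with nothing Cacti-specific assumed, might look like this:)

<?php
// memcheck.php - minimal sketch: report the effective memory_limit and this
// process's memory usage. Run it both from the CLI (php memcheck.php) and via
// the web server, since each SAPI may load a different php.ini with a
// different memory_limit.
echo 'memory_limit : ' . ini_get('memory_limit') . PHP_EOL;
echo 'current usage: ' . round(memory_get_usage(true) / 1048576, 1) . ' MB' . PHP_EOL;
echo 'peak usage   : ' . round(memory_get_peak_usage(true) / 1048576, 1) . ' MB' . PHP_EOL;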

Now, as a disclaimer, I am very new to Cacti and it's possible I have something configured wrong, but regardless, the community here has been incredibly helpful to me before so I thought it was worth reporting in case it's something fixable.

To Reproduce Steps to reproduce the behavior: I have a host template called "Cisco Switch" which, among other things, should create graphs for each network interface on the network switches using the template. Some switches have only one interface reporting, while others have hundreds. When I attempt to Sync Devices to Template, as soon as the function to create a new graph is called, Cacti/PHP runs out of free memory.

Expected behavior I expected 1G of memory to be sufficient for this application, or for the program to automatically free up memory not in use to allow operations to continue.

Screenshots N/A

Desktop (please complete the following information): I have the latest version of Cacti (1.2.8) running in a Red Hat VM, with 2G of total RAM allocated to the VM, and the memory limit in php.ini set to 1G. I am using the updated host_templates.php file from the 1.2.x branch because it contains a bugfix for an issue I encountered.

Additional context I am not familiar with PHP or Cacti so please don't hesitate to tell me if this is not a bug but rather a result of poor configuration.

danfiscus commented 4 years ago

I have now given the VM 4GB of memory and set the memory limit in php.ini to unlimited, and I am still running into memory allocation errors on graph creation.
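
(For reference, the php.ini way to express "unlimited" is memory_limit = -1; a per-script override is also possible, as in the sketch below, though it only affects the one script that sets it:)

<?php
// Sketch only: -1 removes PHP's memory cap for this script alone.
// It does not modify php.ini and has no effect on other PHP processes.
ini_set('memory_limit', '-1');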

cigamit commented 4 years ago

Upgrade your main instance to 1.2.9; it will self-upgrade the Remote Data Collectors. It would be nice to understand where this is happening. Post some stats from your main data collector's Cacti log and the remote data collectors' Cacti logs, so we know what the scale of your install is. From my perspective, installs on 1.2.x can scale to well over 20k hosts.

danfiscus commented 4 years ago

Apologies for opening this issue and letting it sit for so long; I'm back in the office now. What stats would be helpful, @cigamit? Also, would you prefer to see the logs before I upgrade or after?

Edit: Some basic stats, if they are of use:

If it helps, I also have a problem where my data collection log is full of thousands of "WARNING: Result from SNMP not valid. Partial Result: U" errors from various network interfaces, despite the fact that I updated the data source to make sure it had a maximum value of U, and changed the base network speed to ensure it wasn't just receiving larger values than it expected.

Edit 2: Also, as mentioned in a different issue, this is all running in a VM which had a clean shutdown performed over the weekend, and was turned back on this morning after being off for two days. As of now, about 30 minutes after startup, the performance issues and severe memory usage have not appeared, leading me to lean towards a possible memory leak?

Edit 3: I enabled logging of statistics and here's the report from the log file:

2020/01/27 13:40:26 - SYSTEM STATS: Time:23.9617 Method:cmd.php Processes:8 Threads:0 Hosts:120 HostsPerProcess:15 DataSources:10046 RRDsProcessed:5389

danfiscus commented 4 years ago

Okay, I've managed to trigger the performance issue by requesting a re-index of a single switch. Within a few seconds, all my open web sessions had timed out or disconnected, with every single action either doing nothing or leading to a 504 Gateway Error. CPU usage hits 99%, all from a single PHP process, and this is the first time I've managed to capture a log while it was happening, which shows tens of thousands of lines like this:

e"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5426","local_graph_id":"5341","height":"200","width":"700","title_cache":"sw0.idf1.mct.our.network.domain - Broadcast Packets - Vl301","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5427","local_graph_id":"5342","height":"150","width":"600","title_cache":"ups0.idf1.srh.our.network.domain - Amperes","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5428","local_graph_id":"5343","height":"150","width":"600","title_cache":"ups0.idf1.srh.our.network.domain - Amperes","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5429","local_graph_id":"5344","height":"200","width":"700","title_cache":"sw0.idf1.ell.NEW.our.network.domain - Broadcast Packets - Vl302","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5430","local_graph_id":"5345","height":"120","width":"500","title_cache":"sw0.idf1.ell.NEW.our.network.domain - Traffic - Vl1","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5431","local_graph_id":"5346","height":"120","width":"500","title_cache":"sw0.idf1.ell.NEW.our.network.domain - Traffic - Gi1\/0\/1","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5432","local_graph_id":"5347","height":"120","width":"500","title_cache":"sw0.idf1.mct.our.network.domain - Traffic - Vl1","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5433","local_graph_id":"5348","height":"120","width":"500","title_cache":"sw0.idf1.mct.our.network.domain - Traffic - Gi1\/0\/1","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5434","local_graph_id":"5349","height":"120","width":"500","title_cache":"sw0.idf1.mct.our.network.domain - Traffic - Gi1\/0\/2","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5435","local_graph_id":"5350","height":"200","width":"700","title_cache":"sw0.mdf.fsc.our.network.domain - Broadcast Packets - FSC Mgmt","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5436","local_graph_id":"5351","height":"120","width":"500","title_cache":"sw0.mdf.fsc.our.network.domain - Traffic - Fa1","name":"2 Cisco - Memory 
Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5437","local_graph_id":"5352","height":"120","width":"500","title_cache":"sw0.mdf.fsc.our.network.domain - Traffic - Te2\/1","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5438","local_graph_id":"5353","height":"120","width":"500","title_cache":"sw0.mdf.wri.NEW.our.network.domain - Memory Usage","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5439","local_graph_id":"5354","height":"120","width":"500","title_cache":"sw0.mdf.wri.NEW.our.network.domain - Traffic - Vl1","name":"2 Cisco - Memory Usage"}]
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] create_all_header_nodes() called: Item 5354 matches: 0 items
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] automation_execute_graph_create_tree() Graph[5354], Rule: 5, Parent: 45
2020/01/27 14:16:37 - AUTOM8 NOTE: [PID: 1525] create_graph_node() Graph[5354] Tree Item - Added - id: (176)
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] automation_execute_graph_create_tree() Graph[5354], Rule: 5, Node: 176
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] automation_execute_graph_create_tree() Graph[5354], rule: 6, name: RLH Tree, type: 2
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] get_matching_graphs() called: {"id":"6","name":"RLH Tree","tree_id":"1","tree_item_id":"49","leaf_type":"2","host_grouping_type":"1"} type: 3
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] build_matching_objects_filter() called rule id: 6
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] build_rule_item_filter() called: [{"id":"16","rule_id":"6","rule_type":"3","sequence":"1","operation":"0","field":"h.description","operator":"5","pattern":".rlh"},{"id":"17","rule_id":"6","rule_type":"3","sequence":"2","operation":"2","field":"h.description","operator":"5","pattern":".rlh.our.network.domain"},{"id":"27","rule_id":"6","rule_type":"3","sequence":"3","operation":"2","field":"h.description","operator":"5","pattern":".rlh.NEW.our.network.domain"}], prefix:
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] build_rule_item_filter() returns:   `h`.`description` LIKE '%.rlh' OR `h`.`description` LIKE '%.rlh.our.network.domain' OR `h`.`description` LIKE '%.rlh.NEW.our.network.domain'
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] build_matching_objects_filter() returns:   `h`.`description` LIKE '%.rlh' OR `h`.`description` LIKE '%.rlh.our.network.domain' OR `h`.`description` LIKE '%.rlh.NEW.our.network.domain'
2020/01/27 14:17:31 - AUTOM8 TRACE [PID: 1525] get_matching_graphs() returning: SELECT h.id AS host_id, h.hostname, h.description, h.disabled,          h.status, ht.name AS host_template_name, gtg.id,      gtg.local_graph_id, gtg.height, gtg.width, gtg.title_cache, gt.name              FROM graph_local AS gl          INNER JOIN graph_templates_graph AS gtg         LEFT JOIN graph_templates AS gt         ON (gl.graph_template_id=gt.id)                LEFT JOIN host AS h             ON (gl.host_id=h.id)            LEFT JOIN host_template AS ht           ON (h.host_template_id=ht.id)WHERE gl.id=gtg.local_graph_id AND   `h`.`description` LIKE '%.rlh' OR `h`.`description` LIKE '%.rlh.our.network.domain' OR `h`.`description` LIKE '%.rlh.NEW.our.network.domain' AND gl.id=5354 matches: 1845792
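
One thing that stands out in that last line is the match count (1845792) combined with the unparenthesised OR chain: in SQL, AND binds more tightly than OR, so the trailing "AND gl.id=5354" attaches only to the last LIKE, and the middle LIKE ends up with no graph or join restriction at all, which would explain how a query about one graph can match 1.8 million rows. A hypothetical sketch of the difference parentheses would make (variable names invented for illustration, not Cacti's actual code):

<?php
// The rule items are OR-joined LIKE clauses, as shown in the trace above.
$rule_filter = "`h`.`description` LIKE '%.rlh' OR `h`.`description` LIKE '%.rlh.our.network.domain'";

// As logged: AND binds tighter than OR, so the OR branches escape the
// gl.id restrictions and the query can match millions of rows.
$where_broken = 'WHERE gl.id = gtg.local_graph_id AND ' . $rule_filter . ' AND gl.id = 5354';

// Parenthesised: the OR group stays inside the AND chain, so the query is
// limited to the single graph being processed.
$where_fixed  = 'WHERE gl.id = gtg.local_graph_id AND (' . $rule_filter . ') AND gl.id = 5354';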

Screengrab from the Linux top command from the previous time this happened (not the one I have a log for): PHP Full CPU

netniV commented 4 years ago

I haven't read through this fully, but a word of warning: by default PHP only allows a single request to hold a given session at a time. So, if you have multiple tabs open, it will lock all of them up while it waits for just one, purely because the others can't obtain the sole session lock.

We have done some work to reduce what is being done whilst the session state is active, but there is quite a bit that is still stored there so we haven't eliminated it yet.
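
For context on that locking behaviour, the standard PHP escape hatch is session_write_close(), which writes and unlocks the session before slow work starts. A generic sketch of the pattern (illustrative only, not a claim about what Cacti currently does):

<?php
// Generic sketch of the session-lock pattern described above.
session_start();                   // acquires the per-session lock
$user = $_SESSION['user'] ?? null; // read anything the request needs up front

session_write_close();             // write the session and release the lock

// Long-running work (a reindex, a large report, ...) can now run without
// holding the session lock, so other tabs on the same session stay responsive.
// Caveat: changes to $_SESSION after this point are not saved unless the
// session is reopened with session_start() again.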

Can I presume you started the reindex via the GUI?

danfiscus commented 4 years ago

I will keep the single-session rule in mind; at the time I was only using the other window to watch the logs, as my SSH session was being used to monitor memory/CPU usage. To answer your question, yes, I started the re-index from the web GUI, on the page for the device I re-indexed.

netniV commented 4 years ago

I'm not sure that spawns a background process, but it really should do if it doesn't. We do background processes for other intensive stuff like automation, installation, etc.

danfiscus commented 4 years ago

Would/could re-indexing be a multi-threaded process, in addition to running in the background?

netniV commented 4 years ago

Against a single device, there wouldn't be any point. If it was reindexing more than a couple or everything, you might want that because there would be a lot to process.

danfiscus commented 4 years ago

Removed a previous reply because I was very mistaken about how I had configured something.

cigamit commented 4 years ago

Okay, closing then. Thanks.