danfiscus closed this issue 4 years ago
I have now given the VM 4 GB of memory and set the memory limit in php.ini to unlimited, but I am still running into memory allocation errors on graph creation.
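(For reference, assuming "max resource limit" here means PHP's per-script memory cap, the relevant php.ini directive is `memory_limit`, and `-1` is the value that removes the cap:)

```ini
; php.ini -- a value of -1 removes the per-script memory cap entirely
memory_limit = -1
```

Note that this only lifts PHP's own limit; a runaway script can then consume all of the VM's RAM instead.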
Upgrade your main instance to 1.2.9; it will self-upgrade the Remote Data Collectors. It would be nice to understand where this is happening. Post some stats from your main data collector's Cacti log and the remote data collectors' Cacti logs, so we know the scale of your install. From my perspective, installs on 1.2.x can scale to well over 20k hosts.
Apologies for opening this issue and letting it sit for so long, I'm back in the office now. What stats would be helpful @cigamit ? Also would you prefer to see the logs before I upgrade or after?
Edit: Some basic stats, if they are of use:
If it helps, I also have a problem where my data collection log is full of thousands of `WARNING: Result from SNMP not valid. Partial Result: U` errors from various network interfaces, despite the fact that I updated the data source to make sure it had a maximum value of U, and changed the base network speed to ensure it wasn't just receiving larger values than it expected.
Edit 2: Also, as mentioned in a different issue, this is all running in a VM which had a clean shutdown performed over the weekend, and was turned back on this morning after being off for two days. As of now, about 30 minutes after startup, the performance issues and severe memory usage have not appeared, leading me to lean towards a possible memory leak?
Edit 3:
I enabled logging of statistics and here's the report from the log file:
2020/01/27 13:40:26 - SYSTEM STATS: Time:23.9617 Method:cmd.php Processes:8 Threads:0 Hosts:120 HostsPerProcess:15 DataSources:10046 RRDsProcessed:5389
Okay, I've managed to trigger the performance issue by requesting a re-index of a single switch. Within a few seconds, all my open web sessions had timed out or disconnected, with every single action either doing nothing or leading to a 504 Gateway Error. CPU usage hits 99%, all from a single PHP process, and this is the first time I've managed to capture a log while it was happening, leading to tens of thousands of lines like this:
e"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5426","local_graph_id":"5341","height":"200","width":"700","title_cache":"sw0.idf1.mct.our.network.domain - Broadcast Packets - Vl301","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5427","local_graph_id":"5342","height":"150","width":"600","title_cache":"ups0.idf1.srh.our.network.domain - Amperes","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5428","local_graph_id":"5343","height":"150","width":"600","title_cache":"ups0.idf1.srh.our.network.domain - Amperes","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5429","local_graph_id":"5344","height":"200","width":"700","title_cache":"sw0.idf1.ell.NEW.our.network.domain - Broadcast Packets - Vl302","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5430","local_graph_id":"5345","height":"120","width":"500","title_cache":"sw0.idf1.ell.NEW.our.network.domain - Traffic - Vl1","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5431","local_graph_id":"5346","height":"120","width":"500","title_cache":"sw0.idf1.ell.NEW.our.network.domain - Traffic - Gi1\/0\/1","name":"2 Cisco - Memory 
Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5432","local_graph_id":"5347","height":"120","width":"500","title_cache":"sw0.idf1.mct.our.network.domain - Traffic - Vl1","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5433","local_graph_id":"5348","height":"120","width":"500","title_cache":"sw0.idf1.mct.our.network.domain - Traffic - Gi1\/0\/1","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5434","local_graph_id":"5349","height":"120","width":"500","title_cache":"sw0.idf1.mct.our.network.domain - Traffic - Gi1\/0\/2","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5435","local_graph_id":"5350","height":"200","width":"700","title_cache":"sw0.mdf.fsc.our.network.domain - Broadcast Packets - FSC Mgmt","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5436","local_graph_id":"5351","height":"120","width":"500","title_cache":"sw0.mdf.fsc.our.network.domain - Traffic - Fa1","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5437","local_graph_id":"5352","height":"120","width":"500","title_cache":"sw0.mdf.fsc.our.network.domain - Traffic - Te2\/1","name":"2 Cisco - Memory 
Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5438","local_graph_id":"5353","height":"120","width":"500","title_cache":"sw0.mdf.wri.NEW.our.network.domain - Memory Usage","name":"2 Cisco - Memory Usage"},{"host_id":"129","hostname":"192.123.456.120","description":"sw0.012.ell.our.network.domain","disabled":"","status":"3","host_template_name":"Cisco Switch","id":"5439","local_graph_id":"5354","height":"120","width":"500","title_cache":"sw0.mdf.wri.NEW.our.network.domain - Traffic - Vl1","name":"2 Cisco - Memory Usage"}]
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] create_all_header_nodes() called: Item 5354 matches: 0 items
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] automation_execute_graph_create_tree() Graph[5354], Rule: 5, Parent: 45
2020/01/27 14:16:37 - AUTOM8 NOTE: [PID: 1525] create_graph_node() Graph[5354] Tree Item - Added - id: (176)
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] automation_execute_graph_create_tree() Graph[5354], Rule: 5, Node: 176
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] automation_execute_graph_create_tree() Graph[5354], rule: 6, name: RLH Tree, type: 2
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] get_matching_graphs() called: {"id":"6","name":"RLH Tree","tree_id":"1","tree_item_id":"49","leaf_type":"2","host_grouping_type":"1"} type: 3
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] build_matching_objects_filter() called rule id: 6
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] build_rule_item_filter() called: [{"id":"16","rule_id":"6","rule_type":"3","sequence":"1","operation":"0","field":"h.description","operator":"5","pattern":".rlh"},{"id":"17","rule_id":"6","rule_type":"3","sequence":"2","operation":"2","field":"h.description","operator":"5","pattern":".rlh.our.network.domain"},{"id":"27","rule_id":"6","rule_type":"3","sequence":"3","operation":"2","field":"h.description","operator":"5","pattern":".rlh.NEW.our.network.domain"}], prefix:
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] build_rule_item_filter() returns: `h`.`description` LIKE '%.rlh' OR `h`.`description` LIKE '%.rlh.our.network.domain' OR `h`.`description` LIKE '%.rlh.NEW.our.network.domain'
2020/01/27 14:16:37 - AUTOM8 TRACE [PID: 1525] build_matching_objects_filter() returns: `h`.`description` LIKE '%.rlh' OR `h`.`description` LIKE '%.rlh.our.network.domain' OR `h`.`description` LIKE '%.rlh.NEW.our.network.domain'
2020/01/27 14:17:31 - AUTOM8 TRACE [PID: 1525] get_matching_graphs() returning: SELECT h.id AS host_id, h.hostname, h.description, h.disabled, h.status, ht.name AS host_template_name, gtg.id, gtg.local_graph_id, gtg.height, gtg.width, gtg.title_cache, gt.name FROM graph_local AS gl INNER JOIN graph_templates_graph AS gtg LEFT JOIN graph_templates AS gt ON (gl.graph_template_id=gt.id) LEFT JOIN host AS h ON (gl.host_id=h.id) LEFT JOIN host_template AS ht ON (h.host_template_id=ht.id)WHERE gl.id=gtg.local_graph_id AND `h`.`description` LIKE '%.rlh' OR `h`.`description` LIKE '%.rlh.our.network.domain' OR `h`.`description` LIKE '%.rlh.NEW.our.network.domain' AND gl.id=5354 matches: 1845792
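(Editorial aside: the query in that last trace line looks suspicious. The `OR`'d description patterns are not parenthesized, and since `AND` binds tighter than `OR` in SQL, the `gl.id=5354` restriction effectively applies only to the last pattern, which would explain the filter matching 1,845,792 rows instead of 1. A hedged sketch of the grouping that was presumably intended, not the actual Cacti fix:)

```php
<?php
// Hypothetical sketch: wrap the OR'd patterns in parentheses so they
// cannot escape the surrounding AND conditions. Without parentheses,
// "a AND b OR c" groups as "(a AND b) OR c" and matches far too much.
$patterns = ["%.rlh", "%.rlh.our.network.domain", "%.rlh.NEW.our.network.domain"];

$filter = '(' . implode(' OR ', array_map(
    function ($p) { return "`h`.`description` LIKE '$p'"; },
    $patterns
)) . ')';

// The restriction to a single graph now applies to the whole filter:
$where = "WHERE gl.id = gtg.local_graph_id AND $filter AND gl.id = 5354";
```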
Screengrab from the Linux `top` command from the previous time this happened (not the one I have a log for):
I haven't read through this fully, but a word of warning: by default, PHP only allows a single request to hold a given session at a time. So if you have multiple tabs open, one long-running request will lock all of them up, purely because the others can't obtain the session's sole lock.
We have done some work to reduce what is being done whilst the session state is active, but there is quite a bit that is still stored there so we haven't eliminated it yet.
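(For anyone hitting this: a long-running PHP script can release the session lock early with the standard `session_write_close()` call. A minimal sketch, assuming the script no longer needs to write session data after this point:)

```php
<?php
session_start();

// Read whatever session state this request needs up front.
$user = $_SESSION['user'] ?? null;

// Release the session lock so other requests from the same session
// (e.g. other browser tabs) are no longer blocked behind this one.
session_write_close();

// ...long-running work continues here without holding the lock...
```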
Can I presume you started the reindex via the gui?
I will keep the single-session rule in mind; at the time, I was only using the other window to watch the logs, as my SSH session was being used to monitor memory/CPU usage. To answer your question: yes, I started the re-index from the web GUI, on the page for the device I re-indexed.
I'm not sure that spawns a background process, but it really should do if it doesn't. We do background processes for other intensive stuff like automation, installation, etc.
Would/could re-indexing be a multi-threaded process, in addition to running in the background?
Against a single device, there wouldn't be any point. If it were reindexing more than a couple of devices, or everything, you might want that, because there would be a lot to process.
Removed a previous reply because I was very mistaken about how I had configured something.
Okay, closing then. Thanks.
Describe the bug Most bulk changes in Cacti cause PHP to run out of memory very quickly. I tried to sync the devices to the host templates I have set up; there are 68 devices using the template I am syncing to. When I run the sync, it has no problem reporting when a new graph did not need to be created, but as soon as a graph creation is required, the first attempt to create a graph results in running out of available memory. I have tried increasing the available memory numerous times; currently 1G is allocated in php.ini, and this is the console error I have seen each time (albeit with different values for memory size):
Now, as a disclaimer, I am very new to Cacti and it's possible I have something configured wrong, but regardless, the community here has been incredibly helpful to me before so I thought it was worth reporting in case it's something fixable.
To Reproduce Steps to reproduce the behavior: I have a host template called "Cisco Switch" which, among other things, should create graphs for each network interface on the network switches using the template. Some switches have only one interface reporting, while others have hundreds. When I attempt to Sync Devices to Template, as soon as the function to create a new graph is called, Cacti/PHP runs out of free memory.
Expected behavior I expected 1G of memory to be sufficient for this application, or for the program to automatically free up memory not in use to allow operations to continue.
Screenshots N/A
Desktop (please complete the following information): I have the latest version of Cacti (1.2.8) running in a Red Hat VM, with 2G of total RAM allocated to the VM and 1G set as the memory limit in php.ini. I am using the updated host_templates.php file from the 1.2.x branch because it contains a bugfix for an issue I encountered.
Additional context I am not familiar with PHP or Cacti so please don't hesitate to tell me if this is not a bug but rather a result of poor configuration.