Cacti / spine

Spine C Based Poller for Cacti
GNU Lesser General Public License v2.1

Latest spine (stable and from here) runs at least 2x slower than 1.1.35 (Ubuntu 18.04) and seems not to close MySQL connections #99

Closed: gzivdo closed this issue 5 years ago

gzivdo commented 5 years ago

It's all in the subject. After upgrading, I have:

2019/07/24 22:05:01 - SYSTEM STATS: Time:298.2058 Method:spine Processes:10 Threads:30 Hosts:335 HostsPerProcess:34 DataSources:50171 RRDsProcessed:10436

and many holes in the graphs, and:

2019/07/24 22:05:04 - SPINE: Poller[1] FATAL: Connection Failed, Error:'1040', Message:'Too many connections' [95, Operation not supported] (Spine thread)

When I revert, I typically get:

2019/07/26 18:02:31 - SYSTEM STATS: Time:148.5720 Method:spine Processes:10 Threads:30 Hosts:335 HostsPerProcess:34 DataSources:50171 RRDsProcessed:20565

and only very rarely a timeout.

MySQL allows 3000 connections in my.cnf.
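For reference, this presumably corresponds to the standard `max_connections` setting:

```ini
# my.cnf (fragment)
[mysqld]
max_connections = 3000
```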

So basically I would say the core architecture of Cacti has some design issues. 1) spine should open only one MySQL connection per process, with a separate thread that processes the MySQL requests and communicates with the other threads; MySQL is not that efficient, especially since the actual data is not stored in MySQL anyway. 2) Script queries and other queries should allow bulk data submissions; I suspect spine even uses an SNMP get per interface instead of one walk for everything at once.
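A minimal sketch of the single-connection-per-process pattern being proposed here, assuming libmysqlclient and pthreads; the queue and function names (db_queue_push, db_writer) and the credentials are illustrative placeholders, not spine internals:

```c
/* Poller threads never touch MySQL directly; they enqueue SQL statements and
 * a single writer thread owns the one connection per process. */
#include <mysql/mysql.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct row { char sql[1024]; struct row *next; } row_t;

static row_t *head = NULL, *tail = NULL;
static bool shutting_down = false;  /* a real shutdown path would set this
                                       and broadcast q_cond */
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

/* Called by any poller thread: O(1) enqueue, no MySQL round trip. */
void db_queue_push(const char *sql) {
    row_t *r = calloc(1, sizeof(*r));
    snprintf(r->sql, sizeof(r->sql), "%s", sql);
    pthread_mutex_lock(&q_lock);
    if (tail) tail->next = r; else head = r;
    tail = r;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_lock);
}

/* The only thread in the process that holds a MySQL connection. */
void *db_writer(void *arg) {
    (void)arg;
    MYSQL *conn = mysql_init(NULL);
    if (!mysql_real_connect(conn, "localhost", "cactiuser", "cactipass",
                            "cacti", 0, NULL, 0)) {
        fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
        return NULL;
    }
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (!head && !shutting_down)
            pthread_cond_wait(&q_cond, &q_lock);
        row_t *r = head;
        if (r) { head = r->next; if (!head) tail = NULL; }
        pthread_mutex_unlock(&q_lock);
        if (!r) break;  /* queue drained and shutting down */
        if (mysql_query(conn, r->sql))
            fprintf(stderr, "query failed: %s\n", mysql_error(conn));
        free(r);
    }
    mysql_close(conn);  /* exactly one open/close per process */
    return NULL;
}
```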

netniV commented 5 years ago

Yes, spine does tend to favour a get over a walk, purely because you may only need one OID at a certain level for the index, not everything beneath it, which would be a waste of bandwidth.
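The trade-off being described can be sketched with net-snmp's synchronous API: a single GET fetches exactly the one index OID, where a walk of the same table would return every row beneath it. Host, community, and OID below are placeholders:

```c
#include <net-snmp/net-snmp-config.h>
#include <net-snmp/net-snmp-includes.h>
#include <string.h>

int main(void) {
    netsnmp_session session, *ss;
    netsnmp_pdu *pdu, *response = NULL;
    oid name[MAX_OID_LEN];
    size_t name_len = MAX_OID_LEN;

    init_snmp("get-vs-walk-sketch");
    snmp_sess_init(&session);
    session.peername      = "localhost";
    session.version       = SNMP_VERSION_2c;
    session.community     = (u_char *)"public";
    session.community_len = strlen("public");

    ss  = snmp_open(&session);
    pdu = snmp_pdu_create(SNMP_MSG_GET);  /* one targeted GET, not a walk */
    read_objid("IF-MIB::ifIndex.1", name, &name_len);
    snmp_add_null_var(pdu, name, name_len);

    if (snmp_synch_response(ss, pdu, &response) == STAT_SUCCESS &&
        response && response->errstat == SNMP_ERR_NOERROR) {
        print_variable(response->variables->name,
                       response->variables->name_length,
                       response->variables);
    }
    if (response) snmp_free_pdu(response);
    snmp_close(ss);
    return 0;
}
```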

If you are finding that you have too many connections, then you have likely misconfigured the balance between processes and threads, as spine closes each connection it makes (we tested for memory leaks around that earlier this year and there have been few changes since).
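For a sense of scale, using the figures from the stats lines above: a one-connection-per-poller-thread pattern with Processes:10 and Threads:30 can open on the order of 10 × 30 = 300 simultaneous MySQL connections per cycle, and if a slow cycle overlaps the next one, that roughly doubles. A 3000-connection limit should absorb that, which is why it is worth checking whether MySQL is actually honouring the configured limit.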

Alternatively, as another user found out, just because you BELIEVE that you have configured MySQL to use something, you still need to verify that MySQL actually honours it (they found various settings were being completely ignored).

netniV commented 5 years ago

Additionally, if you follow the link below, you'll see that most of the changes we have made are either cosmetic, legal (copyright notices), or actually free up resources where we weren't doing so before.

If you can spot anything you believe to be wrong with the code, feel free to pop a comment on the appropriate line.

https://github.com/Cacti/spine/compare/release/1.1.35...develop#diff-32f5bec80fb31433871b09021cec741d

gzivdo commented 5 years ago

> Yes, spine does tend to favour a get over a walk, purely because you may only need one OID at a certain level for the index, not everything beneath it, which would be a waste of bandwidth.

This is a very limited view. You need to offer both options, and it may even be possible to auto-detect when a walk is preferable for SNMP. A small example: I have a few scripts that SSH into my device and run a very resource-intensive command whose output contains all the required data in a single run. With the current concept I have to run this CPU-intensive query on my device 30+ times, when it should only run once. So currently I do manual caching in my query script, but this should be done in the Cacti core: query plugins should allow bulk submission.

> If you are finding that you have too many connections, then you have likely misconfigured the balance between processes and threads, as spine closes each connection it makes (we tested for memory leaks around that earlier this year and there have been few changes since). Alternatively, as another user found out, just because you BELIEVE that you have configured MySQL to use something, you still need to verify that MySQL actually honours it (they found various settings were being completely ignored).

Yes, you were right about this: with systemd you need to raise MySQL's open-files limit in the service file (https://stackoverflow.com/questions/22495124/cannot-set-limit-of-mysql-open-files-limit-from-1024-to-65535?answertab=votes#tab-top).
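For anyone else hitting this, the fix from that answer is a systemd drop-in that raises the open-files limit (unit name and path vary by distribution):

```ini
# /etc/systemd/system/mysql.service.d/limits.conf (unit name varies by distro)
[Service]
LimitNOFILE=65535
```

followed by `systemctl daemon-reload` and a restart of MySQL; `SHOW VARIABLES LIKE 'open_files_limit';` shows whether the new limit actually took effect.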

Now I get no "MySQL connection error", but sometimes spine runs for ~1 minute and sometimes it does not finish at all. With the latest spine from git I have:

2019/07/28 20:49:16 - SYSTEM STATS: Time:853.6860 Method:spine Processes:10 Threads:30 Hosts:335 HostsPerProcess:34 DataSources:50171 RRDsProcessed:37194 

2019/07/28 20:40:03 - POLLER: Poller[1] WARNING: Poller Output Table not Empty. Issues: 761, DS[31503, 31329, 31330, 31330, 31330, 31330, 31331, 31331, 31331, 31332, 31332, 31331, 31332, 31332, 31333, 31446, 31446, 31348, 31348, 31349], Additional Issues Remain. Only showing first 20 

2019/07/28 20:38:31 - SYSTEM STATS: Time:508.6262 Method:spine Processes:10 Threads:30 Hosts:335 HostsPerProcess:34 DataSources:50171 RRDsProcessed:20562 

Please also pay attention to bulk data submission for query plugins, and to switching to a walk when at least 1/3 of the SNMP table is being queried.
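A hypothetical shape for that threshold, with illustrative names only (this is not spine code):

```c
/* Prefer a walk once a third or more of the table's rows would be fetched
 * anyway; otherwise keep using targeted gets. */
#include <stddef.h>

typedef enum { USE_GETS, USE_WALK } snmp_strategy_t;

static snmp_strategy_t choose_strategy(size_t oids_requested, size_t table_rows) {
    return (3 * oids_requested >= table_rows) ? USE_WALK : USE_GETS;
}
```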

netniV commented 5 years ago

We have no way of knowing how much data will come back. Someone could have 10 OIDs with thousands of rows, where someone else may have 100 OIDs with an average of 5 rows.

When working with tables, though, I do think it uses a walk, but I'd have to read the code to remind myself, as spine doesn't change much.

If it is something that is resource-intensive, a better choice would be a script server script that uses a cache created on the first element and removed at the last to keep things fresh, or even one that simply expires after a small amount of time.
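One way to build the simpler time-based variant of that cache is a wrapper around the expensive command that refreshes a cache file only when it is older than a TTL, so the 30+ per-field invocations in a cycle all read the same output. A minimal sketch; the command, cache path, and TTL are placeholders, and a Cacti script query can be any executable:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/stat.h>

#define CACHE_FILE "/tmp/devstats.cache"  /* placeholder path */
#define CACHE_TTL  60                     /* seconds; keep <= polling interval */
#define COMMAND    "ssh mydevice 'expensive_stats_dump'"  /* placeholder */

int main(void) {
    struct stat st;

    /* Refresh the cache only when it is missing or stale. */
    if (stat(CACHE_FILE, &st) != 0 || time(NULL) - st.st_mtime > CACHE_TTL) {
        char cmd[512];
        /* Write to a temp file first so concurrent readers never see a
         * half-written cache. */
        snprintf(cmd, sizeof(cmd), "%s > %s.tmp && mv %s.tmp %s",
                 COMMAND, CACHE_FILE, CACHE_FILE, CACHE_FILE);
        if (system(cmd) != 0) {
            fprintf(stderr, "cache refresh failed\n");
            return 1;
        }
    }

    /* Serve the cached output; Cacti parses stdout as usual. */
    FILE *f = fopen(CACHE_FILE, "r");
    if (!f) return 1;
    char line[1024];
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}
```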

cigamit commented 5 years ago

Please update to the latest develop. Unless you are using some of the latest Intel or IBM chips that have well over 100 threads, having 300+ threads is a train wreck. It is better to have multiple collectors and a massive reduction in thread and process counts.
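For a rough illustration, the same 335 hosts split across, say, three collectors each running 4 processes and 8 threads would peak at 4 × 8 = 32 concurrent poller threads per machine, versus 300 contending on a single box.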