Cacti / cacti


Enhance boost to consume less I/O for traditional disk drives during boost updates #2441

Closed. jdcoats closed this issue 5 years ago.

jdcoats commented 5 years ago

I cannot figure out why spine runs so long for two polling cycles immediately after boost runs and then returns to normal. It's driving me crazy, causing a few random gaps in graphs. Right after boost completes I have a bunch of slow-log entries like the attached.

Attachments: screenshot (02-21-19, 03:26 PM), slowlog.txt

netniV commented 5 years ago

This would suggest that the MySQL process doesn't have enough resources unless there's a key missing.

jdcoats commented 5 years ago

Attachments: my.cnf.txt, mysql-variables-status.txt

I've been tuning for two weeks and can't seem to find the deficiency in resources. I/O is not extreme during the boost runs, but the poller that started during that run and the poller immediately after it run between 50-65 seconds, then drop back to ~4 seconds.

I did just raise innodb_buffer_pool_instances to 64 as Cacti tech support suggests, even though mysqltuner suggests 16. The same thing still happens.

Attached are the status and variables. I do see in SHOW STATUS that there are a lot of Table_locks_immediate and wonder if that is suspect, since I didn't think InnoDB used many locks, if any at all.

jdcoats commented 5 years ago

I've thrown so many resources at this thing that I can't believe it is a resource issue. It happens with 1-minute polling: if boost happens to run in 40 seconds and spine starts somewhere outside that window, it's okay, but if boost is running when spine starts, then the following two polling cycles are delayed up to a minute each. This includes the spine polling cycle after boost is done, when there is literally no I/O. This should not happen with a poller that normally runs in 4-5 seconds.

There is no extreme I/O happening at this time. This is a new install with a relatively small 12G rra directory. I can watch iotop during the run and never see reads or writes over 50 MB/s, and these disks (not SSDs) will do 150-200 MB/s without causing this. This is new since the upgrade to 1.x, and I just can't get to the bottom of it.

I have set boost to 1 process and 2000 items per pass, every 30 min / 1,000,000 records, which is where it runs best for me since 1.x. It really seems like there is some yet-to-be-identified bug: spine runs faster than ever (4 seconds vs. 30 seconds previously), but it can't handle boost running in parallel. I used to be able to have boost run through multiple polling cycles without spine getting hung up, but not now.

jdcoats commented 5 years ago

Attached is the slow query log for the past couple of hours, keeping in mind that the only time slow queries happen is every 30 minutes, a few seconds after boost has completed, while spine is hung up running slow: slowquerylog.txt

netniV commented 5 years ago

What sizes are your database and tables? Are there other databases on here?

jdcoats commented 5 years ago

Only cacti and syslog.

syslog01 ~ # du -h /var/lib/mysql/cacti/
868M /var/lib/mysql/cacti/
syslog01 ~ # du -h /var/lib/mysql/syslog
11G /var/lib/mysql/syslog

syslog is set to 1 day of retention, so it does not get much larger.

jdcoats commented 5 years ago

This problem exists even if syslogtomysql is disabled, with no input.

jdcoats commented 5 years ago

For example, there was just a run of routerconfigs during a poller cycle, which generates more DB activity than boost and the poller combined, but no poller overrun took place. I just got lucky that boost, even though it ran long, didn't land in the same polling cycle as spine.

cigamit commented 5 years ago

This really looks like disk I/O to me. It's not about the number of MBps written; it's that we are writing lots of data to lots of different files, causing the heads to jump all over the disk and requiring a lot of seek time. Check your I/O wait % in top when boost is writing. This, of course, is where SSDs really shine.

jdcoats commented 5 years ago

So Cacti 1.x only works with SSDs? Spine and boost should be able to run at the same time. What I'm saying is that even after boost completes, the following spine run is held up.


netniV commented 5 years ago

No, I am running on SATA drives, but I do not have the vast amounts of data that you seem to be processing. When you are talking about large volumes of data, you need the speed/power of faster hardware. For hard disks, that means SSDs; for CPU, that means physical rather than virtual.

cigamit commented 5 years ago

It is likely just that: boost is taking I/O away from the physical disks, likely because it is so efficient at writing. That said, I've not seen this problem before on disks formatted with ext3 or even xfs.

I know you have a large number of disks in the array. Can you do a few things for us:

1) Check for a degraded status on your disk array.
2) Describe your format type and layout (xfs, btrfs, extX, block size, etc.).
3) Run "iostat -k 1 device" during boost updates so we can see what your disk is doing.
4) Post the results of the iostat.

jdcoats commented 5 years ago

15K RPM SAS, RAID 6, reiserfs. Attachments: imc2, imc1, disk.txt

I'll get you the iostat on the next boost run.

jdcoats commented 5 years ago

10.9.76.4820190223.log

iostat through a boost run; then I stopped and started it again after boost completed, during which spine ran slow for two polls.

cigamit commented 5 years ago

When you expand top (hit the 1 key) on that device, how many CPUs does it show? Also, that was not 1-second sampling, that was 5- or 10-second intervals, right? Looks like 6 drives, right? That means on a normal slow disk you will get ~420 transactions per second. Get me those clarifications.
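As a very rough sanity check on that ~420 transactions/second figure, here is a sketch using general rule-of-thumb per-spindle IOPS numbers (not measurements from this system) and the 6-disk 15K SAS RAID 6 layout described earlier in the thread:

```php
<?php
// Rule-of-thumb figures only; none of these are measured on this system.
$drives           = 6;
$iops_slow_disk   = 70;   // ballpark random IOPS for a slow 7,200 RPM spindle
$iops_15k_sas     = 180;  // ballpark random IOPS for a 15K RPM SAS spindle
$raid6_write_cost = 6;    // a random write in RAID 6 costs roughly 6 physical I/Os

// ~420 IOPS: six slow spindles doing random reads
printf("slow spindles, random reads: ~%d IOPS\n", $drives * $iops_slow_disk);

// even on 15K SAS, RAID 6 parity overhead pulls random *write* throughput
// back down, and scattered small writes across many RRD files is exactly
// the pattern boost generates
printf("15K SAS RAID 6, random writes: ~%d IOPS\n",
    ($drives * $iops_15k_sas) / $raid6_write_cost);
```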

jdcoats commented 5 years ago

10 CPUs:

top - 12:27:27 up 21:42, 2 users, load average: 1.97, 2.21, 2.65
Tasks: 203 total, 2 running, 201 sleeping, 0 stopped, 0 zombie
%Cpu0 : 1.0 us, 1.4 sy, 0.0 ni, 97.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 1.7 us, 1.7 sy, 0.0 ni, 96.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 54.5 us, 14.0 sy, 0.0 ni, 31.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 8.1 us, 2.7 sy, 0.0 ni, 85.5 id, 3.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 97.3 us, 0.3 sy, 0.0 ni, 2.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu8 : 15.3 us, 10.9 sy, 0.0 ni, 73.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9 : 3.0 us, 1.0 sy, 0.0 ni, 96.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 54974.2 total, 301.8 free, 29889.5 used, 24782.8 buff/cache
MiB Swap: 7640.3 total, 7640.3 free, 0.0 used. 24682.2 avail Mem

No, that was 1-second intervals (iostat -k 1).

cigamit commented 5 years ago

Looks like 1 second, but odd.

cigamit commented 5 years ago

Well, from the looks of it, we just have to idle boost for non-SSD disks (i.e. slow it down). I have some thoughts for a 1.3 release that I can flip you, but it won't make it into the 1.2.2 release (today).

jdcoats commented 5 years ago

Sounds good, I'll definitely try it out. Thanks!

jdcoats commented 5 years ago

By the way, those stats were during a mactrack poller run as well, so that's about as busy as it gets, with boost, mactrack, syslog and spine all writing.

cigamit commented 5 years ago

Okay, so let's do the following:

1) Edit lib/boost.php.
2) Go to ~line 814; you will see a call to boost_rrdtool_function_update.
3) Add a call to usleep(100000) right after that call, which will slow down the updates. There is one more such call a little later; add the usleep there too.

There is a more interesting approach, but in the meantime this will keep your poller happier. Just tweak that number upward until things settle down, and let us know what a good number ended up being for the usleep.
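A minimal sketch of the shape of that edit, for readers following along: the loop structure and the argument list of boost_rrdtool_function_update() below are placeholders, not the actual Cacti source around line 814 of lib/boost.php; only the added usleep() call is the suggested change.

```php
<?php
// Illustrative sketch only: the loop and the elided arguments are not the
// real lib/boost.php code. The change being suggested is the usleep() call
// placed immediately after each RRD update.
foreach ($rrd_updates as $rrd_path => $update_template) {
    // existing call that flushes queued boost output into one RRD file
    boost_rrdtool_function_update(/* ...existing arguments... */);

    // suggested addition: pause 100 ms (usleep takes microseconds) between
    // updates so boost stops issuing back-to-back random writes against the
    // spinning disks; tune the value until the poller settles down
    usleep(100000);
}
```

As the thread below shows, the right value is site-specific (80000 ended up working here), and the follow-up discussion implies a future release could expose this delay as a tunable rather than a hand edit.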

jdcoats commented 5 years ago

Okay, 100000 may be extreme: that took boost from 60 seconds to 30 minutes, and iowait never went above 1%. I'll continue to test different sleep times and let you know. Thanks!
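For rough context, those two runtimes imply how many RRD update calls a single boost pass is making here. A back-of-the-envelope sketch, assuming nearly all of the added runtime is the sleep itself:

```php
<?php
// Back-of-the-envelope only, using the runtimes reported above.
$before = 60;        // boost runtime without the sleep, in seconds
$after  = 30 * 60;   // boost runtime with usleep(100000), in seconds
$sleep  = 0.100;     // 100000 microseconds of sleep per update call

$updates = ($after - $before) / $sleep;   // ~17,400 update calls per pass
printf("~%d RRD update calls per boost pass\n", $updates);

// at usleep(80000) the same pass would add roughly 23 minutes of sleep
printf("~%d s of added sleep at 80 ms per update\n", $updates * 0.080);
```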

jdcoats commented 5 years ago

Okay, preliminary results are that usleep(80000); is best for my setup. I was able to change to 1 hr / 1,500,000 records and get a little separation between runs for now. Attachments: boost2, boost, boost3

netniV commented 5 years ago

Good work!

jdcoats commented 5 years ago

Update: I do not think this issue is completely identified or resolved. We have slowed down boost enough to prove that I/O is not causing the issue itself, but spine still hangs randomly and produces this message. Now, even when boost isn't running, spine will go from 5-second polls to 60+ second polls and overrun. I have now changed spine from 1 minute / 50 processes to 1 minute / 10 processes to try to reduce the hang-ups.

I think there is a bug; it's just that over the years things have been made so much faster by design that they have outpaced the disk speed. I have another server with SSDs that I can migrate to in order to test that, but it will take a lot of work. It really seems like a setup that worked fine for 10 years, was migrated to new hardware 4 years ago, and ran fine on 0.8.8h should be able to perform. I've seen numerous similar posts over the past year and assumed they had been resolved, but I'm still experiencing the same results: instability and gaps in graphs, despite slowing down things that shouldn't have to be slowed.

jdcoats commented 5 years ago

Is it just ironic that in trying to make things faster over the years, they are now too fast for average hardware? :) Filtering the log for output and overrun messages:

2019/02/25 16:40:01 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[H-1S-218-STCK-01 - Traffic] Graphs[H-1S-218-STCK-01-Gi2/0/7-D#A07/1428 CVSICU Nurse Station]

2019/02/25 16:31:01 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 2, DS[N5K3 - Traffic, N5K4 - Traffic] Graphs[N5K3-Ethernet101/1/1-Dell Archive Appliance, N5K4-port-channel50-ESXVDI021]
2019/02/25 16:29:01 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[N5K2 - Traffic] Graphs[N5K2-Ethernet1/13-Row1-Rack5 102]
2019/02/25 16:27:02 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[N5K4 - Traffic] Graphs[N5K4-Ethernet111/1/3-EMC SAN A Datamover 2 Gi0]
2019/02/25 13:47:01 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[M-NEEC-4510-02 - Traffic] Graphs[M-NEEC-4510-02-Gi1/37-D#R4-G48 Siemens Building Automation]
2019/02/25 12:22:02 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[H-1S-218-STCK-01 - Traffic] Graphs[H-1S-218-STCK-01-Gi1/0/17-D#D35/G06 PACS Workstation in Surgery X-]
2019/02/25 12:20:01 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[M-Pulmonary-STCK-01 - Traffic] Graphs[M-Pulmonary-STCK-01-Gi2/0/32-D#H26 155B MA Station]
2019/02/25 12:18:01 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[H-1B-1-STCK-01.TMH.ORG - Traffic] Graphs[H-1B-1-STCK-01.TMH.ORG-Gi2/0/17-D#E29 Blood Bank Phone]
2019/02/25 12:17:02 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[N5K4 - Traffic] Graphs[N5K4-Ethernet112/1/10-VNXCS0]
2019/02/25 12:15:01 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[N5K4 - Traffic] Graphs[N5K4-Ethernet111/1/9-EMC SAN A Control Manager]
2019/02/25 12:13:01 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[N5K4 - Traffic] Graphs[N5K4-port-channel104-Row2-Rack4]
2019/02/25 12:04:02 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[C-3POB-3-STCK-01.TMH.ORG - Traffic] Graphs[C-3POB-3-STCK-01.TMH.ORG-Gi2/0/1-D#D87 Printer P0021]
2019/02/25 11:54:01 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 435, DS[M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic, M-Pulmonary-STCK-01 - Traffic] Graphs[M-Pulmonary-STCK-01-Gi1/0/31-D#H15 204 MD, M-Pulmonary-STCK-01-Gi1/0/48-M-St-James-Ct-ISR4351-01, M-Pulmonary-STCK-01-Gi2/0/33-D#G46 199 PFT Mobile Cart, M-Pulmonary-STCK-01-Gi3/0/14-D#D30 138 Nurse Station E, M-Pulmonary-STCK-01-Gi3/0/27-D#G24 194 Viewing Room, M-Pulmonary-STCK-01-Gi3/0/32-D#H27 155B MA Station, M-Pulmonary-STCK-01-Gi4/0/11-D#C45 132 Exam D-3, M-Pulmonary-STCK-01-Gi4/0/26-D#G12 184 Nurse Station G, M-Pulmonary-STCK-01-Gi4/0/34-D#H45 206E MD, M-Pulmonary-STCK-01-Gi4/0/46-D#K23 WAP Pulmonary-AP09, M-Pulmonary-STCK-01-Gi5/0/3-D#A46 109 Nurse Station B, M-Pulmonary-STCK-01-Gi5/0/4-D#B14 111 Exam B-2, M-Pulmonary-STCK-01-Gi5/0/34-D#H47 206 Nurse Station A, M-Pulmonary-STCK-01-Gi6/0/12-D#D08 138 Nurse Station E, M-Pulmonary-STCK-01-Gi7/0/18-D#E29 157 P.T., M-Pulmonary-STCK-01-Gi7/0/20-D#F02 163 Billing, M-Pulmonary-STCK-01-Gi8/0/8-D#C14 126 Exam C-5, M-Pulmonary-STCK-01-Gi8/0/9-D#C26 129 Nurse Station D, M-Pulmonary-STCK-01-Gi8/0/20-D#F04 164 Office #2], Additional Issues Remain. Only showing first 20
2019/02/25 10:13:01 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[H-4B-1-STCK-01.TMH.ORG - Traffic] Graphs[H-4B-1-STCK-01.TMH.ORG-Gi1/0/8-D#G23 465 Physician Viewing Room]
2019/02/25 09:57:02 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[H-1S-218-STCK-01 - Traffic] Graphs[H-1S-218-STCK-01-Gi1/0/17-D#D35/G06 PACS Workstation in Surgery X-]
2019/02/25 09:51:01 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 2, DS[N5K4 - Traffic, N5K3 - Traffic] Graphs[N5K3-Ethernet114/1/7-Tintri 850 Controller2 Data, N5K4-Ethernet111/1/9-EMC SAN A Control Manager]
2019/02/25 09:49:01 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[N5K3 - Traffic] Graphs[N5K3-Ethernet102/1/11-CCURE1 Simplex/Grinnell Server]
2019/02/25 09:40:02 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[C-3POB-2-STCK-01.TMH.ORG - Traffic] Graphs[C-3POB-2-STCK-01.TMH.ORG-Gi5/0/31-D#E67 B-538 Printer P0066]
2019/02/25 03:36:01 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 1, DS[H-Swisslog-STCK-01.TMH.ORG - Traffic] Graphs[H-Swisslog-STCK-01.TMH.ORG - Traffic - Te2/0/11]

2019/02/25 20:10:01 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate

2019/02/25 18:03:01 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 16:32:01 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 16:28:02 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 16:27:02 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 16:00:01 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 15:21:01 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 13:58:01 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 11:55:01 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 11:54:01 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 09:52:02 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 09:51:01 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 08:51:02 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 07:48:01 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 05:45:01 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 05:44:01 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 03:42:01 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 03:35:01 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate
2019/02/25 01:39:02 - POLLER: Poller[Main Poller] WARNING: There are '1' detected as overrunning a polling cycle, please investigate

jdcoats commented 5 years ago

I've slowed spine down to 1 process and 5 threads to see if it stops the random hang-up.

netniV commented 5 years ago

I think it's less to do with slowing things down and more to do with finding the source of the inter-process block. Obviously, slowing things down will reduce the likelihood of the block.

What timeouts do you have on devices? If you have 3 second timeouts for SNMP and PING, that can be 6 seconds for one device to be considered offline. So, if you end up with 10-15 of them being slow to respond, that can overrun instantly.

Additionally, if you have devices that all respond in about 15 ms for ping and 15 ms for SNMP, then by my calculations (and I'm sure @cigamit will correct me if I'm wrong) that would be approximately 33 devices/second. Over the course of 60 seconds that would be an average of about 2,000 devices. If you think back to our 6-second timeout above, that one timeout would be the equivalent of 198 devices.
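A sketch of that arithmetic for a single polling thread; the per-device latencies and the 6-second timeout are the assumed values from the comment above, not measurements:

```php
<?php
// Reproduces the rough capacity estimate above for one polling thread.
$ping_ms    = 15;                              // assumed ping response per device
$snmp_ms    = 15;                              // assumed SNMP response per device
$per_device = ($ping_ms + $snmp_ms) / 1000;    // 0.03 s per responsive device

$per_second = 1 / $per_device;                 // ~33 devices per second
$per_cycle  = 60 * $per_second;                // ~2,000 devices per 60 s cycle

// one dead device with 3 s ping + 3 s SNMP timeouts costs 6 s of wall clock,
// roughly the same as ~200 responsive devices (the ~198 figure above)
$timeout_cost = 6 / $per_device;

printf("%.0f devices/s, %.0f per 60 s cycle, one 6 s timeout ~ %.0f devices\n",
    $per_second, $per_cycle, $timeout_cost);
```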

netniV commented 5 years ago

Also, have you attempted to monitor CPU/Disk usage via top whilst all this is going on? If there is no block there, then it is either Network or MySQL related.

jdcoats commented 5 years ago

That's easier said than done when spine will run every minute for hours and then all of a sudden hang up for a polling cycle or two. It would be nice if I could make it output individual device polling stats, so maybe I could catch it hanging on a device or get a clue as to what is causing it.

netniV commented 5 years ago

That would be possible, though it would have to be in memory until the end so it could be written in bulk.