Cacti / cacti

Cacti 1.1.36, recache event changes monitored net-snmp partition of host #1486

Closed namruf15 closed 6 years ago

namruf15 commented 6 years ago

Expected behavior: Cacti correctly measures the size of a mounted partition on a remote host, using the NET-SNMP templates.

Wrong behavior: When a recache event occurs, Cacti changes the monitored partition of the remote host (the OID changes), so Cacti starts monitoring the wrong partition.

Description: Hello, I'm using Cacti 1.1.36 installed on Debian 8. I use it to check the sizes of certain partitions on virtual machines. To do so, I've created a bunch of devices in Cacti, where each device is one VM. Next, I generated a few graphs using the Data Query [Net-SNMP - Get Monitored Partitions].

I've chosen two partitions to monitor (root and /home), but the VM has a few others as well. When a recache event appears in the Cacti logs, Cacti sometimes starts pointing at the wrong partition (it changes the OID). For example, after one recache event Cacti was monitoring the /tmp partition instead of /home. This is very annoying because I'm also using the thold plugin, which sends email alerts to users when thresholds are exceeded; after such a wrong recache event, the plugin checks completely different values, so users receive incorrect email notifications.

Technical information:

Technical Support [Summary]

General Information
Date: Wed, 21 Mar 2018 09:51:22 +0100
Cacti Version: 1.1.36
Cacti OS: unix
NET-SNMP Version: NET-SNMP version: 5.7.2.1
RRDtool Version: RRDtool 1.4.x
Devices: 45
Graphs: 254
Data Sources: Script/Command: 5, SNMP Get: 291, SNMP Query: 161, Script Query - Script Server: 1, Total: 458

Poller Information
Interval: 300
Type: SPINE 1.1.36 Copyright 2004-2017 by The Cacti Group
Items: Action[0]: 613, Action[1]: 5, Action[2]: 2, Total: 620
Concurrent Processes: 1
Max Threads: 5
PHP Servers: 5
Script Timeout: 25
Max OID: 10
Last Run Statistics: Time:2.2583 Method:spine Processes:1 Threads:5 Hosts:45 HostsPerProcess:45 DataSources:620 RRDsProcessed:458

System Memory
MemTotal: 8.00 K MB
MemFree: 5.41 K MB
Buffers: 250.19 MB
Cached: 1.44 K MB
Active: 1.42 K MB
Inactive: 921.70 MB
SwapTotal: 3.81 K MB
SwapFree: 3.81 K MB

PHP Information
PHP Version: 5.6.33-0+deb8u1
PHP OS: Linux
PHP uname: Linux ta-026 3.16.0-5-amd64 #1 SMP Debian 3.16.51-3+deb8u1 (2018-01-08) x86_64
PHP SNMP: Installed
max_execution_time: 30
memory_limit: 128M

netniV commented 6 years ago

What recache method are you using?

namruf15 commented 6 years ago

Well, to be honest, I don't know. I've only created the suggested graphs from the Devices menu and chosen the NET-SNMP mounted partition. I'm using the Spine poller, as stated above.

Sample output from the logs:

03/21/2018 09:00:05 - PCOMMAND Device[5] Device[One_VM] WARNING: Recache Event Detected for Device

How can I check the method type you're asking about?

cigamit commented 6 years ago

What is likely happening is that you are indexing on the index provided by the data query, and that this index changes upon restart. You should do a verbose query and post the contents. You can copy everything to the clipboard from the verbose query results.
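To illustrate the failure mode suggested here, the following is a minimal sketch and not Cacti's actual data query code. The two tables are made-up snapshots of a disk table before and after the agent renumbers its rows, and both helper functions are hypothetical names introduced for this example.

```python
# Hypothetical snapshots: row index -> mount point, before and after renumbering.
before = {1: "/", 2: "/home", 3: "/tmp"}
after = {1: "/", 2: "/tmp", 3: "/home"}


def resolve_by_index(table, index):
    """Follow the stored row index, whatever mount point now sits there."""
    return table.get(index)


def resolve_by_path(table, path):
    """Look the row up again by a stable field (the mount point)."""
    for index, mount_point in table.items():
        if mount_point == path:
            return index
    return None


# A graph that remembered "row 2" silently switches to another partition.
print(resolve_by_index(before, 2), "->", resolve_by_index(after, 2))

# A graph that remembered "/home" follows the partition to its new row.
print(resolve_by_path(before, "/home"), "->", resolve_by_path(after, "/home"))
```

The lookup by a stable field only works when that field is unique across rows, which is why duplicate mount points in the table matter.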

namruf15 commented 6 years ago

@cigamit: Hello

"and that this index changes upon restart"

What restart do you have in mind? I have observed that the OID changes on some (not all) recache events.

"You should do a verbose query and post the contents"

Could you tell me how to run such a verbose query? As I wrote above, I have only added devices to Cacti with the built-in NET-SNMP template and used the provided NET-SNMP get mounted partition data query.

anarkia1976 commented 6 years ago

Actions --> yellow gear: [screenshot]

namruf15 commented 6 years ago

When I click this verbose query gear, a blank page appears:

[screenshot]

One thing I noticed is that the re-index method defaults to "Uptime", which is described as "When the device SNMP uptime go backwards a Re-Index will be performed":

[screenshot]

Can you tell me whether changing this option to, for example, None could be a workaround that disables this recache issue?
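As a rough reading of the "Uptime" description quoted above, the check can be sketched as a comparison of two consecutive uptime readings. This is an assumption about the behaviour based only on that description, not Cacti's or Spine's actual implementation; the function name and the sample readings are hypothetical.

```python
def needs_reindex(previous_ticks, current_ticks):
    """Return True when the uptime counter moved backwards between two polls."""
    return current_ticks < previous_ticks


# Made-up readings in TimeTicks (hundredths of a second).
print(needs_reindex(8_640_000, 30_000))  # counter restarted between polls
print(needs_reindex(30_000, 60_000))     # counter kept growing
```

Under this reading, anything that resets the counter would trigger a re-index. That may include a restart of the SNMP agent rather than a reboot of the whole VM, which is worth checking against the affected devices.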

anarkia1976 commented 6 years ago

Yes, you can disable it and recache manually when you need to. Strange, for me this is the verbose output: [screenshot] There may be a problem on your side.

cigamit commented 6 years ago

You need to check for JavaScript errors that are breaking your page. Go to your browser's developer tools while on the page, and then go to 'Console'. From there, go back to the page and press the re-index button. Look for errors in the 'Console' after you press the verbose button. Post back what you find.

namruf15 commented 6 years ago

OK, I can see some errors related to permission denied. This also happens to me sometimes on other pages (when I click Devices or some other tabs). Refreshing the page resolves the issue. The error is below:

[screenshot]

netniV commented 6 years ago

To me, it almost sounds like a timeout issue. In the log where the PCOMMAND line was displayed, did you also see pairs of lines like these?

Recache for Device, data query <id>
Recache successful.

cigamit commented 6 years ago

The JavaScript issue has been reported all over the internet. It's likely one of your Firefox plugins. You should start disabling them one by one until it goes away. I also saw some people saying that this was a bug in Firefox prior to release 48. Not sure which is your case, but it's definitely browser or browser add-on related.

cigamit commented 6 years ago

By the way, you still have not provided the details we were looking for. After the verbose query, there will be a copy icon at the very right of the verbose output. Click that icon, and your verbose output will be copied to the clipboard. Then paste that output here.

netniV commented 6 years ago

The alternative is to do it from the command line to verify the data is coming back OK. So assuming you are using Net-SNMP - Get Monitored Partitions via snmp v2, it would be:

snmpwalk -c <community> -v 2c <ip> 1.3.6.1.4.1.2021.9.1

This will bring back all the various fields we are interested in from the device. Assuming that returns OK, as @cigamit suggests, we would really need the debug output from the verbose query to see what's being interpreted.

I would also recommend using the above walk command before and after you mount the partitions that cause the issue to see what the differences are.
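One way to do that before/after comparison is sketched below. It assumes each walk was saved as text with one `OID = TYPE: value` row per line; the exact output format depends on the snmpwalk options and loaded MIBs, and the sample rows are made up for illustration. `parse_walk` and `diff_walks` are hypothetical helper names.

```python
def parse_walk(text):
    """Map each OID to its raw value string, skipping lines without ' = '."""
    rows = {}
    for line in text.splitlines():
        oid, sep, value = line.partition(" = ")
        if sep:
            rows[oid.strip()] = value.strip()
    return rows


def diff_walks(before_text, after_text):
    """Return {oid: (old, new)} for every OID whose value differs or is missing."""
    old, new = parse_walk(before_text), parse_walk(after_text)
    return {
        oid: (old.get(oid), new.get(oid))
        for oid in sorted(set(old) | set(new))
        if old.get(oid) != new.get(oid)
    }


# Made-up walk output, not taken from a real device.
before_walk = """\
.1.3.6.1.4.1.2021.9.1.2.1 = STRING: /
.1.3.6.1.4.1.2021.9.1.2.2 = STRING: /home
"""
after_walk = """\
.1.3.6.1.4.1.2021.9.1.2.1 = STRING: /
.1.3.6.1.4.1.2021.9.1.2.2 = STRING: /tmp
.1.3.6.1.4.1.2021.9.1.2.3 = STRING: /home
"""

for oid, (old, new) in diff_walks(before_walk, after_walk).items():
    print(oid, old, "->", new)
```

Any row whose mount point moved to a different OID between the two walks is a candidate for the partition swap described in this issue.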

namruf15 commented 6 years ago

The verbose query for one of the devices for NET-SNMP get mounted partition:

Data Query Debug Information

Total: 0.000000, Delta: 0.000000, Running data query [3].
Total: 0.000000, Delta: 0.000000, Found type = '3' [SNMP Query].
Total: 0.000000, Delta: 0.000000, Found data query XML file at '/var/www/html/resource/snmp_queries/net-snmp_disk.xml'
Total: 0.000000, Delta: 0.000000, XML file parsed ok.
Total: 0.000000, Delta: 0.000000, missing in XML file, 'Index Count Changed' emulated by counting oid_index entries
Total: 0.020000, Delta: 0.020000, Executing SNMP walk for list of indexes @ '.1.3.6.1.4.1.2021.9.1.1' Index Count: 12
Total: 0.020000, Delta: 0.000000, Index found at OID: '.1.3.6.1.4.1.2021.9.1.1.1' value: '1'
Total: 0.020000, Delta: 0.000000, Index found at OID: '.1.3.6.1.4.1.2021.9.1.1.2' value: '2'
Total: 0.020000, Delta: 0.000000, Index found at OID: '.1.3.6.1.4.1.2021.9.1.1.3' value: '3'
Total: 0.020000, Delta: 0.000000, Index found at OID: '.1.3.6.1.4.1.2021.9.1.1.4' value: '4'
Total: 0.020000, Delta: 0.000000, Index found at OID: '.1.3.6.1.4.1.2021.9.1.1.5' value: '5'
Total: 0.020000, Delta: 0.000000, Index found at OID: '.1.3.6.1.4.1.2021.9.1.1.6' value: '6'
Total: 0.020000, Delta: 0.000000, Index found at OID: '.1.3.6.1.4.1.2021.9.1.1.7' value: '7'
Total: 0.020000, Delta: 0.000000, Index found at OID: '.1.3.6.1.4.1.2021.9.1.1.8' value: '8'
Total: 0.020000, Delta: 0.000000, Index found at OID: '.1.3.6.1.4.1.2021.9.1.1.9' value: '9'
Total: 0.020000, Delta: 0.000000, Index found at OID: '.1.3.6.1.4.1.2021.9.1.1.10' value: '10'
Total: 0.020000, Delta: 0.000000, Index found at OID: '.1.3.6.1.4.1.2021.9.1.1.11' value: '11'
Total: 0.020000, Delta: 0.000000, Index found at OID: '.1.3.6.1.4.1.2021.9.1.1.12' value: '12'

Data Query output for field 'dskIndex':

Total: 0.020000, Delta: 0.000000, Located input field 'dskIndex' [walk]
Total: 0.040000, Delta: 0.020000, Executing SNMP walk for data @ '.1.3.6.1.4.1.2021.9.1.1'
Found item [dskIndex='1'] index: 1 [from value]
Found item [dskIndex='2'] index: 2 [from value]
Found item [dskIndex='3'] index: 3 [from value]
Found item [dskIndex='4'] index: 4 [from value]
Found item [dskIndex='5'] index: 5 [from value]
Found item [dskIndex='6'] index: 6 [from value]
Found item [dskIndex='7'] index: 7 [from value]
Found item [dskIndex='8'] index: 8 [from value]
Found item [dskIndex='9'] index: 9 [from value]
Found item [dskIndex='10'] index: 10 [from value]
Found item [dskIndex='11'] index: 11 [from value]
Found item [dskIndex='12'] index: 12 [from value]

Data Query output for field 'dskPath':

Total: 0.040000, Delta: 0.000000, Located input field 'dskPath' [walk]
Total: 0.060000, Delta: 0.020000, Executing SNMP walk for data @ '.1.3.6.1.4.1.2021.9.1.2'
Found item [dskPath='/'] index: 1 [from value]
Found item [dskPath='/var'] index: 2 [from value]
Found item [dskPath='/'] index: 3 [from value]
Found item [dskPath='/run'] index: 4 [from value]
Found item [dskPath='/dev/shm'] index: 5 [from value]
Found item [dskPath='/run/lock'] index: 6 [from value]
Found item [dskPath='/sys/fs/cgroup'] index: 7 [from value]
Found item [dskPath='/boot'] index: 8 [from value]
Found item [dskPath='/opt'] index: 9 [from value]
Found item [dskPath='/home'] index: 10 [from value]
Found item [dskPath='/tmp'] index: 11 [from value]
Found item [dskPath='/media/burak'] index: 12 [from value]

Data Query output for field 'dskDevice':

Total: 0.060000, Delta: 0.000000, Located input field 'dskDevice' [walk]
Total: 0.080000, Delta: 0.020000, Executing SNMP walk for data @ '.1.3.6.1.4.1.2021.9.1.3'
Found item [dskDevice='/dev/dm-0'] index: 1 [from value]
Found item [dskDevice=''] index: 2 [from value]
Found item [dskDevice='/dev/dm-0'] index: 3 [from value]
Found item [dskDevice='tmpfs'] index: 4 [from value]
Found item [dskDevice='tmpfs'] index: 5 [from value]
Found item [dskDevice='tmpfs'] index: 6 [from value]
Found item [dskDevice='tmpfs'] index: 7 [from value]
Found item [dskDevice='/dev/xvda1'] index: 8 [from value]
Found item [dskDevice='/dev/mapper/vg_sys-lv_opt'] index: 9 [from value]
Found item [dskDevice='/dev/mapper/vg_sys-lv_home'] index: 10 [from value]
Found item [dskDevice='/dev/mapper/vg_sys-lv_tmp'] index: 11 [from value]
Found item [dskDevice='//10.42.248.2/IAV_WCDMA'] index: 12 [from value]
Total: 0.080000, Delta: 0.000000, Update data query sort cache complete
Total: 0.080000, Delta: 0.000000, Updated data query index ordering
Total: 0.090000, Delta: 0.000000, Update re-index cache complete
Total: 0.090000, Delta: 0.000000, Update graph data query cache complete
Total: 0.090000, Delta: 0.000000, Update data source data query cache complete
Total: 0.090000, Delta: 0.000000, Update data query cache complete
Total: 0.090000, Delta: 0.010000, Update poller cache from query complete
Total: 0.090000, Delta: 0.000000, Automation execute data query complete
Total: 0.090000, Delta: 0.000000, Plugin hooks complete
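A quick way to spot repeated mount points in output like the above is sketched here. It assumes the `Found item [dskPath='...'] index: N` wording shown in this verbose output; the helper name `duplicate_paths` is hypothetical, and the sample is a subset of the rows posted above.

```python
import re
from collections import defaultdict

# Matches entries such as: Found item [dskPath='/home'] index: 10 [from value]
ITEM = re.compile(r"Found item \[dskPath='([^']*)'\] index: (\d+)")


def duplicate_paths(verbose_output):
    """Return {mount_point: [indexes]} for mount points listed more than once."""
    seen = defaultdict(list)
    for path, index in ITEM.findall(verbose_output):
        seen[path].append(int(index))
    return {path: idxs for path, idxs in seen.items() if len(idxs) > 1}


sample = """
Found item [dskPath='/'] index: 1 [from value]
Found item [dskPath='/var'] index: 2 [from value]
Found item [dskPath='/'] index: 3 [from value]
Found item [dskPath='/home'] index: 10 [from value]
Found item [dskPath='/tmp'] index: 11 [from value]
"""

print(duplicate_paths(sample))  # prints {'/': [1, 3]}
```

The regular expression works on both one-entry-per-line and run-together text, so the output can be pasted as copied from the clipboard.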

netniV commented 6 years ago

Interesting that your root partition seems to be mapped twice. I've not seen that before. So if you managed to run the verbose query on a different device, are you still getting the timeout when running it against the problematic device?

cigamit commented 6 years ago

Yea, that is going to mess up the sorting. Likely a config issue. We can not fix a PEBKAC.

namruf15 commented 6 years ago

@netniV The timeout issue was caused by my browser, so it was an environmental problem not related to SNMP. Can you tell me how I can debug this issue and get rid of the duplicated root partition in the SNMP response? Running df -h returns the expected number of partitions:

user@user:~$ df -h
Filesystem                  Size  Used Avail Use% Mounted on
/dev/dm-0                   9.1G  4.6G  4.1G  53% /
udev                         10M     0   10M   0% /dev
tmpfs                       1.6G  9.2M  1.6G   1% /run
tmpfs                       4.0G   19M  3.9G   1% /dev/shm
tmpfs                       5.0M  4.0K  5.0M   1% /run/lock
tmpfs                       4.0G     0  4.0G   0% /sys/fs/cgroup
/dev/xvda1                  180M   55M  112M  33% /boot
/dev/mapper/vg_sys-lv_opt    44G  2.1G   40G   6% /opt
/dev/mapper/vg_sys-lv_tmp    14G   36M   13G   1% /tmp
/dev/mapper/vg_sys-lv_home   28G   11G   18G  39% /home
//10.45.249.2/SHARE         311T  276T   35T  89% /media/share1
tmpfs                       800M   24K  800M   1% /run/user/1000

netniV commented 6 years ago

OK, so the simple answer to this is that you are asking net-snmp to include it more than once. Now, you personally probably haven't touched any of that. But because you are using the default package, I guarantee that it shows:

disk / 10000
includeAllDisks 10%

Or something similar in /etc/snmp/snmpd.conf (the file may be in a slightly different location on your OS). As soon as I commented out the disk / 10000 line by adding a hash at the front, my issue went away.
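To check many VMs for this combination, something like the sketch below could scan the contents of each snmpd.conf. It only flags explicit `disk` lines that appear alongside `includeAllDisks`; whether that combination produces a duplicate row on every net-snmp version is an assumption based on the report above. The function name is hypothetical, and matching directive names case-insensitively is also an assumption.

```python
def overlapping_disk_directives(conf_text):
    """Return paths from active 'disk' lines if 'includeAllDisks' is also active."""
    disks = []
    include_all = False
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        tokens = line.split()
        directive = tokens[0].lower()
        if directive == "disk" and len(tokens) > 1:
            disks.append(tokens[1])
        elif directive == "includealldisks":
            include_all = True
    return disks if include_all else []


default_conf = "disk / 10000\nincludeAllDisks 10%\n"
fixed_conf = "#disk / 10000\nincludeAllDisks 10%\n"

print(overlapping_disk_directives(default_conf))  # explicit path still listed
print(overlapping_disk_directives(fixed_conf))    # nothing left to flag
```

Reading the file itself is left out on purpose, since the path differs between distributions; pass in the text from wherever the configuration lives.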

namruf15 commented 6 years ago

@netniV: After I change this on all affected VM hosts, should I remove the device and add it to Cacti again? Or do I need to do something else so that Cacti picks up the change (apart from restarting snmpd on the VM, of course)?

netniV commented 6 years ago

The change should just be picked up on the next polling cycle.

namruf15 commented 6 years ago

Interesting, it is working :). I hope the partition won't change again in a few days. If everything is OK, I will mark the issue as closed. Thanks!

namruf15 commented 6 years ago

Hello, it looks like your advice resolved my issue. Thanks for all the help!

netniV commented 6 years ago

Cool. If you can close the issue that would help :)

cigamit commented 6 years ago

I'll go ahead and close. Almost done trolling for the day anyway.