NagiosEnterprises / ncpa

Nagios Cross-Platform Agent

NCPA 2.3.1 Bug Perf Data #764

Open timcanty opened 3 years ago

timcanty commented 3 years ago

Compared to the previous 2.3.0 version, the latest 2.3.1 is not returning the same data.

The nagios check: `check_command check_ncpa!-M 'memory/virtual' -w 80 -c 90`

It is now not returning all of the performance data, so nagios graphing is no longer working for any hosts on the new version. In the screenshot below, the top line is the performance data returned by the latest version and the bottom line is from the previous version. Could this be looked into, please?

(screenshot: perfdata comparison, 2.3.1 on top, 2.3.0 below)

If you need any further information, please let us know.

jomann09 commented 3 years ago

I don't think the change to warn/crit would cause it to stop graphing, although I am not 100% certain on that. The reason the warn/crit values are no longer in the perfdata is that they were wrong.

timcanty commented 3 years ago

Well, the machines on version 2.3.0 are still correctly showing in the graph, while the ones that are no longer displaying in the graph are on 2.3.1, and that is the only noticeable difference in what nagios is being fed, so I do believe the lack of warn/crit is causing the graphs not to update.

Just for confirmation: are you saying that the warn/crit data we are passing is wrong, or that NCPA isn't sending the correct values back?

jomann09 commented 3 years ago

The warning/critical data, the 6 and 7 in your output, are wrong. Those values are supposed to represent the warning or critical threshold of the data being shown, but since the actual warning/critical values you pass to the check are not related to those perfdata fields, those fields should not have a warning/critical number applied to them. I will check the graphing on my system in a little bit.
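To make the difference concrete, Nagios perfdata tokens follow the format `'label'=value[UOM];[warn];[crit];[min];[max]`. A minimal parsing sketch (the labels and values below are illustrative, not taken from the actual check output):

```python
def parse_perfdata(token):
    """Split one Nagios perfdata token into label, value, warn, crit."""
    label, _, data = token.partition("=")
    fields = data.split(";")
    value = fields[0]
    warn = fields[1] if len(fields) > 1 and fields[1] else None
    crit = fields[2] if len(fields) > 2 and fields[2] else None
    return label.strip("'"), value, warn, crit

# 2.3.0-style token: warn/crit thresholds attached to the field
print(parse_perfdata("'available'=5.12GB;6;7"))
# 2.3.1-style token: thresholds omitted for unrelated fields
print(parse_perfdata("'available'=5.12GB"))
```

Graphing backends that were templated against the first shape see two fewer fields per token when given the second, which is what breaks the RRD updates described below.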

timcanty commented 3 years ago

@jomann09 wondering if you had a chance to check your graphing at all?

Guyver1wales commented 3 years ago

I can confirm this issue on all our RHEL/CentOS servers. We noticed this today; we don't have many issues with disk space, so it's not something we need to check regularly, but I can confirm that since updating to 2.3.1, FREE SPACE graphing is broken:

nagiosgraph errors:

```
Fri May 28 08:14:29 2021 insert.pl 10630 error RRDs::update ERR /var/nagios/rrd/MYSERVERNAME/%2Flocalcache%20Partition%20W90%20C95used.rrd: expected 3 data source readings (got 2) from 1622186067
Fri May 28 08:14:29 2021 insert.pl 10630 error ds = [ '/var/nagios/rrd/MYSERVERNAME/%2Flocalcache%20Partition%20W90%20C95used.rrd', '1622186067:0.04' ];
Fri May 28 08:14:29 2021 insert.pl 10630 error RRDs::update ERR /var/nagios/rrd/MYSERVERNAME/%2Flocalcache%20Partition%20W90%20C95free.rrd: expected 3 data source readings (got 2) from 1622186067
Fri May 28 08:14:29 2021 insert.pl 10630 error ds = [ '/var/nagios/rrd/MYSERVERNAME/%2Flocalcache%20Partition%20W90%20C95free.rrd', '1622186067:4.95' ];
Fri May 28 08:14:29 2021 insert.pl 10630 error RRDs::update ERR /var/nagios/rrd/MYSERVERNAME/%2Flocalcache%20Partition%20W90%20C95total.rrd: expected 3 data source readings (got 2) from 1622186067
Fri May 28 08:14:29 2021 insert.pl 10630 error ds = [ '/var/nagios/rrd/MYSERVERNAME/%2Flocalcache%20Partition%20W90%20C95total.rrd',
```
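The "expected 3 data source readings (got 2)" error follows from how these RRDs were created: one `data` source plus `warn` and `crit` sources, so every update must supply three colon-separated readings. A rough sketch of how such an update string is assembled, assuming the three-source layout shown in the `rrdtool info` output later in this thread (`ds[data]`, `ds[warn]`, `ds[crit]`); this is an illustration, not nagiosgraph's actual code:

```python
def build_rrd_update(timestamp, value, warn=None, crit=None):
    """Build an rrdtool update argument like '1622186067:4.95:90:95'.

    An RRD created with data/warn/crit sources needs all three readings.
    When the check stops emitting warn/crit, the update string only has
    two readings and rrdtool rejects it with the error seen above."""
    readings = [str(timestamp), str(value)]
    if warn is not None:
        readings.append(str(warn))
    if crit is not None:
        readings.append(str(crit))
    return ":".join(readings)

print(build_rrd_update(1622186067, 4.95, 90, 95))  # three readings: accepted
print(build_rrd_update(1622186067, 4.95))          # two readings: rejected
```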

yum history:

```
yum history info 239
Loaded plugins: langpacks, product-id, search-disabled-repos, subscription-manager
Transaction ID : 239
Begin time     : Fri Feb 12 09:31:48 2021
Begin rpmdb    : 824:1e06719fe4c76fc4970c115943e13e42543c828d
End time       :            09:32:04 2021 (16 seconds)
End rpmdb      : 824:8b4ba3c92a36e6fb054ed94829e13c084beceba7
User           : root <root>
Return-Code    : Success
Command Line   : -y update --skip-broken
Transaction performed with:
    Installed rpm-4.11.3-45.el7.x86_64                   @rhel-7-server-rpms
    Updated   subscription-manager-1.24.45-1.el7_9.x86_64 @rhel-7-server-rpms
    Installed yum-3.4.3-168.el7.noarch                   @rhel-7-server-rpms
    Installed yum-metadata-parser-1.1.4-10.el7.x86_64    @anaconda/7.0
Packages Altered:
    Updated ncpa-2.3.0-1.el7.x86_64 @nagios-base
    Update       2.3.1-1.el7.x86_64 @nagios-base
history info
```

The graph fails exactly when NCPA is updated: (screenshot: nagiosgraph error)

Same server and partition, READ BYTES KB working as expected: (screenshot)

timcanty commented 3 years ago

> I can also confirm this is not affecting our Windows Servers at all

We are seeing this on our windows servers. Are your windows servers upgraded to 2.3.1? Also, our issue was with 'memory/virtual' rather than disk space.

Guyver1wales commented 3 years ago

> > I can also confirm this is not affecting our Windows Servers at all
>
> We are seeing this on our windows servers. Are your windows servers upgraded to 2.3.1? Also, our issue was with 'memory/virtual' rather than disk space.

Apologies, you are correct; my windows servers are still running an older version, so I have deleted that comment to avoid confusion.

I just checked the same server that has the free space graphing issues, and the memory usage graph is reporting as expected and is working.

timcanty commented 3 years ago

Could you let us know where you were seeing the errors for "nagiosgraph error" that you referred to, so we can review our logs too?

Guyver1wales commented 3 years ago

They are in the nagiosgraph.log file; in my installation this is /var/nagios/nagiosgraph.log.

Check your config files for your location.

Warning: that log file can be enormous (I had to truncate mine as it was over 3GB), so I used `tail nagiosgraph.log -n 100`.

timcanty commented 3 years ago

Did some further digging on this, and it looks like you can get the graphing working again for each host/service, but to do this you have to delete the RRDs for the failing services. This obviously means you lose all the previous data for that service/host.

For example: /usr/local/nagiosgraph/var/rrd/HOSTNAME/MEMORY*.rrd

I think this would be resolvable if NCPA would send through 0s instead of sending nothing where these parameters have been removed. But I'm not sure if that's likely to happen.

ccztux commented 3 years ago

Can anyone please provide the output of some affected rrd files?:

rrdtool info <rrd_file>

Thanks!

Guyver1wales commented 3 years ago

I've just started a week of annual leave, so I can't provide any rrd output for a week.

Guyver1wales commented 3 years ago

here you go:

```
filename = "%2F%20Partition%20W90%20C95___used.rrd"
rrd_version = "0003"
step = 1800
last_update = 1613122090
ds[data].type = "GAUGE"
ds[data].minimal_heartbeat = 3600
ds[data].min = NaN
ds[data].max = NaN
ds[data].last_ds = "3.53"
ds[data].value = 5.9439000000e+03
ds[data].unknown_sec = 0
ds[warn].type = "GAUGE"
ds[warn].minimal_heartbeat = 3600
ds[warn].min = NaN
ds[warn].max = NaN
ds[warn].last_ds = "28"
ds[warn].value = 4.7320000000e+04
ds[warn].unknown_sec = 0
ds[crit].type = "GAUGE"
ds[crit].minimal_heartbeat = 3600
ds[crit].min = NaN
ds[crit].max = NaN
ds[crit].last_ds = "29"
ds[crit].value = 4.9010000000e+04
ds[crit].unknown_sec = 0
rra[0].cf = "AVERAGE"
rra[0].rows = 600
rra[0].cur_row = 24
rra[0].pdp_per_row = 1
rra[0].xff = 5.0000000000e-01
rra[0].cdp_prep[0].value = NaN
rra[0].cdp_prep[0].unknown_datapoints = 0
rra[0].cdp_prep[1].value = NaN
rra[0].cdp_prep[1].unknown_datapoints = 0
rra[0].cdp_prep[2].value = NaN
rra[0].cdp_prep[2].unknown_datapoints = 0
rra[1].cf = "AVERAGE"
rra[1].rows = 700
rra[1].cur_row = 577
rra[1].pdp_per_row = 6
rra[1].xff = 5.0000000000e-01
rra[1].cdp_prep[0].value = 0.0000000000e+00
rra[1].cdp_prep[0].unknown_datapoints = 0
rra[1].cdp_prep[1].value = 0.0000000000e+00
rra[1].cdp_prep[1].unknown_datapoints = 0
rra[1].cdp_prep[2].value = 0.0000000000e+00
rra[1].cdp_prep[2].unknown_datapoints = 0
rra[2].cf = "AVERAGE"
rra[2].rows = 775
rra[2].cur_row = 764
rra[2].pdp_per_row = 24
rra[2].xff = 5.0000000000e-01
rra[2].cdp_prep[0].value = 6.3180000000e+01
rra[2].cdp_prep[0].unknown_datapoints = 0
rra[2].cdp_prep[1].value = 5.0400000000e+02
rra[2].cdp_prep[1].unknown_datapoints = 0
rra[2].cdp_prep[2].value = 5.2200000000e+02
rra[2].cdp_prep[2].unknown_datapoints = 0
rra[3].cf = "AVERAGE"
rra[3].rows = 797
rra[3].cur_row = 24
rra[3].pdp_per_row = 288
rra[3].xff = 5.0000000000e-01
rra[3].cdp_prep[0].value = 7.3160944444e+02
rra[3].cdp_prep[0].unknown_datapoints = 0
rra[3].cdp_prep[1].value = 5.8800000000e+03
rra[3].cdp_prep[1].unknown_datapoints = 0
rra[3].cdp_prep[2].value = 6.0900000000e+03
rra[3].cdp_prep[2].unknown_datapoints = 0
```

Guyver1wales commented 3 years ago

Graph for the same rrd data above: (screenshot)

jomann09 commented 3 years ago

Removing the RRD does fix the issue, because the new perfdata has no values for warning/critical. They should not be 0, because that would still be the wrong value. We could potentially add an option to the check that would fill the warning/critical fields of the perfdata with 0s; that would let you continue using the old RRDs that have bad data, with the warning/critical values zeroed out.
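On the plugin side, that workaround could look roughly like the sketch below. The helper name is hypothetical and this is not the actual check_ncpa implementation; it only illustrates padding tokens so legacy RRDs keep receiving three readings:

```python
def pad_perfdata(token, fill="0"):
    """If a perfdata token lacks warn/crit fields, pad them with a
    placeholder value so RRDs created with data/warn/crit sources
    continue to get three readings per update."""
    label, _, data = token.partition("=")
    fields = data.split(";")
    while len(fields) < 3:
        fields.append(fill)
    return label + "=" + ";".join(fields)

print(pad_perfdata("'available'=5.12GB"))        # warn/crit padded with 0
print(pad_perfdata("'available'=5.12GB;80;90"))  # already complete, unchanged
```

The trade-off is the one noted above: the padded 0s are not meaningful thresholds, they only keep the old RRD field count intact.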

Guyver1wales commented 3 years ago

Can this be fixed WITHOUT losing over two years of graphing data?! I have no desire to lose historical data from the actual nagios GUI and have to ask engineers to check some folder somewhere. I've yet to see anything in this thread saying this will be fixed in the next release.

jomann09 commented 3 years ago

I can create an option in check_ncpa that will put the warn/crit values into those perfdata points for people who want them; that would at least solve the issue for you. It won't be a next-release thing, it would just be an updated check_ncpa.py plugin.