Another example, checking filesystems on Linux, where each execution randomly copies the same data into every metric:
> select time,service,performanceLabel,value,warn,crit from metrics where host='atenea' and service='[1][FILESYSTEMS]' and time > now() - 35m
name: metrics
time service performanceLabel value warn crit
---- ------- ---------------- ----- ---- ----
1480678785000000000 [1][FILESYSTEMS] / 12734.1562 22169 24941
1480678785000000000 [1][FILESYSTEMS] /dev/shm 12734.1562 22169 24941
1480678785000000000 [1][FILESYSTEMS] /boot 12734.1562 22169 24941
1480679085000000000 [1][FILESYSTEMS] /dev/shm 0 3246 3652
1480679085000000000 [1][FILESYSTEMS] /boot 0 3246 3652
1480679085000000000 [1][FILESYSTEMS] / 0 3246 3652
1480679385000000000 [1][FILESYSTEMS] /boot 12735.2695 22169 24941
1480679385000000000 [1][FILESYSTEMS] / 12735.2695 22169 24941
1480679385000000000 [1][FILESYSTEMS] /dev/shm 12735.2695 22169 24941
1480679685000000000 [1][FILESYSTEMS] / 0 3246 3652
1480679685000000000 [1][FILESYSTEMS] /dev/shm 0 3246 3652
1480679685000000000 [1][FILESYSTEMS] /boot 0 3246 3652
1480679985000000000 [1][FILESYSTEMS] /dev/shm 0 3246 3652
1480679985000000000 [1][FILESYSTEMS] /boot 0 3246 3652
1480679985000000000 [1][FILESYSTEMS] / 0 3246 3652
1480680285000000000 [1][FILESYSTEMS] /boot 11.5967 79 89
1480680285000000000 [1][FILESYSTEMS] / 11.5967 79 89
1480680285000000000 [1][FILESYSTEMS] /dev/shm 11.5967 79 89
That's odd... that's the first time I've heard of such strange behaviour.
Here is also an example on Thruk with load as the service: https://demo.thruk.org/thruk/#cgi-bin/extinfo.cgi?type=2&host=icinga2&service=load&backend=cacb0#histou_th2/1480777124/1480867124/1 which is nearly the same as yours, except for the underscores, which are not critical.
There is also a debug output in Nagflux which shows which data is sent to InfluxDB; could you post that one too, please?
Hi,
The Nagflux debug output is in the post; if you mean something else, please specify.
Last Friday I also observed that it happens with check_icmp: the pl metric gets the value from rta (compared to the warning threshold it was so small that it looked like a plain zero value).
Hi,
I saw that one, but there is also a debug output which shows the requests sent to InfluxDB. Since the data seems to arrive at Nagflux in proper shape, the question now is where the garbage comes from.
Greetings, Philip
How can I configure such a debug output?
Nagflux doesn't have any command-line options, and log level TRACE only shows the output above.
Maybe this is the output you are asking for?
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_1_min,crit-fill=none,warn-fill=none crit=14.50,value=0.05,warn=5.50 1481096865000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_1_min,crit-fill=none,warn-fill=none warn=5.50,crit=14.50,value=0.05 1481096925000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_1_min,warn-fill=none,crit-fill=none warn=5.50,crit=14.50,value=0.05 1481096985000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_5_min,warn-fill=none,crit-fill=none warn=5.50,crit=14.50,value=0.05 1481096865000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_15_min,warn-fill=none,crit-fill=none value=0.05,warn=5.50,crit=14.50 1481096865000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_5_min,warn-fill=none,crit-fill=none crit=14.50,value=0.05,warn=5.50 1481096925000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_15_min,warn-fill=none,crit-fill=none value=0.05,warn=5.50,crit=14.50 1481096925000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_5_min,warn-fill=none,crit-fill=none value=0.05,warn=5.50,crit=14.50 1481096985000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_15_min,warn-fill=none,crit-fill=none value=0.05,warn=5.50,crit=14.50 1481096985000
2016-12-07 08:50:44 Debug: [ModGearman] map[HOSTNAME:server SERVICEDESC:[1][LOAD] SERVICEPERFDATA:load_1_min=0.20;5.75;14.75 load_5_min=0.11;5.60;14.60 load_15_min=0.06;5.50;14.50 SERVICECHECKCOMMAND:check_load_solaris_linux_by_snmp!5.75,5.60,5.50!14.75,14.60,14.50 SERVICESTATE:0 SERVICESTATETYPE:1
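For what it's worth, the symptom (every performanceLabel carrying the same numbers) is what you would get if all items of one perfdata string were written out from a single shared structure. Here is a minimal Go sketch of the splitting step, purely to illustrate that pitfall; this is not Nagflux's actual code:

package main

import (
	"fmt"
	"strings"
)

func main() {
	// Example perfdata string, copied from the ModGearman debug line above.
	perfdata := "load_1_min=0.20;5.75;14.75 load_5_min=0.11;5.60;14.60 load_15_min=0.06;5.50;14.50"

	var points []map[string]string
	for _, item := range strings.Fields(perfdata) {
		labelAndRest := strings.SplitN(item, "=", 2)
		parts := strings.Split(labelAndRest[1], ";") // value;warn;crit

		// Important: allocate a fresh map for every label. Reusing one map
		// declared outside the loop and appending it each time would make
		// all points alias the same data, so they would all end up with
		// the last label's value/warn/crit.
		fields := map[string]string{
			"performanceLabel": labelAndRest[0],
			"value":            parts[0],
			"warn":             parts[1],
			"crit":             parts[2],
		}
		points = append(points, fields)
	}

	for _, p := range points {
		fmt.Printf("performanceLabel=%s value=%s,warn=%s,crit=%s\n",
			p["performanceLabel"], p["value"], p["warn"], p["crit"])
	}
}

With a fresh map per iteration the three labels keep their own values; appending one shared map three times instead would print the last label's numbers for all three, which is the pattern in the query output above.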
Yes, that's the one I was asking for. But this doesn't look good either...
Is this an OMD installation? It seems you are using mod_gearman as the source. Are you using one queue or multiple? Does Nagflux have access to spoolfiles (from a local Nagios) which might interfere?
I'm using Nagflux from GitHub, no OMD.
My installation is Naemon + Mod-Gearman (latest versions).
I'm posting my config.gcfg in case it helps you track down the problem:
[main]
NagiosSpoolfileFolder = "/var/lib/naemon/spool"
NagiosSpoolfileWorker = 0
InfluxWorker = 2
MaxInfluxWorker = 5
DumpFile = "/var/lib/nagflux/nagflux.dump"
NagfluxSpoolfileFolder = "/var/lib/nagflux/spool"
FieldSeparator = "&"
BufferSize = 100000
[ModGearman "perfdata"]
Enabled = true
Address = "127.0.0.1:4730"
Queue = "grafana"
Secret = "XXXXXXXXXXXXXXX"
Worker = 2
[Influx]
Enabled = true
Version = 0.9
Address = "http://127.0.0.1:8086"
Arguments = "precision=ms&u=nagflux&p=XXXXXXXXXXX&db=perfdata"
CreateDatabaseIfNotExists = true
NastyString = ""
NastyStringToReplace = ""
HostcheckAlias = "hostcheck"
[Livestatus]
Type = "file"
Address = "/var/cache/naemon/live"
MinutesToWait = 2
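To rule out InfluxDB itself, it may also help to write a couple of hand-made points directly to the /write endpoint and check whether they stay distinct; a rough Go sketch (names taken from this thread, ideally run against a scratch database):

package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Two hand-made points with different performanceLabel tags, in the same
	// shape as the Nagflux debug output above (no timestamp, so InfluxDB
	// assigns the current time).
	lines := strings.Join([]string{
		"metrics,host=server,service=[1][LOAD],performanceLabel=load_1_min value=0.20,warn=5.75,crit=14.75",
		"metrics,host=server,service=[1][LOAD],performanceLabel=load_5_min value=0.11,warn=5.60,crit=14.60",
	}, "\n")

	// Add &u=...&p=... if authentication is enabled; point db at a scratch
	// database if you don't want test points in the real perfdata data.
	resp, err := http.Post("http://127.0.0.1:8086/write?db=perfdata",
		"text/plain", strings.NewReader(lines))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("write status:", resp.Status) // expect 204 No Content
}

If those two points come back with their own values when queried, InfluxDB is storing per-label data correctly and the duplication happens before the write.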
OK, I'll have to see how I can reproduce this bug.
I can confirm the bug; I'll try to fix it when I find the time.
I made a new release; try that one, it should be fixed now.
Fix confirmed, thanks!
I'm using Nagflux (v0.2.9) with InfluxDB 1.1.0-1, and I have problems with check_snmpload.pl performance data. While I can see the correct data in Thruk and in the Nagflux logs, when the data is saved into InfluxDB by Nagflux, all load*_min metrics get the same value (also for the warn and crit thresholds).
Thruk shows:
load_1_min=0.08;5.75;14.75 load_5_min=0.08;5.60;14.60 load_15_min=0.05;5.50;14.50
Nagflux shows in its log:
But when I check the InfluxDB database, load_1_min, load_5_min and load_15_min all show the same values (those for load_15_min):
This also happens with check_wmi_plus.pl (mode checkeachcpu), which returns perfdata like:
'Avg Utilisation CPU0'=0.2%;90;95; 'Avg Utilisation CPU1'=0.6%;90;95; 'Avg Utilisation CPU2'=1.9%;90;95; 'Avg Utilisation CPU3'=0.2%;90;95; 'Avg Utilisation CPU_Total'=0.7%;90;95;
Other checks, like check_icmp (returning rta and pl values), save their data correctly, so I guess it's a problem with similar performance labels.
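In case it helps anyone hitting the same thing, here is a rough Go sketch (host, service and database names are just the ones from this thread) that pulls the last value per performanceLabel over the InfluxDB 1.x HTTP API, which makes the duplication easy to spot; add the u and p parameters from config.gcfg if authentication is enabled:

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Latest value per performanceLabel; with the bug present, every series
	// in the answer reports the same number.
	query := "SELECT last(value) FROM metrics WHERE host='server' AND service='[1][LOAD]' GROUP BY performanceLabel"

	params := url.Values{}
	params.Set("db", "perfdata") // database name from config.gcfg above
	params.Set("q", query)
	// params.Set("u", "nagflux") and params.Set("p", "...") if auth is enabled

	resp, err := http.Get("http://127.0.0.1:8086/query?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // one JSON series per performanceLabel
}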