Griesbacher / nagflux

A connector which copies performancedata from Nagios / Icinga(2) / Naemon to InfluxDB
GNU General Public License v2.0

Nagflux saves data incorrectly into InfluxDB #20

Closed: topinet closed this issue 7 years ago

topinet commented 7 years ago

I'm using Nagflux (v0.2.9) with InfluxDB 1.1.0-1, and I have problems with check_snmpload.pl performance data. While Thruk and the Nagflux logs show the correct data, once Nagflux has written it into InfluxDB, all load*_min metrics end up with the same value (and the same warn and crit thresholds).

Thruk shows: load_1_min=0.08;5.75;14.75 load_5_min=0.08;5.60;14.60 load_15_min=0.05;5.50;14.50

Nagflux shows in its log:

2016-12-02 08:11:37 Debug: [ModGearman] map[SERVICEDESC:[1][LOAD] SERVICEPERFDATA:load_1_min=0.08;5.75;14.75 load_5_min=0.08;5.60;14.60 load_15_min=0.05;5.50;14.50 SERVICECHECKCOMMAND:check_load_solaris_linux_by_snmp!5.75,5.60,5.50!14.75,14.60,14.50 SERVICESTATE:0 SERVICESTATETYPE:1 SERVICEINTERVAL::1.000000 DATATYPE:SERVICEPERFDATA TIMET:1480662697 HOSTNAME:server]

But when I query the InfluxDB database, load_1_min, load_5_min and load_15_min all show the same values (those for load_15_min):

> select time,host,service,performanceLabel,value,warn,crit from metrics where host='server' and command='check_load_solaris_linux_by_snmp' and time = 1480662697000000000
name: metrics
time            host        service     performanceLabel    value   warn    crit
----            ----        -------     ----------------    -----   ----    ----
1480662697000000000 server      [1][LOAD]   load_15_min     0.05    5.5 14.5
1480662697000000000 server      [1][LOAD]   load_5_min      0.05    5.5 14.5
1480662697000000000 server      [1][LOAD]   load_1_min      0.05    5.5 14.5

This also happens with check_wmi_plus.pl (mode checkeachcpu), which returns perfdata like:

'Avg Utilisation CPU0'=0.2%;90;95; 'Avg Utilisation CPU1'=0.6%;90;95; 'Avg Utilisation CPU2'=1.9%;90;95; 'Avg Utilisation CPU3'=0.2%;90;95; 'Avg Utilisation CPU_Total'=0.7%;90;95;

Other checks, like check_icmp (returning rta and pl values), save data correctly, so I guess the problem is related to similar performanceLabel values.
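
For reference, here is a minimal sketch of how such a perfdata string is expected to break down into one value/warn/crit triple per label (which is what Thruk displays). This is only an illustration with a made-up function name, not Nagflux's actual parser:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePerfdata illustrates the expected mapping of a Nagios perfdata
// string like "load_1_min=0.08;5.75;14.75 load_5_min=0.08;5.60;14.60"
// to one (label, value, warn, crit) entry per metric. Note that this
// naive split on whitespace would not handle quoted labels such as
// 'Avg Utilisation CPU0' from check_wmi_plus.pl.
func parsePerfdata(perfdata string) {
	for _, item := range strings.Fields(perfdata) {
		parts := strings.SplitN(item, "=", 2)
		if len(parts) != 2 {
			continue
		}
		label := parts[0]
		fields := strings.Split(parts[1], ";")
		value, _ := strconv.ParseFloat(strings.TrimRight(fields[0], "%"), 64)
		warn, crit := "", ""
		if len(fields) > 1 {
			warn = fields[1]
		}
		if len(fields) > 2 {
			crit = fields[2]
		}
		fmt.Printf("label=%-12s value=%v warn=%s crit=%s\n", label, value, warn, crit)
	}
}

func main() {
	parsePerfdata("load_1_min=0.08;5.75;14.75 load_5_min=0.08;5.60;14.60 load_15_min=0.05;5.50;14.50")
}

Each label keeps its own value and thresholds here, so the identical rows in the query above suggest the mix-up happens somewhere between parsing and the write to InfluxDB.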

topinet commented 7 years ago

Another example, a filesystem check on Linux, where each execution copies the same data into every metric, apparently at random:

> select time,service,performanceLabel,value,warn,crit from metrics where host='atenea' and service='[1][FILESYSTEMS]' and time > now() - 35m
name: metrics
time            service         performanceLabel    value       warn    crit
----            -------         ----------------    -----       ----    ----
1480678785000000000 [1][FILESYSTEMS]    /           12734.1562  22169   24941
1480678785000000000 [1][FILESYSTEMS]    /dev/shm        12734.1562  22169   24941
1480678785000000000 [1][FILESYSTEMS]    /boot           12734.1562  22169   24941
1480679085000000000 [1][FILESYSTEMS]    /dev/shm        0       3246    3652
1480679085000000000 [1][FILESYSTEMS]    /boot           0       3246    3652
1480679085000000000 [1][FILESYSTEMS]    /           0       3246    3652
1480679385000000000 [1][FILESYSTEMS]    /boot           12735.2695  22169   24941
1480679385000000000 [1][FILESYSTEMS]    /           12735.2695  22169   24941
1480679385000000000 [1][FILESYSTEMS]    /dev/shm        12735.2695  22169   24941
1480679685000000000 [1][FILESYSTEMS]    /           0       3246    3652
1480679685000000000 [1][FILESYSTEMS]    /dev/shm        0       3246    3652
1480679685000000000 [1][FILESYSTEMS]    /boot           0       3246    3652
1480679985000000000 [1][FILESYSTEMS]    /dev/shm        0       3246    3652
1480679985000000000 [1][FILESYSTEMS]    /boot           0       3246    3652
1480679985000000000 [1][FILESYSTEMS]    /           0       3246    3652
1480680285000000000 [1][FILESYSTEMS]    /boot           11.5967     79  89
1480680285000000000 [1][FILESYSTEMS]    /           11.5967     79  89
1480680285000000000 [1][FILESYSTEMS]    /dev/shm        11.5967     79  89

Griesbacher commented 7 years ago

That's odd... that's the first time I've heard of such strange behaviour.

Here is also an example on Thruk with load as a service: https://demo.thruk.org/thruk/#cgi-bin/extinfo.cgi?type=2&host=icinga2&service=load&backend=cacb0#histou_th2/1480777124/1480867124/1 It is nearly the same as yours, except for the underscores, which shouldn't matter.

There is also a debug output in Nagflux which shows which data is sent to InfluxDB; could you please post that one too?

topinet commented 7 years ago

Hi,

The Nagflux debug output is already in my first post; if you mean a different one, please specify which.

Last Friday I observed that it also happens with check_icmp: the pl metric gets the value from rta (compared to the warning threshold it was so small that it looked like a plain zero value).

Griesbacher commented 7 years ago

Hi,

I saw that one, but there is also a debug output which shows the requests sent to InfluxDB. Since the data seems to arrive at Nagflux in proper shape, the question now is where the bad values come from.

Greetings, Philip

topinet commented 7 years ago

How can I enable that debug output?

Nagflux doesn't have any command-line options, and log level TRACE only shows the output posted above.

topinet commented 7 years ago

Maybe this is the output you're asking for?

metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_1_min,crit-fill=none,warn-fill=none crit=14.50,value=0.05,warn=5.50 1481096865000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_1_min,crit-fill=none,warn-fill=none warn=5.50,crit=14.50,value=0.05 1481096925000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_1_min,warn-fill=none,crit-fill=none warn=5.50,crit=14.50,value=0.05 1481096985000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_5_min,warn-fill=none,crit-fill=none warn=5.50,crit=14.50,value=0.05 1481096865000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_15_min,warn-fill=none,crit-fill=none value=0.05,warn=5.50,crit=14.50 1481096865000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_5_min,warn-fill=none,crit-fill=none crit=14.50,value=0.05,warn=5.50 1481096925000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_15_min,warn-fill=none,crit-fill=none value=0.05,warn=5.50,crit=14.50 1481096925000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_5_min,warn-fill=none,crit-fill=none value=0.05,warn=5.50,crit=14.50 1481096985000
metrics,host=server,service=[1][LOAD],command=check_load_solaris_linux_by_snmp,performanceLabel=load_15_min,warn-fill=none,crit-fill=none value=0.05,warn=5.50,crit=14.50 1481096985000
2016-12-07 08:50:44 Debug: [ModGearman] map[HOSTNAME:server SERVICEDESC:[1][LOAD] SERVICEPERFDATA:load_1_min=0.20;5.75;14.75 load_5_min=0.11;5.60;14.60 load_15_min=0.06;5.50;14.50 SERVICECHECKCOMMAND:check_load_solaris_linux_by_snmp!5.75,5.60,5.50!14.75,14.60,14.50 SERVICESTATE:0 SERVICESTATETYPE:1
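
The lines above are InfluxDB line protocol: each point differs only in its performanceLabel tag, while the value/warn/crit fields are identical across all three labels, so the values are already wrong before they reach InfluxDB. One generic Go pitfall that produces exactly this symptom is sharing a single variable or pointer across the per-label records while looping over the parsed perfdata. A hypothetical illustration only, not a claim about Nagflux's actual code:

package main

import "fmt"

// perfPoint is a made-up per-label record used only for this illustration.
type perfPoint struct {
	label string
	value *float64 // shared pointer: every point ends up seeing the last value written
}

func main() {
	perf := map[string]float64{
		"load_1_min":  0.08,
		"load_5_min":  0.08,
		"load_15_min": 0.05,
	}
	var points []perfPoint
	var v float64
	for label, value := range perf {
		v = value // the same variable is reused on every iteration
		points = append(points, perfPoint{label: label, value: &v})
	}
	// All points now report whichever value the loop assigned last, and
	// Go's map iteration order is randomized, which would also explain
	// why a different metric "wins" on each check execution.
	for _, p := range points {
		fmt.Println(p.label, *p.value)
	}
}

Whatever the real cause turns out to be, this debug output narrows the problem down to Nagflux itself rather than InfluxDB.
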
Griesbacher commented 7 years ago

Yes, that's the one I was asking for. But this doesn't look good either...

Is this an OMD installation? It seems you are using mod_gearman as the source. Are you using one queue or multiple? Does Nagflux have access to spool files (from a local Nagios) which might interfere?

topinet commented 7 years ago

I'm using Nagflux from GitHub, no OMD.

My installation is Naemon + Mod-Gearman (latest versions).

I'm posting my config.gcfg in case it helps you track down the problem:

[main]
    NagiosSpoolfileFolder = "/var/lib/naemon/spool"
    NagiosSpoolfileWorker = 0
    InfluxWorker = 2
    MaxInfluxWorker = 5
    DumpFile = "/var/lib/nagflux/nagflux.dump"
    NagfluxSpoolfileFolder = "/var/lib/nagflux/spool"
    FieldSeparator = "&"
    BufferSize = 100000

[ModGearman "perfdata"]
    Enabled = true
    Address = "127.0.0.1:4730"
    Queue = "grafana"
    Secret = "XXXXXXXXXXXXXXX"
    Worker = 2

[Influx]
    Enabled = true
    Version = 0.9
    Address = "http://127.0.0.1:8086"
    Arguments = "precision=ms&u=nagflux&p=XXXXXXXXXXX&db=perfdata"
    CreateDatabaseIfNotExists = true
    NastyString = ""
    NastyStringToReplace = ""
    HostcheckAlias = "hostcheck"

[Livestatus]
    Type = "file"
    Address = "/var/cache/naemon/live"
    MinutesToWait = 2
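
As a side note on the [Influx] section: Address and Arguments combine into the InfluxDB 1.x HTTP write endpoint that the line-protocol batches shown earlier are POSTed to, and precision=ms is why the timestamps in that debug output are in milliseconds. A minimal sketch of that URL construction (hypothetical, not Nagflux's actual code):

package main

import "fmt"

func main() {
	// Values taken from the [Influx] section above.
	address := "http://127.0.0.1:8086"
	arguments := "precision=ms&u=nagflux&p=XXXXXXXXXXX&db=perfdata"

	// InfluxDB 1.x accepts line protocol via POST /write?<query arguments>.
	writeURL := fmt.Sprintf("%s/write?%s", address, arguments)
	fmt.Println(writeURL)
	// -> http://127.0.0.1:8086/write?precision=ms&u=nagflux&p=XXXXXXXXXXX&db=perfdata
}
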
Griesbacher commented 7 years ago

OK, I'll have to see how I can reproduce this bug.

Griesbacher commented 7 years ago

I can confirm the bug; I'll try to fix it when I find the time.

Griesbacher commented 7 years ago

I made a new release; try that one, it should be fixed now.

topinet commented 7 years ago

Fix confirmed, thanks!