[question] using OMD with nagflux grafana influxdb

ghost commented 8 years ago

Hello,

I am using OMD with nagflux grafana and influxdb and experiencing very strange problem that any operations regarding log files takes very long time to respond. (i.e. in Thruk clicking Notifications, Alerts, Availability etc.) takes like 400 seconds to list Notifications. When i restart naemon/nagios/icinga (does not matter which core i use) then it is back to normal for 5 minutes and then again it takes 400 seconds or so to list Notifications or Alerts. After I disable either nagflux or influxdb then it is back to normal fast response time. How can it be related? what nagflux does to interfere with logs processing? Any thoughts would be helpful. thanks

Griesbacher commented 8 years ago

Hi, the only connection between Nagflux and the core is Livestatus, to logs(real files) it's only the spoolfiles. I'm not absolutely sure but I think Thruk is not using the logfiles instead it's also using Livestatus. Does go back to normal when you just start the Influxdb?

ghost commented 8 years ago

When i disable nagflux OR influxdb it is OK. I think because no actions are taken with livestatus when one of them (nagflux or influxdb) is down. I am trying to find out how to debug this further, how to approach p.s. i have noticed that when using influxdb+nagflux the core (nagios/naemon) cpu usage is 100% constantly

Griesbacher commented 8 years ago

Nagflux stops automatically when you stop the influxdb.

The CPU usage is strange, we haven't noticed such behaviour so far. But I didn't test it with neamon at all, but if you tried nagios too then it's maybe no problem with the core.

Like I said the only interaction of nagflux is:

Livestatus
Filehandle (spoolfiles: ~/var/pnp4nagios/spool/)
HTTP Influxdb

Is your installation very huge? But that's just a shot in the dark.

jwesterholt commented 8 years ago

Hello,

we also see this problem. After some debuggen/strace we found the reason.

In an instance without any hosts the cpu usage of nagios is about 0,6%. In another instance without host which was used before we have a cpu usage about 30 %.

The strace shows that the archive of nagios is loaded in regular intervals. When deleting the nagios history the cpu usage stays at the normal level.

@Griesbacher: Do you have any idea what is causing this behaviour?

Jonathan

Griesbacher commented 8 years ago

Hi, @jwesterholt So you mean the load is only on nagios and not on nagflux, right? I have never looked deeper into Nagios so far, so I have no real clue. Like I said, the only connection between Nagflux and Nagios is the spoolfile folder and Livestatus. Regarding the spoolfile folder maybe the nagios write operation did not finish, I waited in an older version of Nagflux a few Seconds to avoid any errors but we didn't encounter any problems by removing this wait, here is an old Version of it: https://github.com/Griesbacher/nagflux/commit/b86c8d66cb3c7a8f40a4215da9b69d28ab40c003 If you can build Nagflux from Source you could try to build in this delay, maybe it helps. But that's also a shot in the dark, like I said we have never encountered such problems, so I've no setup to test such behaviour. Regarding Livestatus I have no clue... Philip

jwesterholt commented 8 years ago

Hello,

the load is completely on nagios.

A testsystem is easy to setup if you hav another nagios/omd instance running. I just created a new site and copied the old log files from "/opt/omd/sites/oldsite/var/nagios/archive/" to the new site ("/opt/omd/sites/newsite/var/nagios/archive/"). Do not forget to set the permissions to the site user. After this, start the instance and after a few minutes with nagflux enabled you see the nagios core taking the cpu.

The strace then shows the reading of the logs from the archive directory. After reading the logs the cpu load is slowly going down until it reads the log again after 2 minutes.

When using the older version b86c8d6 everything seems to be fine.

When you provide a patch for the current version i will test it here in my environment.

Edit: I just saw that in this version the timewait is already disabled. I will try to find the time to compile and test some other versions.

Jonathan

Griesbacher commented 8 years ago

We measured the cpu usage of nagios on a pretty big setup and it's not going up, within one hour. Core was nagios(omd). @jwesterholt Due to your hint of the archive and the two minutes I think the problem is not the spoolfile, maybe it's the livestatusquery. The reason for that is, nagflux querys livestatus every two minutes, which (I'm not totally sure) uses the archive to search for log entries. I'm currently using this query(The negation is for icinga2, because there was a bug I'll fix it by time): `GET log Columns: type time contact_name message Filter: type ~ .*NOTIFICATION Filter: time < %d Negate: OutputFormat: csv

` %d is the unixtimestamp two minutes ago. You could try to execute this query on livestatus and if your nagios cpu usage is also going up...

jwesterholt commented 8 years ago

Hello,

the problem is indeed the negation in the query. When using the query above i see the problem with the cpu load.

I have come to two solutions:

Increase the max_cached_messages of Livestatus (Search on https://mathias-kettner.de/checkmk_livestatus.html). This helps because we have a large environment and the test leaves us with around 10.000.000 Messages in the logfile since October 2015. I tested with max_cached_messages=20.000.000 and i also did not have any problem. The only problem is the memory usage (250 byte per cached message)
Use the following query: `GET log Columns: type time contact_name message Filter: type ~ .*NOTIFICATION Filter: time > %d OutputFormat: csv

`

Measurements from the system: OMD[test_nagflux]:~$ time unixcat < test.lql /opt/omd/sites/test_nagflux/tmp/run/live

real 0m30.946s user 0m0.000s sys 0m0.000s OMD[test_nagflux]:~$ time unixcat < test2.lql /opt/omd/sites/test_nagflux/tmp/run/live

real 0m0.003s user 0m0.000s sys 0m0.000s OMD[test_nagflux]:~$ diff -u test.lql test2.lql --- test.lql 2016-05-09 13:54:49.704681923 +0200 +++ test2.lql 2016-05-08 09:59:04.820232742 +0200 @@ -1,7 +1,6 @@ GET log Columns: type time contact_name message Filter: type ~ ._NOTIFICATION -Filter: time < 1462457555 -Negate: +Filter: time > 1462520867 OutputFormat: csv OMD[testnagflux]:~$ wc -l var/nagios/archive/|tail -1 10549067 total

I will compile the newest version of check_mk/nagflux without using negation.

@Griesbacher: What exactly was the reason to negate the query? Is the query abov the on you intended?

Griesbacher commented 8 years ago

@jwesterholt Thanks for testing! The query is that way because of a bug in Icinga2 Livestatus: https://dev.icinga.org/issues/10179 which is still not fixed... and therefore this was a workaround I found but I didn't thought it has that much impact on Nagios(and we never faced this kind of problem).

In my opinion it has to determined which version of Livestatus the system is using, to avoid such problems...

Griesbacher commented 8 years ago

@jwesterholt I added a little workaround in Nagflux, you could try it out by building it from source, if it's working I'll add it to OMD later.

jwesterholt commented 8 years ago

This seems to be working. With the old compilation after two minutes the load goes up. The new version is working fine.

Adding this version to Consol-OMD would be great as my script compiles it from there.

Griesbacher / nagflux

[question] using OMD with nagflux grafana influxdb #3