NagiosEnterprises / nagioscore

Nagios Core
GNU General Public License v2.0

Nagios causes high load average and high CPU consuming #934

Closed filosof86 closed 5 months ago

filosof86 commented 7 months ago

Hi all,

On Nagios v4.4.9 I have noticed the issue described in the subject. Nagios consumes a lot of CPU, and I have not found any obvious reason for it.

There are a lot of messages like the following in the logs:

[1697545504] wproc: SERVICE PERFDATA job 2414 from worker Core Worker 986399 timed out after 6.48s
[1697545504] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1697545504] wproc: Core Worker 986399: job 2411 (pid=1030886): Dormant child reaped
[1697545504] wproc: Core Worker 986399: job 2415 (pid=1030915) timed out. Killing it
[1697545504] wproc: SERVICE PERFDATA job 2415 from worker Core Worker 986399 timed out after 6.53s
[1697545504] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1697545504] wproc: Core Worker 986399: job 2416 (pid=1030919) timed out. Killing it
[1697545504] wproc: SERVICE PERFDATA job 2416 from worker Core Worker 986399 timed out after 6.73s

As far as I can see, something is indeed wrong with perfdata processing. If we disable perfdata processing in the Nagios config with process_performance_data=0, Nagios starts working normally.

I tried to remove the custom perf data processor and just configure the perfdata command like the following:

define command {
         command_name                  process-service-perfdata
         command_line                  /bin/true
         }

But it doesn't help. I get the impression that it's something lying under the hood of the process-service-perfdata functionality and it doesn't matter what script/command/whatever it launches.

I've tried enabling Nagios debug logging and ran some 'strace' tests; however, the logs show nothing unusual or incorrect, just normal operation.

However, performance data processing appears to slow Nagios down noticeably and makes it devour system resources.

Could you please help me to sort that out?

P.S. I understand that the version I use is not the latest one, but before updating (which is not a simple process in my case) I need to find out whether this is something that is fixed in the latest version or whether we're hitting another issue.

Many thanks in advance!

everwatch commented 7 months ago

What distribution/version are you running? It looks like it's trying to process a lot of old performance data, and the reaper is coming along and killing the process before it can finish. That just creates more old performance data, and the cycle repeats. When was the last time it worked properly, and what did you change that made it stop working properly?

filosof86 commented 7 months ago

Hi @everwatch

> What distribution/version are you running?

Dist: Oracle Linux 8

> When was the last time it worked properly and what did you change that made it not work properly?

According to the reports I've been given, it worked properly on Nagios 3.0.6; the issues started appearing after the update to Nagios v4 (4.4.9).

> It looks like it's trying to process a lot of old performance data and the reaper is coming along and killing the process before it can finish.

Yes, but it didn't happen before, and the amount of data hasn't changed since then. In addition, AFAIU, performance data processing should have been improved in Nagios 4.x (parallel processing, etc.).

filosof86 commented 7 months ago

Hi,

It looks like something is also incorrect with the custom perfdata processing script. I'll re-check that and get back to you. Thank you.

filosof86 commented 6 months ago

Hi

An update.

With the "Processing Performance Data Using Commands" method, I was getting a high load average and inconsistent results: different commands/scripts could trigger the timeout/LA issues or could work fine. Unfortunately, I cannot say for sure what the reason was.

Thus, I ended up switching perfdata processing to files (the "Writing Performance Data To Files" method). After that, the LA seems to be back to normal, and the timeout issues are gone.
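
For anyone landing here later, a file-based setup could look roughly like the following. This is a minimal sketch, not the exact configuration used in this report: the directive names are those found in the stock nagios.cfg, while the file paths, the 15-second interval, and the process-service-perfdata-file command with its /bin/mv command line are assumptions for illustration only.

```
# nagios.cfg (main config) -- paths and interval below are assumed values
process_performance_data=1

# Per-check command disabled, so no worker job is spawned for every result
#service_perfdata_command=process-service-perfdata

# Nagios appends one line per service check result to this file
service_perfdata_file=/usr/local/nagios/var/service-perfdata
service_perfdata_file_template=[SERVICEPERFDATA]\t$TIMET$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICEEXECUTIONTIME$\t$SERVICELATENCY$\t$SERVICEOUTPUT$\t$SERVICEPERFDATA$
service_perfdata_file_mode=a

# Run the processing command once every 15 seconds instead of once per check
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=process-service-perfdata-file

# Object config -- hypothetical command that moves the file to a spool
# directory, where an external processor can pick it up
define command {
         command_name                  process-service-perfdata-file
         command_line                  /bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/service-perfdata.$TIMET$
         }
```

The point of this layout is that the cost of handling perfdata moves from one external command per check result to one command per interval, which would be consistent with the timeouts and load disappearing after the switch.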

I believe this issue can be closed for now. Thank you.