NagiosEnterprises / nrpe

NRPE Agent
GNU General Public License v2.0

NRPE master high load issue #244

Open younity-ENG opened 3 years ago

younity-ENG commented 3 years ago

Hi all,

I'm using Nagios Core 4.4.3 with nagios-nrpe-plugin 3.2.1 on Ubuntu 18.04. It is installed on an AWS EC2 t2.medium instance (2 CPUs, 4 GB RAM). My server is configured with 3 check_workers because of my 2 CPUs. It serves as an "on-site" server with direct host/service checks over VPN and as an NRPE master server. The external commands are mostly ping plus around 4 HTTP/DNS checks. There are around 100 direct services and 350 NRPE services (one host). When I add more NRPE agents (400 services each), the load on the master rises and I get "localhost load" alerts: localhost/Current Load is CRITICAL: CRITICAL - load average: 1.31, 1.48, 4.01
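For reference, the three numbers in that alert are the 1-, 5-, and 15-minute load averages. The following is a minimal Python sketch for reading them relative to the core count, not anything Nagios itself runs. The function names `parse_load` and `per_core` are hypothetical, and the core count of 2 is taken from the t2.medium description above.

```python
from typing import Tuple


def parse_load(alert: str) -> Tuple[float, float, float]:
    """Extract the 1-, 5-, and 15-minute load averages from a check_load style message."""
    values = alert.split("load average:")[1].split(",")
    one, five, fifteen = (float(v) for v in values)
    return one, five, fifteen


def per_core(load: Tuple[float, float, float], cores: int) -> Tuple[float, ...]:
    """Divide each load average by the number of cores (assumed to be known)."""
    return tuple(value / cores for value in load)


if __name__ == "__main__":
    alert = "CRITICAL - load average: 1.31, 1.48, 4.01"
    load = parse_load(alert)
    # 2 cores is an assumption taken from the instance type mentioned in the post.
    for label, value in zip(("1 min", "5 min", "15 min"), per_core(load, cores=2)):
        print(f"{label}: {value:.2f} per core")
```

A per-core value above 1.0 means more runnable work was queued than the cores could service over that window. In this alert only the 15-minute figure is in that range, which would be consistent with a burst that had already subsided when the check ran.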

While monitoring the server with htop, I see that CPU usage repeatedly reaches 100%. I've looked online and found some recommendations, but they didn't really help.

In htop I can see that the CPU spikes occur when external commands are executed.

Does anyone have any idea why my CPU usage is so high? Shouldn't Nagios be able to handle thousands of services (with the right configuration)? I'd appreciate any tips and recommendations.

Thanks

younity-ENG commented 3 years ago

Hi, any ideas regarding this issue?

Thanks

sawolf commented 3 years ago

Hi, thanks for reporting this. Can you elaborate on your current system architecture?

It sounds to me like you're saying you have

I guess my question is - how many of these agents are you adding before you see the CPU load increase?

Also, I recommend increasing the number of check_workers, since those will block on network requests. It may not affect anything, but if any of these plugins take a long time to execute, the worker will just be sleeping for that whole time.
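To illustrate the point about workers sleeping while they wait, here is a minimal Python sketch. It is only an analogy: it uses a thread pool and `time.sleep` in place of Nagios worker processes and real plugins, and the names `fake_network_check` and `run_checks`, the latency, and the check counts are all made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_network_check(latency_s: float) -> str:
    """Stand-in for a plugin that mostly waits on the network (hypothetical)."""
    time.sleep(latency_s)  # the worker is idle here, not burning CPU
    return "OK"


def run_checks(num_workers: int, num_checks: int, latency_s: float) -> float:
    """Run blocking checks on a fixed pool of workers and return the wall time in seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(fake_network_check, [latency_s] * num_checks))
    assert all(result == "OK" for result in results)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Same 12 simulated checks, different pool sizes.
    for workers in (3, 6, 12):
        elapsed = run_checks(workers, num_checks=12, latency_s=0.05)
        print(f"{workers:2d} workers: {elapsed:.2f}s")
```

When the checks spend their time waiting rather than computing, wall time in this sketch drops as the pool grows even though the CPU count is unchanged. If the checks are actually CPU-bound, adding workers would not be expected to help in the same way.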

If you're adding a lot of these agents (so that you have 5000+ services), you might want to look into something like mod_gearman to distribute the work being done.

younity-ENG commented 3 years ago

Hi Sebastian,

Thank you for responding. Basically, the load average starts getting high as soon as I add the second agent. I did increase my hardware resources and the number of workers (4 cores and 6 workers), but it didn't really help. I understand you recommend trying mod_gearman at this scale. I'm not familiar with this module. Does it mean that the remote agent would be mod_gearman and not NRPE?

Thanks

younity-ENG commented 3 years ago

Are you familiar with NRDP?