NagiosEnterprises / nrpe

NRPE Agent
GNU General Public License v2.0

after upgrade from 2.15-1 to 3.2.1-1~bpo9+1 on debian stretch check_bind.sh plugin very slow #209

Open petersutty opened 5 years ago

petersutty commented 5 years ago

Recently we upgraded the nrpe server to 3.2.1-1~bpo9+1, and check_nrpe calls to the check_bind.sh plugin now time out. See the following comparison:

nagios-nrpe-server 2.15-1:

time /usr/lib/nagios/plugins/check_nrpe -H sp1 -c check_bind
Bind9 is running. 8536 successfull requests, 0 referrals, 323 nxdomains since last check. | 'success'=8536 'referral'=0 'nxrrset'=2134 'nxdomain'=323 'recursion'=1136 'failure'=0 'duplicate'=0 'dropped'=0

real 0m0.198s
user 0m0.004s
sys 0m0.000s

nagios-nrpe-server 3.2.1-1~bpo9+1:

time /usr/lib/nagios/plugins/check_nrpe -H sp2 -c check_bind
CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.

real 0m10.010s
user 0m0.004s
sys 0m0.000s

I see there are a lot of changes between these two versions. Might this be related to the (security-related) way the plugin scripts are called?

I see it is hanging in check_bind.sh on this line:
tac $path_stats/named.stats | awk '/--- ([0-9])/{p=1} p{print} /+++ ([0-9])/{p=0;if (count++==1) exit}' > $path_tmp/named.stats.tmp

ps -fe | grep nagios
nagios 2998 1 0 08:50 ? 00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -f
nagios 19845 1 0 10:04 ? 00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -f
nagios 19846 19845 0 10:04 ? 00:00:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -f
nagios 19847 19846 0 10:04 ? 00:00:00 sh -c /usr/lib/nagios/plugins/check_bind
nagios 19848 19847 0 10:04 ? 00:00:00 /bin/sh /usr/lib/nagios/plugins/check_bind
nagios 19854 19848 5 10:04 ? 00:00:01 tac /var/cache/bind/named.stats

Running check_bind.sh directly as user nagios (the nrpe user) on the monitored host, however, is instant:

sp2:~# su - nagios -c "time /usr/lib/nagios/plugins/check_bind"
Bind9 is running. 5 successfull requests, 0 referrals, 0 nxdomains since last check. | 'success'=5 'referral'=0 'nxrrset'=0 'nxdomain'=0 'recursion'=3 'failure'=0 'duplicate'=0 'dropped'=0

real 0m0.143s
user 0m0.012s
sys 0m0.012s

sp2:~# su - nagios -c "time tac /var/cache/bind/named.stats | awk '/--- ([0-9])/{p=1} p{print} /+++ ([0-9])/{p=0;if (count++==1) exit}'" | grep Dump
--- Statistics Dump --- (1557914774)
+++ Statistics Dump +++ (1557914774)
--- Statistics Dump --- (1557914767)
+++ Statistics Dump +++ (1557914767)

real 0m0.005s
user 0m0.000s
sys 0m0.000s
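
For readers who want to reproduce the pipeline without a 1.2G stats file, here is a minimal sketch of the same tac-into-awk pattern on a synthetic file. The function name last_two_dumps, the made-up timestamps, and the "success" counter lines are all hypothetical. The awk patterns are also simplified stand-ins: the expression quoted in this thread appears to have lost characters in rendering, so it is not copied here. The sketch assumes GNU tac is installed.

```shell
#!/bin/sh
# Hypothetical helper: print the two most recent statistics dumps from a
# BIND stats file, newest first. Assumes GNU tac is available.
last_two_dumps() {
    tac "$1" | awk '
        /^--- Statistics Dump ---/ { p = 1 }     # footer of a dump: start printing
        p { print }
        /^[+][+][+] Statistics Dump [+][+][+]/ { # header of a dump: stop printing
            p = 0
            if (count++ == 1) exit               # second header seen: stop reading
        }'
}

# Synthetic three-dump file; timestamps and the counter line are made up.
stats=$(mktemp)
for ts in 100 200 300; do
    printf '%s\n' "+++ Statistics Dump +++ ($ts)" \
                  "success $ts" \
                  "--- Statistics Dump --- ($ts)"
done > "$stats"

# Prints the dumps stamped 300 and 200, in reversed line order.
last_two_dumps "$stats"
rm -f "$stats"
```

The pipeline is fast in an interactive shell because awk exits after the second dump header. With the default signal disposition, tac then normally receives SIGPIPE on its next write and stops long before it has walked the whole file.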

box293 commented 5 years ago

I see it is hanging on check_bind.sh on this line: tac $path_stats/named.stats | awk '/--- ([0-9])/{p=1} p{print} /+++ ([0-9])/{p=0;if (count++==1) exit}' > $path_tmp/named.stats.tmp

Have you tried simplifying this line in the check_bind.sh script? Try removing the $path_stats and $path_tmp variables and replacing them with the actual paths.

Can you please post your nrpe.cfg file along with the check_bind command definition (if it's not defined in the config file).

Do you have the check_bind.sh script available for us to download?

petersutty commented 5 years ago

Simplifying the script did not change anything.

nrpe.cfg:
pid_file=/var/run/nagios/nrpe.pid
server_port=5666
allowed_hosts=10.99.4.11,10.99.4.90
nrpe_user=nagios
nrpe_group=nagios
dont_blame_nrpe=0
debug=0
include_dir=/etc/nagios/conf.d
command[check_bind]=/usr/lib/nagios/plugins/check_bind

The directory /etc/nagios/conf.d is empty.

check_bind.sh is attached; it is the official one from the nagios plugin repositories.

It feels like tac on its own is the time-consuming part, since named.stats is 1.2G, although the same command piped into awk is quick:
tac $path_stats/named.stats

So it looks as though, under nrpe, the whole tac runs to completion before awk takes effect. This is just a feeling.

Note: check_bind.sh did not change, and if I roll back to nrpe version 2.15 the check is instant, so the cause must be something in 3.2.1.
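
The thread does not establish why tac would run to completion under 3.2.1. One hypothesis that could be checked, and it is only an assumption and not something confirmed here, is that the plugin inherits an ignored SIGPIPE from the daemon, so tac is not killed when awk exits. On Linux the SigIgn field of /proc/<pid>/status holds the ignored-signal mask in hexadecimal. The helper below is a hypothetical sketch (the name sigpipe_ignored is made up) that tests whether signal 13 is set in such a mask.

```shell
#!/bin/sh
# Hypothetical helper: succeed if SIGPIPE (signal 13) is set in a hex
# signal mask, e.g. the SigIgn field of /proc/<pid>/status on Linux.
# Assumes the mask has at least four hex digits (the kernel prints 16).
sigpipe_ignored() {
    mask=$1
    # Keep only the last four hex digits to avoid 64-bit overflow in
    # shell arithmetic; signal 13 lives in bit 12 (value 0x1000).
    low=${mask#"${mask%????}"}
    [ $(( 0x$low & 0x1000 )) -ne 0 ]
}

# Example use against a running process (commented out; the PID is an
# assumption and must be replaced with the hung tac or plugin PID):
#   mask=$(awk '/^SigIgn:/ { print $2 }' /proc/19854/status)
#   if sigpipe_ignored "$mask"; then echo "SIGPIPE ignored"; fi
```

If the mask of the hung tac process showed SIGPIPE as ignored under 3.2.1 but not under 2.15, that would support the hypothesis. If not, the cause lies elsewhere.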
