NagiosEnterprises / nagioscore

Nagios Core
GNU General Public License v2.0
1.56k stars 448 forks source link

nagios: job XX (pid=YY): read() returned error 11 #172

Open archiecobbs opened 8 years ago

archiecobbs commented 8 years ago

After upgrading from 4.1.1 -> 4.2.1, I'm seeing the same problem as described in this thread.

Sep 26 04:21:58 ops nagios: job 508 (pid=26756): read() returned error 11
Sep 26 04:36:58 ops nagios: job 523 (pid=27286): read() returned error 11
Sep 26 04:51:58 ops nagios: job 538 (pid=27801): read() returned error 11
Sep 26 05:06:58 ops nagios: job 553 (pid=28300): read() returned error 11
Sep 26 05:21:58 ops nagios: job 568 (pid=28797): read() returned error 11
Sep 26 05:36:58 ops nagios: job 585 (pid=29372): read() returned error 11
Sep 26 05:51:58 ops nagios: job 599 (pid=29857): read() returned error 11
Sep 26 06:06:58 ops nagios: job 615 (pid=30380): read() returned error 11
Sep 26 06:21:58 ops nagios: job 630 (pid=30902): read() returned error 11
Sep 26 06:36:58 ops nagios: job 645 (pid=31451): read() returned error 11
Sep 26 06:51:58 ops nagios: job 660 (pid=31945): read() returned error 11
Sep 26 07:06:58 ops nagios: job 675 (pid=32456): read() returned error 11
Sep 26 07:21:58 ops nagios: job 691 (pid=541): read() returned error 11
Sep 26 07:36:58 ops nagios: job 706 (pid=1204): read() returned error 11
Sep 26 07:51:58 ops nagios: job 721 (pid=1772): read() returned error 11
Sep 26 08:06:58 ops nagios: job 736 (pid=2586): read() returned error 11
Sep 26 08:21:58 ops nagios: job 752 (pid=3126): read() returned error 11
jfrickson commented 8 years ago

Those lines just say the nagios process got an EAGAIN error while trying to read from the child processes, and had to retry. I will probably stop logging that in some future but want to leave it there for a while. I could change the log level to LOG_NOTICE instead of LOG_ERR, if that would help.

archiecobbs commented 8 years ago

Thanks. It doesn't sound like this is really an error, so it shouldn't be logged as one, correct?

If in fact you are reading from a file descriptor marked for non-blocking, then EAGAIN is completely normal and expected. Why does it need to be logged at all?

jfrickson commented 8 years ago

Some people still occasionally have problems, so I'm logging some info to help debug. It really should never have been LOG_ERR but I copied/pasted and didn't look close enough.

I've changed it to LOG_NOTICE in branch 'maint'.

amgokak commented 6 years ago

I get these messages at every 10 mins, I guess one of my service check is causing this. Any idea how to find which check is causing this error.

Thanks.

hedenface commented 6 years ago

@amgokak what version of Core are you using?

amgokak commented 6 years ago

Nagios core 4.3.4

hedenface commented 6 years ago

Are you also seeing a 'Failed to read()' line in the nagios.log during this time?

hedenface commented 6 years ago

Also, do you know with iobroker method you're using? epoll, poll, or select? Your configure output should give you a clue.

amgokak commented 6 years ago

Yes I see below error Feb 16 14:46:21 testhost nagios: job 188 (pid=12588): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 181 (pid=12517): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 186 (pid=12606): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 189 (pid=12589): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 189 (pid=12589): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 189 (pid=12589): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 185 (pid=12481): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 197 (pid=12650): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 185 (pid=12481): read() returned error 11

Not sure about iobroker method , where can I check it ? I have configured mod_gearman2 broker module

hedenface commented 6 years ago

That is not the error I asked if you saw. I asked if you are also seeing a 'Failed to read()' error in the nagios.log during this same time. From the looks of what you posted, it doesn't seem that way.

Also as stated, you can tell from your configure output. In the directory you compiled nagios from source, there should be a file: lib/iobroker.h

Please run the following command to determine:

cat lib/iobroker.h | grep IOBROKER_USES
amgokak commented 6 years ago

Sorry for late reply, I dont see any error like "Failed to read()" in nagios.log And regarding lib/iobroker.h , I cant find this file may be because I am using standalone Nagios core 4.3.4 and it is installed using rpm package.

knweiss commented 6 years ago

@hedenface Which iobroker is the recommended to prevent the issue?

$ cat lib/iobroker.h | grep IOBROKER_USES
#define IOBROKER_USES_EPOLL 1
/* #undef IOBROKER_USES_POLL */
/* #undef IOBROKER_USES_SELECT */
hedenface commented 6 years ago

@knweiss epoll is likely the best option. If someone is having the problem, and they recompile using

./configure --with-iobroker=poll

or

./configure --with-iobroker=select

And that changes the outcome, I would be very interested in hearing about it.

vkakhnych commented 5 years ago

Having the same issue on different FreeBSD versions (10.4 32 bit and 9.3 64 bit) with nagios 4.2 from packages. But error in my case is 35:

System log example:

Jan 29 12:53:57 netfire nagios: SERVICE NOTIFICATION: nagios;Primergy;FreeRadius server;OK;notify-by-email;FreeRadius is UP Jan 29 12:53:57 netfire nagios: SERVICE NOTIFICATION: smsadmin;Primergy;FreeRadius server;OK;notify-by-epager;FreeRadius is UP Jan 29 12:53:57 netfire nagios: SERVICE ALERT: Primergy;FreeRadius server;OK;HARD;1;FreeRadius is UP Jan 29 12:53:57 netfire nagios: job 193 (pid=39534): read() returned error 35 Jan 29 12:53:57 netfire nagios: job 193 (pid=39533): read() returned error 35 Jan 29 12:53:57 netfire postfix/pickup[26445]: 23ED371E3C5: uid=181 from= Jan 29 12:53:57 netfire postfix/cleanup[39543]: 23ED371E3C5: message-id=20190129105357.23ED371E3C5@netfire Jan 29 12:53:57 netfire postfix/qmgr[16777]: 23ED371E3C5: from=nagios@netfire, size=491, nrcpt=1 (queue active) Jan 29 12:53:57 netfire postfix/pickup[26445]: 2971A71E3C6: uid=181 from= Jan 29 12:53:57 netfire postfix/cleanup[39543]: 2971A71E3C6: message-id=20190129105357.2971A71E3C6@netfire Jan 29 12:53:57 netfire postfix/qmgr[16777]: 2971A71E3C6: from=nagios@netfire, size=430, nrcpt=1 (queue active) Jan 29 12:53:57 netfire postfix/smtp[39547]: 23ED371E3C5: to=nagiosadmin@my.net, relay=mail.my.net[192.168.1.2]:25, delay=0.19, delays=0.05/0.02/0/0.12, Jan 29 12:53:57 netfire postfix/qmgr[16777]: 23ED371E3C5: removed Jan 29 12:53:57 netfire postfix/smtp[39548]: 2971A71E3C6: to=mail2sms@my.net, relay=mail.my.net[192.168.1.2]:25, delay=0.27, delays=0.07/0.02/0/0.1 Jan 29 12:53:57 netfire postfix/qmgr[16777]: 2971A71E3C6: removed

There are no errors in nagios.log. I turned on debug for nagios but there is also nothing interesting. Nagios debug log for the same incident:

[1548759237.010345] [12288.1] [pid=36156] ## Polling 0ms; sockets=6; events=403; iobs=0x28815030 [1548759237.010959] [001.0] [pid=36156] handle_timed_event() start [1548759237.011005] [064.1] [pid=36156] Making callbacks (type 1)... [1548759237.011037] [008.0] [pid=36156] Timed Event Type: EVENT_SERVICE_CHECK, Run Time: Tue Jan 29 12:53:57 2019 [1548759237.011057] [008.0] [pid=36156] Service Check Event ==> Host: 'Primergy', Service: 'FreeRadius server', Options: 0, Latency: 0.008738 sec [1548759237.011079] [001.0] [pid=36156] run_scheduled_service_check() start [1548759237.011096] [016.0] [pid=36156] Attempting to run scheduled check of service 'FreeRadius server' on host 'Primergy': check options=0, latency=0.008738 [1548759237.011117] [001.0] [pid=36156] run_async_service_check() [1548759237.011134] [001.0] [pid=36156] check_service_check_viability() [1548759237.011153] [001.0] [pid=36156] check_time_against_period() [1548759237.011183] [001.0] [pid=36156] _get_matching_timerange() [1548759237.011214] [001.0] [pid=36156] check_service_dependencies() [1548759237.011233] [001.0] [pid=36156] check_service_parents() [1548759237.011256] [064.1] [pid=36156] Making callbacks (type 6)... [1548759237.011274] [016.0] [pid=36156] Checking service 'FreeRadius server' on host 'Primergy'... [1548759237.011331] [001.0] [pid=36156] get_raw_command_line_r() [1548759237.011350] [001.0] [pid=36156] process_macros_r() [1548759237.011367] [2048.1] [pid=36156] ** BEGIN MACRO PROCESSING * [1548759237.011383] [2048.1] [pid=36156] Processing: 'nagios' [1548759237.011403] [2048.1] [pid=36156] Done. Final output: 'nagios' [1548759237.011420] [2048.1] [pid=36156] * END MACRO PROCESSING [1548759237.011437] [001.0] [pid=36156] process_macros_r() [1548759237.011453] [2048.1] [pid=36156] BEGIN MACRO PROCESSING * [1548759237.011469] [2048.1] [pid=36156] Processing: 'nagios' [1548759237.011487] [2048.1] [pid=36156] Done. Final output: 'nagios' [1548759237.011504] [2048.1] [pid=36156] ** END MACRO PROCESSING * [1548759237.011520] [001.0] [pid=36156] process_macros_r() [1548759237.011537] [2048.1] [pid=36156] BEGIN MACRO PROCESSING * [1548759237.011553] [2048.1] [pid=36156] Processing: '/usr/local/etc/nagios/scripts/check_radius.sh $ARG1$ $ARG2$' [1548759237.011579] [2048.1] [pid=36156] Done. Final output: '/usr/local/etc/nagios/scripts/check_radius.sh nagios nagios' [1548759237.011597] [2048.1] [pid=36156] ** END MACRO PROCESSING * [1548759237.011621] [064.1] [pid=36156] Making callbacks (type 6)... [1548759237.011644] [16384.0] [pid=36156] create_job() [1548759237.011663] [16384.0] [pid=36156] get_worker() [1548759237.011688] [16384.1] [pid=36156] wproc_run_job() [1548759237.011713] [001.0] [pid=36156] macros_to_kvv() [1548759237.012886] [001.0] [pid=36156] clear_volatile_macros_r() [1548759237.012945] [064.1] [pid=36156] Making callbacks (type 1)... [1548759237.012969] [001.0] [pid=36156] handle_timed_event() end [1548759237.012990] [064.1] [pid=36156] Making callbacks (type 1)... [1548759237.013018] [008.1] [pid=36156] ** Event Check Loop [1548759237.013058] [008.1] [pid=36156] Next Event Time: Tue Jan 29 12:53:57 2019 [1548759237.013083] [008.1] [pid=36156] Current/Max Service Checks: 7/0 (inf% saturation)

And nagios.log:

[1548759237] SERVICE NOTIFICATION: nagios;Primergy;FreeRadius server;OK;notify-by-email;FreeRadius is UP [1548759237] SERVICE NOTIFICATION: smsadmin;Primergy;FreeRadius server;OK;notify-by-epager;FreeRadius is UP [1548759237] SERVICE ALERT: Primergy;FreeRadius server;OK;HARD;1;FreeRadius is UP

All services work as expected, but that job 193 (pid=39534): read() returned error 35 in /var/log/messages annoying.

Thanks!

vkakhnych commented 5 years ago

nagios-plugins-2.2.1_6,1       Plugins for Nagios
nagios4-4.4.2,1                Powerful network monitoring system```
rosri1992 commented 5 years ago

Can we reopen this issue, I'm facing this issue from latest version of Nagios v 4.4.3, I tried this options /configure --with-iobroker=poll & ./configure --with-iobroker=select even it didn't fix the issue. The changes recommended by @jfrickson in /lib/worker.c with LOG_NOTICE didn't fix the issue. I'm using RHEL 7 AWS Instance for nagios

LHozzan commented 5 years ago

Hello. Same problem here: /var/log/messages

Jun 19 08:11:00 localhost nagios: job 19317 (pid=11630): read() returned error 11
Jun 19 08:11:16 localhost nagios: job 19317 (pid=11763): read() returned error 11
Jun 19 08:11:17 localhost nagios: job 19317 (pid=11790): read() returned error 11
Jun 19 08:11:19 localhost nagios: job 19317 (pid=11796): read() returned error 11
Jun 19 08:11:19 localhost nagios: job 19317 (pid=11812): read() returned error 11
Jun 19 08:11:22 localhost nagios: job 19317 (pid=11819): read() returned error 11
Jun 19 08:11:24 localhost nagios: job 19318 (pid=11851): read() returned error 11

/var/log/nagios/nagios.log

[1560902414] External command error: Command failed
[1560902414] External command error: Command failed
[1560902415] External command error: Command failed
[1560902415] External command error: Command failed
[1560902416] External command error: Command failed
[1560902419] External command error: Command failed
[1560902420] External command error: Command failed
[1560902420] External command error: Command failed

CentOS 7, Nagios v4.4.3

sawolf commented 5 years ago

@LHozzan Can I get more information about your system? Are hosts/services updating their status as expected? Do you have Mod Gearman installed? If so, which version?

LHozzan commented 5 years ago

@Madlohe

Can I get more information about your system?

Of course, everything you need. I glad help you solve this problem. Please ask, what you want know.

Are hosts/services updating their status as expected?

Sometimes no. We can see, that random service is in UNKNOW state, Nagios generate alert and after few moments is in OK state. We used Nagios only for passive checks delivered by NSCA (v2.7) and NSCA-NG (v1.6). I try NRDP, but this problem isnt solved.

Do you have Mod Gearman installed? If so, which version?

My apologize, dont know, what you mean.

vkakhnych commented 5 years ago

I strongly recommend change these log strings. They are useless because at tht moment there are no these pids in a system and we have no idea which exactly commands failed. Please add there exactly command that returned that error. It will help to resolve it significantly.

sawolf commented 5 years ago

Agreed. I've changed the log message and pushed into maint.

inversecow commented 4 years ago

Ahoy folks,

Throwing my hat in the ring as also experiencing this issue on a Nagios XI 5.6.9 (large) installation. I am also observing quantities of these read() returned error 11 messages in my service logs.

Looking back at the thread, I will try and anticipate at least some of the previously requested info to help support triage (below). Do let me know should you require further information or details.

OS: RHEL 7.7 APP: Nagios XI 5.6.9 w/ nagios core (4.4.5) extras: no mod_gearman what-so-ever monitors: ACTIVE && PASSIVE checks (via NRDP) logs: systemd service (nagios) writing these out to /var/log/messages, did not observe the Failed to read() error in /usr/local/nagios/var/nagios.log. Compiled vs package: No compilation on my part, all of this comes from packages included in the official Nagios repositories

Notes: When I cross-checked my nagios.log, I see this coincides with when a notification is to be sent out to recipients. Looks like the alert did go out in the examples I looked at.

Error:

Jul 29 13:36:43 nagiosxi.fqdn nagios[112400]: job 1991 (pid=9018): read() returned error 11
Jul 29 13:36:43 nagiosxi.fqdn nagios[112400]: job 1991 (pid=9018): read() returned error 11
Jul 29 13:36:43 nagiosxi.fqdn nagios[112406]: job 1991 (pid=9026): read() returned error 11
Jul 29 13:38:43 nagiosxi.fqdn nagios[112400]: job 2224 (pid=12191): read() returned error 11

EDIT: Will report back after we upgrade to a more recent NagiosXI / Nagios CORE version (5.7.2 / 4.4.6).

Vudubond commented 3 years ago

Hello,

I noticed the same as @inversecow :

When I cross-checked my nagios.log, I see this coincides with when a notification is to be sent out to recipients. Looks like the alert did go out in the examples I looked at.

Mar 17 10:12:06 office nagios: job 499251 (pid=27436): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27449): read() returned error 11
Mar 17 10:12:06 office nagios: job 499249 (pid=27439): read() returned error 11
Mar 17 10:12:06 office nagios: job 499249 (pid=27439): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27436): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27444): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27449): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27436): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27444): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27444): read() returned error 11
Mar 17 10:12:06 office nagios: job 499249 (pid=27439): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27449): read() returned error 11
Mar 17 10:12:06 office nagios: job 499249 (pid=27439): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27449): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27436): read() returned error 11

In /usr/local/nagios/var/nagios.log

cat /usr/local/nagios/var/nagios.log | perl -pe 's/(\d+)/localtime($1)/e'

Shows:

[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION: 
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION: 
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION: 
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION: 
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION: 
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION: 
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION: 
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION: 
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION: 
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION: 
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION: 
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION: