Open archiecobbs opened 8 years ago
Those lines just say the nagios process got an EAGAIN error while trying to read from the child processes, and had to retry. I will probably stop logging that in some future but want to leave it there for a while. I could change the log level to LOG_NOTICE instead of LOG_ERR, if that would help.
Thanks. It doesn't sound like this is really an error, so it shouldn't be logged as one, correct?
If in fact you are reading from a file descriptor marked for non-blocking, then EAGAIN
is completely normal and expected. Why does it need to be logged at all?
Some people still occasionally have problems, so I'm logging some info to help debug. It really should never have been LOG_ERR but I copied/pasted and didn't look close enough.
I've changed it to LOG_NOTICE in branch 'maint'.
I get these messages at every 10 mins, I guess one of my service check is causing this. Any idea how to find which check is causing this error.
Thanks.
@amgokak what version of Core are you using?
Nagios core 4.3.4
Are you also seeing a 'Failed to read()' line in the nagios.log during this time?
Also, do you know with iobroker method you're using? epoll
, poll
, or select
? Your configure output should give you a clue.
Yes I see below error Feb 16 14:46:21 testhost nagios: job 188 (pid=12588): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 181 (pid=12517): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 186 (pid=12606): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 189 (pid=12589): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 189 (pid=12589): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 189 (pid=12589): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 185 (pid=12481): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 197 (pid=12650): read() returned error 11 Feb 16 14:46:21 testhost nagios: job 185 (pid=12481): read() returned error 11
Not sure about iobroker method , where can I check it ? I have configured mod_gearman2 broker module
That is not the error I asked if you saw. I asked if you are also seeing a 'Failed to read()' error in the nagios.log during this same time. From the looks of what you posted, it doesn't seem that way.
Also as stated, you can tell from your configure output. In the directory you compiled nagios from source, there should be a file: lib/iobroker.h
Please run the following command to determine:
cat lib/iobroker.h | grep IOBROKER_USES
Sorry for late reply, I dont see any error like "Failed to read()" in nagios.log And regarding lib/iobroker.h , I cant find this file may be because I am using standalone Nagios core 4.3.4 and it is installed using rpm package.
@hedenface Which iobroker is the recommended to prevent the issue?
$ cat lib/iobroker.h | grep IOBROKER_USES
#define IOBROKER_USES_EPOLL 1
/* #undef IOBROKER_USES_POLL */
/* #undef IOBROKER_USES_SELECT */
@knweiss epoll is likely the best option. If someone is having the problem, and they recompile using
./configure --with-iobroker=poll
or
./configure --with-iobroker=select
And that changes the outcome, I would be very interested in hearing about it.
Having the same issue on different FreeBSD versions (10.4 32 bit and 9.3 64 bit) with nagios 4.2 from packages. But error in my case is 35:
System log example:
Jan 29 12:53:57 netfire nagios: SERVICE NOTIFICATION: nagios;Primergy;FreeRadius server;OK;notify-by-email;FreeRadius is UP
Jan 29 12:53:57 netfire nagios: SERVICE NOTIFICATION: smsadmin;Primergy;FreeRadius server;OK;notify-by-epager;FreeRadius is UP
Jan 29 12:53:57 netfire nagios: SERVICE ALERT: Primergy;FreeRadius server;OK;HARD;1;FreeRadius is UP
Jan 29 12:53:57 netfire nagios: job 193 (pid=39534): read() returned error 35
Jan 29 12:53:57 netfire nagios: job 193 (pid=39533): read() returned error 35
Jan 29 12:53:57 netfire postfix/pickup[26445]: 23ED371E3C5: uid=181 from=
There are no errors in nagios.log. I turned on debug for nagios but there is also nothing interesting. Nagios debug log for the same incident:
[1548759237.010345] [12288.1] [pid=36156] ## Polling 0ms; sockets=6; events=403; iobs=0x28815030 [1548759237.010959] [001.0] [pid=36156] handle_timed_event() start [1548759237.011005] [064.1] [pid=36156] Making callbacks (type 1)... [1548759237.011037] [008.0] [pid=36156] Timed Event Type: EVENT_SERVICE_CHECK, Run Time: Tue Jan 29 12:53:57 2019 [1548759237.011057] [008.0] [pid=36156] Service Check Event ==> Host: 'Primergy', Service: 'FreeRadius server', Options: 0, Latency: 0.008738 sec [1548759237.011079] [001.0] [pid=36156] run_scheduled_service_check() start [1548759237.011096] [016.0] [pid=36156] Attempting to run scheduled check of service 'FreeRadius server' on host 'Primergy': check options=0, latency=0.008738 [1548759237.011117] [001.0] [pid=36156] run_async_service_check() [1548759237.011134] [001.0] [pid=36156] check_service_check_viability() [1548759237.011153] [001.0] [pid=36156] check_time_against_period() [1548759237.011183] [001.0] [pid=36156] _get_matching_timerange() [1548759237.011214] [001.0] [pid=36156] check_service_dependencies() [1548759237.011233] [001.0] [pid=36156] check_service_parents() [1548759237.011256] [064.1] [pid=36156] Making callbacks (type 6)... [1548759237.011274] [016.0] [pid=36156] Checking service 'FreeRadius server' on host 'Primergy'... [1548759237.011331] [001.0] [pid=36156] get_raw_command_line_r() [1548759237.011350] [001.0] [pid=36156] process_macros_r() [1548759237.011367] [2048.1] [pid=36156] ** BEGIN MACRO PROCESSING * [1548759237.011383] [2048.1] [pid=36156] Processing: 'nagios' [1548759237.011403] [2048.1] [pid=36156] Done. Final output: 'nagios' [1548759237.011420] [2048.1] [pid=36156] * END MACRO PROCESSING [1548759237.011437] [001.0] [pid=36156] process_macros_r() [1548759237.011453] [2048.1] [pid=36156] BEGIN MACRO PROCESSING * [1548759237.011469] [2048.1] [pid=36156] Processing: 'nagios' [1548759237.011487] [2048.1] [pid=36156] Done. Final output: 'nagios' [1548759237.011504] [2048.1] [pid=36156] ** END MACRO PROCESSING * [1548759237.011520] [001.0] [pid=36156] process_macros_r() [1548759237.011537] [2048.1] [pid=36156] BEGIN MACRO PROCESSING * [1548759237.011553] [2048.1] [pid=36156] Processing: '/usr/local/etc/nagios/scripts/check_radius.sh $ARG1$ $ARG2$' [1548759237.011579] [2048.1] [pid=36156] Done. Final output: '/usr/local/etc/nagios/scripts/check_radius.sh nagios nagios' [1548759237.011597] [2048.1] [pid=36156] ** END MACRO PROCESSING * [1548759237.011621] [064.1] [pid=36156] Making callbacks (type 6)... [1548759237.011644] [16384.0] [pid=36156] create_job() [1548759237.011663] [16384.0] [pid=36156] get_worker() [1548759237.011688] [16384.1] [pid=36156] wproc_run_job() [1548759237.011713] [001.0] [pid=36156] macros_to_kvv() [1548759237.012886] [001.0] [pid=36156] clear_volatile_macros_r() [1548759237.012945] [064.1] [pid=36156] Making callbacks (type 1)... [1548759237.012969] [001.0] [pid=36156] handle_timed_event() end [1548759237.012990] [064.1] [pid=36156] Making callbacks (type 1)... [1548759237.013018] [008.1] [pid=36156] ** Event Check Loop [1548759237.013058] [008.1] [pid=36156] Next Event Time: Tue Jan 29 12:53:57 2019 [1548759237.013083] [008.1] [pid=36156] Current/Max Service Checks: 7/0 (inf% saturation)
And nagios.log:
[1548759237] SERVICE NOTIFICATION: nagios;Primergy;FreeRadius server;OK;notify-by-email;FreeRadius is UP [1548759237] SERVICE NOTIFICATION: smsadmin;Primergy;FreeRadius server;OK;notify-by-epager;FreeRadius is UP [1548759237] SERVICE ALERT: Primergy;FreeRadius server;OK;HARD;1;FreeRadius is UP
All services work as expected, but that job 193 (pid=39534): read() returned error 35
in /var/log/messages annoying.
Thanks!
nagios-plugins-2.2.1_6,1 Plugins for Nagios
nagios4-4.4.2,1 Powerful network monitoring system```
Can we reopen this issue, I'm facing this issue from latest version of Nagios v 4.4.3, I tried this options /configure --with-iobroker=poll & ./configure --with-iobroker=select even it didn't fix the issue. The changes recommended by @jfrickson in /lib/worker.c with LOG_NOTICE didn't fix the issue. I'm using RHEL 7 AWS Instance for nagios
Hello.
Same problem here:
/var/log/messages
Jun 19 08:11:00 localhost nagios: job 19317 (pid=11630): read() returned error 11
Jun 19 08:11:16 localhost nagios: job 19317 (pid=11763): read() returned error 11
Jun 19 08:11:17 localhost nagios: job 19317 (pid=11790): read() returned error 11
Jun 19 08:11:19 localhost nagios: job 19317 (pid=11796): read() returned error 11
Jun 19 08:11:19 localhost nagios: job 19317 (pid=11812): read() returned error 11
Jun 19 08:11:22 localhost nagios: job 19317 (pid=11819): read() returned error 11
Jun 19 08:11:24 localhost nagios: job 19318 (pid=11851): read() returned error 11
/var/log/nagios/nagios.log
[1560902414] External command error: Command failed
[1560902414] External command error: Command failed
[1560902415] External command error: Command failed
[1560902415] External command error: Command failed
[1560902416] External command error: Command failed
[1560902419] External command error: Command failed
[1560902420] External command error: Command failed
[1560902420] External command error: Command failed
CentOS 7, Nagios v4.4.3
@LHozzan Can I get more information about your system? Are hosts/services updating their status as expected? Do you have Mod Gearman installed? If so, which version?
@Madlohe
Can I get more information about your system?
Of course, everything you need. I glad help you solve this problem. Please ask, what you want know.
Are hosts/services updating their status as expected?
Sometimes no. We can see, that random service is in UNKNOW state, Nagios generate alert and after few moments is in OK state. We used Nagios only for passive checks delivered by NSCA (v2.7) and NSCA-NG (v1.6). I try NRDP, but this problem isnt solved.
Do you have Mod Gearman installed? If so, which version?
My apologize, dont know, what you mean.
I strongly recommend change these log strings. They are useless because at tht moment there are no these pids in a system and we have no idea which exactly commands failed. Please add there exactly command that returned that error. It will help to resolve it significantly.
Agreed. I've changed the log message and pushed into maint.
Ahoy folks,
Throwing my hat in the ring as also experiencing this issue on a Nagios XI 5.6.9 (large) installation.
I am also observing quantities of these read() returned error 11
messages in my service logs.
Looking back at the thread, I will try and anticipate at least some of the previously requested info to help support triage (below). Do let me know should you require further information or details.
OS: RHEL 7.7
APP: Nagios XI 5.6.9 w/ nagios core (4.4.5)
extras: no mod_gearman what-so-ever
monitors: ACTIVE && PASSIVE checks (via NRDP)
logs: systemd service (nagios
) writing these out to /var/log/messages
, did not observe the Failed to read()
error in /usr/local/nagios/var/nagios.log
.
Compiled vs package: No compilation on my part, all of this comes from packages included in the official Nagios repositories
Notes:
When I cross-checked my nagios.log
, I see this coincides with when a notification is to be sent out to recipients.
Looks like the alert did go out in the examples I looked at.
Error:
Jul 29 13:36:43 nagiosxi.fqdn nagios[112400]: job 1991 (pid=9018): read() returned error 11
Jul 29 13:36:43 nagiosxi.fqdn nagios[112400]: job 1991 (pid=9018): read() returned error 11
Jul 29 13:36:43 nagiosxi.fqdn nagios[112406]: job 1991 (pid=9026): read() returned error 11
Jul 29 13:38:43 nagiosxi.fqdn nagios[112400]: job 2224 (pid=12191): read() returned error 11
EDIT: Will report back after we upgrade to a more recent NagiosXI / Nagios CORE version (5.7.2 / 4.4.6).
Hello,
I noticed the same as @inversecow :
When I cross-checked my nagios.log, I see this coincides with when a notification is to be sent out to recipients. Looks like the alert did go out in the examples I looked at.
Mar 17 10:12:06 office nagios: job 499251 (pid=27436): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27449): read() returned error 11
Mar 17 10:12:06 office nagios: job 499249 (pid=27439): read() returned error 11
Mar 17 10:12:06 office nagios: job 499249 (pid=27439): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27436): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27444): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27449): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27436): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27444): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27444): read() returned error 11
Mar 17 10:12:06 office nagios: job 499249 (pid=27439): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27449): read() returned error 11
Mar 17 10:12:06 office nagios: job 499249 (pid=27439): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27449): read() returned error 11
Mar 17 10:12:06 office nagios: job 499251 (pid=27436): read() returned error 11
In /usr/local/nagios/var/nagios.log
cat /usr/local/nagios/var/nagios.log | perl -pe 's/(\d+)/localtime($1)/e'
Shows:
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION:
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION:
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION:
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION:
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION:
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION:
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION:
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION:
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION:
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION:
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION:
[Wed Mar 17 10:12:06 2021] SERVICE NOTIFICATION:
After upgrading from 4.1.1 -> 4.2.1, I'm seeing the same problem as described in this thread.