Open AaronAutomation opened 2 years ago
Solved this issue and others I was having by rolling back to v4.4.4
I believe the issues you were seeing were resolved in 4.4.8, which this image is currently using.
4.4.8 still does this:
nsca[182]: There's already an NSCA server running (PID 33). Bailing out...
Maybe a clue in the syslog:
nsca[34]: Cannot remove pidfile '/var/run/nsca.pid' - check your privileges.
@tronyx
I see this issue from time to time.
nagios_1 | nsca[1727]: There's already an NSCA server running (PID 236). Bailing out...
nagios_1 | nsca[1728]: There's already an NSCA server running (PID 236). Bailing out...
nagios_1 | nsca[1729]: There's already an NSCA server running (PID 236). Bailing out...
(repeated)
The Nagios web UI was up and running, and inside the container the /var/run/nsca.pid file was present and contained the PID of a running process. I guess something is trying to launch another instance of NSCA and failing with that message. Here are the PID file contents and the currently running processes, including the /bin/bash session I started as root to get into the container.
root@68b427b3ea3f:/var/run# ls -la
total 40
drwxr-xr-x 1 root root 4096 May 19 10:26 .
drwxr-xr-x 1 root root 4096 Jan 30 23:17 ..
drwxr-xr-x 1 root root 4096 May 19 10:26 apache2
drwxrwxrwt 1 root root 4096 Jan 5 22:46 lock
drwxr-xr-x 2 root root 4096 Dec 12 03:04 mount
-rw-r--r-- 1 nagios nagios 4 May 18 20:42 nsca.pid
root@68b427b3ea3f:/var/run# cat nsca.pid
236
root@68b427b3ea3f:/var/run# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 4356 40 ? Ss May19 0:00 /bin/bash /usr/local/bin/start_nagios
root 228 0.0 0.0 2804 28 ? S May19 0:10 runsvdir -P /etc/service
root 229 0.0 0.0 2652 320 ? Ss May19 0:00 runsv postfix
root 230 0.0 0.0 2652 308 ? Ss May19 0:00 runsv rsyslog
root 231 0.0 0.0 2652 328 ? Ss May19 0:00 runsv apache
root 232 0.0 0.0 2652 328 ? Ss May19 0:00 runsv nagios
root 233 0.0 0.0 2652 472 ? Ss May19 13:54 runsv nsca
root 234 0.0 0.0 41224 848 ? S May19 0:04 /usr/lib/postfix/sbin/master -d -c /etc/postfix
nagios 235 0.0 0.0 62680 2396 ? S May19 1:38 /opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg
root 236 0.0 0.0 206372 704 ? Ss May19 0:27 /usr/sbin/apache2 -D NO_DETACH
root 237 0.0 0.0 152428 844 ? Sl May19 2:28 rsyslogd -n -f /etc/rsyslog.conf
nagios 245 0.0 0.0 34540 1364 ? S May19 4:39 /opt/nagios/bin/nagios --worker /opt/nagios/var/rw/nagios.qh
nagios 246 0.0 0.0 34540 1352 ? S May19 5:11 /opt/nagios/bin/nagios --worker /opt/nagios/var/rw/nagios.qh
nagios 247 0.0 0.0 34540 1340 ? S May19 4:37 /opt/nagios/bin/nagios --worker /opt/nagios/var/rw/nagios.qh
nagios 248 0.0 0.0 34540 1352 ? S May19 5:12 /opt/nagios/bin/nagios --worker /opt/nagios/var/rw/nagios.qh
nagios 249 0.0 0.0 34540 1360 ? S May19 5:08 /opt/nagios/bin/nagios --worker /opt/nagios/var/rw/nagios.qh
nagios 250 0.0 0.0 34540 1276 ? S May19 4:34 /opt/nagios/bin/nagios --worker /opt/nagios/var/rw/nagios.qh
nagios 251 0.0 0.0 60936 40 ? S May19 0:42 /opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg
nagios 257 0.0 0.0 206580 3668 ? S May19 0:00 /usr/sbin/apache2 -D NO_DETACH
nagios 258 0.0 0.0 206596 3708 ? S May19 0:00 /usr/sbin/apache2 -D NO_DETACH
nagios 259 0.0 0.0 206580 3584 ? S May19 0:00 /usr/sbin/apache2 -D NO_DETACH
nagios 260 0.0 0.0 206580 3632 ? S May19 0:00 /usr/sbin/apache2 -D NO_DETACH
postfix 263 0.0 0.0 41364 1564 ? S May19 0:01 qmgr -l -t unix -d -u
nagios 643 0.0 0.1 206580 4208 ? S 11:05 0:00 /usr/sbin/apache2 -D NO_DETACH
nagios 648 0.0 0.1 206580 4204 ? S 11:05 0:00 /usr/sbin/apache2 -D NO_DETACH
nagios 650 0.0 0.1 206580 4104 ? S 11:05 0:00 /usr/sbin/apache2 -D NO_DETACH
nagios 651 0.0 0.1 206604 4240 ? S 11:05 0:00 /usr/sbin/apache2 -D NO_DETACH
nagios 685 0.0 0.1 206580 3928 ? S 11:06 0:00 /usr/sbin/apache2 -D NO_DETACH
root 1752 0.0 0.0 4620 3836 pts/0 Ss 11:17 0:00 /bin/bash
root 1798 0.0 0.0 7056 1544 pts/0 R+ 11:17 0:00 ps aux
nagios 4038 0.0 0.2 206580 8804 ? S May19 0:00 /usr/sbin/apache2 -D NO_DETACH
postfix 27212 0.0 0.1 41244 6440 ? S 09:53 0:00 pickup -l -t unix -d -u -c
The date on the PID file (May 18) doesn't match what I assume is the start time of the processes (May 19). This might be a hint.
I removed the container and recreated it, and the problem went away. I thought it might have been triggered by restarting the container, but restarting it worked fine. I actually wonder whether this is caused by an unclean shutdown of the container, which would leave the PID file behind, followed by a restart.
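One way to test the stale-PID-file theory: in the `ps aux` listing above, PID 236 belongs to /usr/sbin/apache2, not to NSCA, so the number in nsca.pid may simply have been reused by another process after a restart. Below is a minimal sketch of a check for that, assuming a Linux container with /proc mounted. The function name `pidfile_is_stale` is made up for illustration and is not part of the image, and the expected command name "nsca" is an assumption.

```shell
#!/bin/sh
# pidfile_is_stale PIDFILE EXPECTED_NAME   (hypothetical helper, not from the image)
# Exit status 0 means "stale": the file holds no valid PID, the PID is not
# running, or the PID belongs to a process with a different command name.
# Exit status 1 means the file is absent or points at the expected process.
# Assumes Linux: the command name is read from /proc/<pid>/comm.
pidfile_is_stale() {
    pidfile=$1
    expected=$2

    # No PID file at all: nothing to call stale.
    [ -f "$pidfile" ] || return 1

    # Read the PID and strip whitespace/newlines.
    pid=$(tr -d '[:space:]' < "$pidfile")

    # Empty or non-numeric content counts as stale.
    case $pid in
        ''|*[!0-9]*) return 0 ;;
    esac

    # If /proc/<pid>/comm cannot be read, no such process exists.
    actual=$(cat "/proc/$pid/comm" 2>/dev/null) || return 0

    # Stale when the PID was reused by some other program.
    [ "$actual" != "$expected" ]
}
```

Inside the container this could be run as `pidfile_is_stale /var/run/nsca.pid nsca && echo stale`. Comparing the command name rather than only checking that the PID exists is the point here, since a reused PID would otherwise look like a live NSCA server.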
Version: latest, image hash 79a7fc3a2f88 (https://hub.docker.com/layers/jasonrivers/nagios/latest/images/sha256-a341182a89e6888c27cc283ca22e36b9f9ebd96deaa4b76063bdaeb8f025a16d?context=explore)
Web access to Nagios goes down after resetting my server. The logs show:
httpd (pid 18) already running
nsca[20039]: There's already an NSCA server running (PID 17). Bailing out...
Removing those PID files manually inside the Docker container fixes it until the next reboot.
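The manual cleanup could be automated along these lines. This is only a sketch under assumptions: the function name `remove_leftover_pidfiles` is hypothetical, and it is only safe to call before any of the services have been started (for example early in the container's start script), because at that point any PID file must be left over from a previous run. It also needs to run as root, given the "Cannot remove pidfile ... check your privileges" message quoted earlier.

```shell
#!/bin/sh
# remove_leftover_pidfiles FILE...   (hypothetical helper, not from the image)
# Deletes each listed PID file if it exists and prints what was removed.
# Only call this before the daemons start; it does not check whether
# the recorded PID is still alive.
remove_leftover_pidfiles() {
    for f in "$@"; do
        if [ -f "$f" ]; then
            echo "removing leftover PID file: $f"
            rm -f -- "$f"
        fi
    done
}

# Example call. /var/run/nsca.pid is the path from the syslog message above;
# the Apache PID file path is an assumption and may differ in the image:
#   remove_leftover_pidfiles /var/run/nsca.pid /var/run/apache2/apache2.pid
```

Paths are passed as arguments rather than hard-coded so the list can be adjusted to wherever the image actually writes its PID files.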