JasonRivers / Docker-Nagios

Docker image for Nagios
MIT License

Apache2 and NSCA not removing PID on reboot #139

Open AaronAutomation opened 2 years ago

AaronAutomation commented 2 years ago

Web access to Nagios goes down after resetting my server. The logs show:

httpd (pid 18) already running
nsca[20039]: There's already an NSCA server running (PID 17). Bailing out...

Removing those PID files manually in the Docker container fixes it until the next reboot.

AaronAutomation commented 2 years ago

Solved this issue and others I was having by rolling back to v4.4.4

tronyx commented 1 year ago

I believe the issues you were seeing were resolved in 4.4.8, which this image is currently using.

Innsai commented 1 year ago

4.4.8 still does this:

nsca[182]: There's already an NSCA server running (PID 33). Bailing out...

Maybe there's a clue in the syslog:

nsca[34]: Cannot remove pidfile '/var/run/nsca.pid' - check your privileges.

@tronyx
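
The "check your privileges" message may come down to directory permissions. The listing further down this thread shows /var/run owned by root with mode drwxr-xr-x, while nsca.pid itself belongs to nagios. Deleting a file needs write permission on the directory that contains it, so a process running as nagios could own the file and still be unable to remove it. Whether NSCA is actually running as nagios when it exits is not confirmed in this thread. A minimal sketch of that check, where can_unlink is a hypothetical helper name and not part of NSCA or this image:

```shell
#!/bin/sh
# Sketch: report whether the current user could unlink a given file.
# Unlinking needs write and search (execute) permission on the parent
# directory; the file's own owner and mode do not decide it.
# Note: when run as root, the -w test normally succeeds regardless of mode.
# can_unlink is a hypothetical helper name.
can_unlink() {
    dir=$(dirname "$1")
    [ -w "$dir" ] && [ -x "$dir" ]
}
```

To try it, run it as the same user the daemon runs as, for example `can_unlink /var/run/nsca.pid || echo "cannot remove pidfile"`.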

gurubobnz commented 5 months ago

I see this issue from time to time.

nagios_1  | nsca[1727]: There's already an NSCA server running (PID 236).  Bailing out...
nagios_1  | nsca[1728]: There's already an NSCA server running (PID 236).  Bailing out...
nagios_1  | nsca[1729]: There's already an NSCA server running (PID 236).  Bailing out...

(repeated)

The Nagios web UI was up and running, and in the container the /var/run/nsca.pid file was present and contained the PID of a process that was still running. I guess something is trying to launch another instance of NSCA and is failing with that message. Here are the PID file contents and the currently running processes, including the /bin/bash that I ran as root to get into the container.

root@68b427b3ea3f:/var/run# ls -la 
total 40
drwxr-xr-x 1 root   root   4096 May 19 10:26 .
drwxr-xr-x 1 root   root   4096 Jan 30 23:17 ..
drwxr-xr-x 1 root   root   4096 May 19 10:26 apache2
drwxrwxrwt 1 root   root   4096 Jan  5 22:46 lock
drwxr-xr-x 2 root   root   4096 Dec 12 03:04 mount
-rw-r--r-- 1 nagios nagios    4 May 18 20:42 nsca.pid
root@68b427b3ea3f:/var/run# cat nsca.pid 
236
root@68b427b3ea3f:/var/run# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   4356    40 ?        Ss   May19   0:00 /bin/bash /usr/local/bin/start_nagios
root       228  0.0  0.0   2804    28 ?        S    May19   0:10 runsvdir -P /etc/service
root       229  0.0  0.0   2652   320 ?        Ss   May19   0:00 runsv postfix
root       230  0.0  0.0   2652   308 ?        Ss   May19   0:00 runsv rsyslog
root       231  0.0  0.0   2652   328 ?        Ss   May19   0:00 runsv apache
root       232  0.0  0.0   2652   328 ?        Ss   May19   0:00 runsv nagios
root       233  0.0  0.0   2652   472 ?        Ss   May19  13:54 runsv nsca
root       234  0.0  0.0  41224   848 ?        S    May19   0:04 /usr/lib/postfix/sbin/master -d -c /etc/postfix
nagios     235  0.0  0.0  62680  2396 ?        S    May19   1:38 /opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg
root       236  0.0  0.0 206372   704 ?        Ss   May19   0:27 /usr/sbin/apache2 -D NO_DETACH
root       237  0.0  0.0 152428   844 ?        Sl   May19   2:28 rsyslogd -n -f /etc/rsyslog.conf
nagios     245  0.0  0.0  34540  1364 ?        S    May19   4:39 /opt/nagios/bin/nagios --worker /opt/nagios/var/rw/nagios.qh
nagios     246  0.0  0.0  34540  1352 ?        S    May19   5:11 /opt/nagios/bin/nagios --worker /opt/nagios/var/rw/nagios.qh
nagios     247  0.0  0.0  34540  1340 ?        S    May19   4:37 /opt/nagios/bin/nagios --worker /opt/nagios/var/rw/nagios.qh
nagios     248  0.0  0.0  34540  1352 ?        S    May19   5:12 /opt/nagios/bin/nagios --worker /opt/nagios/var/rw/nagios.qh
nagios     249  0.0  0.0  34540  1360 ?        S    May19   5:08 /opt/nagios/bin/nagios --worker /opt/nagios/var/rw/nagios.qh
nagios     250  0.0  0.0  34540  1276 ?        S    May19   4:34 /opt/nagios/bin/nagios --worker /opt/nagios/var/rw/nagios.qh
nagios     251  0.0  0.0  60936    40 ?        S    May19   0:42 /opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg
nagios     257  0.0  0.0 206580  3668 ?        S    May19   0:00 /usr/sbin/apache2 -D NO_DETACH
nagios     258  0.0  0.0 206596  3708 ?        S    May19   0:00 /usr/sbin/apache2 -D NO_DETACH
nagios     259  0.0  0.0 206580  3584 ?        S    May19   0:00 /usr/sbin/apache2 -D NO_DETACH
nagios     260  0.0  0.0 206580  3632 ?        S    May19   0:00 /usr/sbin/apache2 -D NO_DETACH
postfix    263  0.0  0.0  41364  1564 ?        S    May19   0:01 qmgr -l -t unix -d -u
nagios     643  0.0  0.1 206580  4208 ?        S    11:05   0:00 /usr/sbin/apache2 -D NO_DETACH
nagios     648  0.0  0.1 206580  4204 ?        S    11:05   0:00 /usr/sbin/apache2 -D NO_DETACH
nagios     650  0.0  0.1 206580  4104 ?        S    11:05   0:00 /usr/sbin/apache2 -D NO_DETACH
nagios     651  0.0  0.1 206604  4240 ?        S    11:05   0:00 /usr/sbin/apache2 -D NO_DETACH
nagios     685  0.0  0.1 206580  3928 ?        S    11:06   0:00 /usr/sbin/apache2 -D NO_DETACH
root      1752  0.0  0.0   4620  3836 pts/0    Ss   11:17   0:00 /bin/bash
root      1798  0.0  0.0   7056  1544 pts/0    R+   11:17   0:00 ps aux
nagios    4038  0.0  0.2 206580  8804 ?        S    May19   0:00 /usr/sbin/apache2 -D NO_DETACH
postfix  27212  0.0  0.1  41244  6440 ?        S    09:53   0:00 pickup -l -t unix -d -u -c

The date on the PID file (May 18) doesn't match what I assume is the start date of the processes (May 19). This might be a hint.
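
One detail in the listing above supports the stale-file reading: nsca.pid contains 236, but PID 236 in the ps output is /usr/sbin/apache2, and no nsca process appears at all. A liveness check that only asks "is some process with this PID alive?" would be fooled by that. The sketch below compares the command name as well. It assumes a Linux container with /proc mounted, and proc_name and pidfile_is_stale are hypothetical helper names, not something shipped by NSCA or this image.

```shell
#!/bin/sh
# Sketch: decide whether a pidfile is stale by comparing the command name
# of the recorded PID with the daemon we expect. Helper names are hypothetical.

# Print the command name of PID $1, or nothing if no such process exists.
# Reads /proc/<pid>/comm (Linux only; the kernel truncates it to 15 chars).
proc_name() {
    cat "/proc/$1/comm" 2>/dev/null
}

# Return 0 (stale) when pidfile $1 exists but holds garbage, a dead PID,
# or the PID of a process whose command name is not $2.
# Return 1 when the file is absent or points at a live $2 process.
pidfile_is_stale() {
    [ -f "$1" ] || return 1
    pid=$(cat "$1" 2>/dev/null)
    case $pid in
        ''|*[!0-9]*) return 0 ;;   # empty or non-numeric content
    esac
    [ "$(proc_name "$pid")" != "$2" ]
}
```

With the state shown in the listing, `pidfile_is_stale /var/run/nsca.pid nsca` would be expected to succeed, because PID 236 resolves to apache2 rather than nsca.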

I removed the container and recreated it, and the problem went away. I thought it might have been triggered by restarting the container, but a restart worked fine. I wonder whether this is actually caused by an unclean shutdown of the container, which would leave the PID file behind for the next start to trip over.
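
If the unclean-shutdown theory holds, one possible workaround is to delete leftover pidfiles at the very start of the container, before any service is launched. At that point no daemon from the previous run can still be alive, so any pidfile found is stale by definition. The sketch below is only an illustration of that idea: clear_pidfiles is a hypothetical helper name, and nothing in this thread confirms that the image's start script does or should work this way.

```shell
#!/bin/sh
# Sketch: remove leftover pidfiles before any service is started.
# Intended to run once, early in the container's start script, when no
# daemon from a previous run can still be alive.
# clear_pidfiles is a hypothetical helper name.
clear_pidfiles() {
    for f in "$@"; do
        if [ -e "$f" ]; then
            # Report each file that was actually removed.
            rm -f -- "$f" && echo "removed stale pidfile: $f"
        fi
    done
}
```

A call such as `clear_pidfiles /var/run/nsca.pid` near the top of the start script (the ps output shows /usr/local/bin/start_nagios as PID 1) would cover the NSCA case. The Apache pidfile path is not shown in this thread, so it should be added only after confirming where it lives in the image.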

Version: latest, image hash 79a7fc3a2f88 (https://hub.docker.com/layers/jasonrivers/nagios/latest/images/sha256-a341182a89e6888c27cc283ca22e36b9f9ebd96deaa4b76063bdaeb8f025a16d?context=explore)