I'm investigating an odd, hard-to-reproduce problem with Apache using the prefork MPM. It seems to happen only inside Docker when httpd is started through systemctl (the docker-systemctl-replacement script, systemctl.py).
The two main issues I've observed are:
1) On startup, Apache will not fork any new children beyond its StartServers + MinSpareServers setting. Sometimes it performs one fork event and then stops, never forking any new children after that.
2) When Apache retires children via MaxRequestsPerChild to cycle through them, the children become zombies but still occupy an idle slot. Eventually the zombies take up all of the slots and DoS the whole server.
Since both of these problems are intermittent, they are really frustrating to isolate and debug. The most reliable reproduction I have is:
Dockerfile
# syntax=docker/dockerfile:1.3-labs
FROM centos:centos7
RUN yum install -y httpd
RUN curl https://raw.githubusercontent.com/gdraheim/docker-systemctl-replacement/master/files/docker/systemctl.py > /usr/bin/systemctl \
&& systemctl enable httpd
COPY --chmod=755 <<EOF /var/www/cgi-bin/sleeper.cgi
#!/bin/bash
/bin/sleep 0.2
echo Content-Type: text/plain
echo
echo Hello World
EOF
COPY <<EOF /etc/httpd/conf.d/extra-config.conf
ExtendedStatus on
<Location /server-status>
SetHandler server-status
Order allow,deny
Deny from none
Allow from all
</Location>
StartServers 2
MinSpareServers 5
MaxSpareServers 20
ServerLimit 2048
MaxClients 2048
MaxRequestWorkers 2048
MaxRequestsPerChild 10
EOF
# This CMD recreates the issue
CMD ["/usr/bin/systemctl", "-vvv"]
# Comment out the CMD above and uncomment these two lines to see it work fine
#STOPSIGNAL SIGWINCH
#CMD ["/usr/sbin/httpd", "-DFOREGROUND"]
On the client side, I used something like this, which gave the best chance of triggering the problem:
ab -n 1000000 -c 64 http://localhost:8081/cgi-bin/sleeper.cgi
That is, no keep-alive requests, and hammer on it after startup. With a concurrency of 8 it happens more slowly: it takes a few minutes before the zombies build up and DoS the server.
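To watch the slots fill up while ab is running, rather than trying to catch the status page by hand, something like the following sketch could poll the machine-readable form of the status page enabled in the config above. It assumes the same host and port as the ab command, and it relies on mod_status's `?auto` plain-text output (the `BusyWorkers`, `IdleWorkers` and `Scoreboard` fields). The function names are made up for illustration.

```python
"""Poll mod_status's plain-text page and log worker-slot usage."""
import time
import urllib.request
from collections import Counter

# Assumption: same host/port as the ab command; "?auto" requests plain text.
STATUS_URL = "http://localhost:8081/server-status?auto"


def parse_status(text):
    """Turn the 'Key: value' lines of server-status?auto into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def summarize(fields):
    """Return (busy, idle, open_slots, per-state counts) from parsed fields."""
    # In the scoreboard, '.' is an open slot with no process, '_' is a worker
    # waiting for a connection, 'W' is a worker sending a reply.
    board = Counter(fields.get("Scoreboard", ""))
    busy = int(fields.get("BusyWorkers", 0))
    idle = int(fields.get("IdleWorkers", 0))
    return busy, idle, board["."], board


def poll(url=STATUS_URL, interval=1.0):
    """Print one summary line per interval until interrupted."""
    while True:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                text = resp.read().decode("ascii", "replace")
            busy, idle, open_slots, board = summarize(parse_status(text))
            print("%s busy=%d idle=%d open=%d %s" % (
                time.strftime("%H:%M:%S"), busy, idle, open_slots, dict(board)))
        except OSError as exc:
            # Connection refused or timed out: likely the point where all
            # slots are taken and the server has stopped answering.
            print("%s no answer: %s" % (time.strftime("%H:%M:%S"), exc))
        time.sleep(interval)
```

Calling `poll()` from a Python shell on the client prints one line per second. A steadily shrinking busy-plus-idle total compared with what ps shows inside the container would indicate slots held by processes that no longer exist.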
After things are locked up, the process table looks like this:
PID TTY STAT TIME COMMAND
1 ? Ss 0:00 /usr/bin/python2 /usr/bin/systemctl -vvv
8 ? Ss 0:00 /usr/sbin/httpd -DFOREGROUND
625 ? Z 0:00 [httpd] <defunct>
1808 ? Z 0:00 [httpd] <defunct>
1809 ? Z 0:00 [httpd] <defunct>
1811 ? Z 0:00 [httpd] <defunct>
1814 ? Z 0:00 [httpd] <defunct>
1815 ? Z 0:00 [httpd] <defunct>
1821 ? Z 0:00 [httpd] <defunct>
1822 ? Z 0:00 [httpd] <defunct>
1823 ? Z 0:00 [httpd] <defunct>
1828 ? Z 0:00 [httpd] <defunct>
1832 ? Z 0:00 [httpd] <defunct>
1836 ? Z 0:00 [httpd] <defunct>
1840 ? Z 0:00 [httpd] <defunct>
1842 ? Z 0:00 [httpd] <defunct>
. . .
And if you can catch the server-status page in time, it looks something like this:
I've been testing this for around a week now, and have gone through many permutations. Nothing has worked so far, but some of my tests at least delayed the inevitable for a while.
Some of the things I've tried:
Raising StartServers and MinSpareServers
Switching Apache's systemd unit from Type=notify to Type=simple
Adding a sleep call via ExecStartPre, in case there is some kind of race condition with file descriptors
Switching CMD to start Apache in the foreground, which seems to work and does not recreate the problems
I'm out of ideas on what to try next. Watching the children die under strace, it looks like it has something to do with waiting on file descriptors to close, but it's difficult to get more information out of a zombie process.
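One thing that can still be read from a zombie is its parent PID, which shows which process is failing to reap it (in the table above, either httpd at PID 8 or systemctl.py at PID 1). A minimal sketch, assuming a Linux /proc filesystem and that it is run inside the container so the PIDs match; the function names are hypothetical:

```python
"""List zombie processes and the parents that have not reaped them."""
import os


def parse_stat(stat_line):
    """Extract (pid, comm, state, ppid) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    head, _, rest = stat_line.partition("(")
    comm, _, tail = rest.rpartition(")")
    fields = tail.split()
    return int(head), comm, fields[0], int(fields[1])


def list_zombies(proc="/proc"):
    """Return [(pid, comm, ppid)] for every process in state 'Z'."""
    zombies = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as fh:
                pid, comm, state, ppid = parse_stat(fh.read())
        except (IOError, OSError):
            continue  # the process vanished while we were scanning
        if state == "Z":
            zombies.append((pid, comm, ppid))
    return zombies


def report():
    """Print the zombies grouped by the parent that still owns them."""
    by_parent = {}
    for pid, comm, ppid in list_zombies():
        by_parent.setdefault(ppid, []).append(pid)
    for ppid, pids in sorted(by_parent.items()):
        print("parent %d has %d unreaped children: %s" % (ppid, len(pids), pids))
```

Calling `report()` once the server has locked up groups the defunct httpd entries by parent, which narrows down whether the missing wait is on the httpd side or on the PID 1 side.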
Any advice or help would be appreciated on this!