gdraheim / docker-systemctl-replacement

docker systemctl replacement - allows to deploy to systemd-controlled containers without starting an actual systemd daemon (e.g. centos7, ubuntu16)
European Union Public License 1.2

Apache prefork problems with systemctl #140

Open calh opened 2 years ago

calh commented 2 years ago

I'm investigating an odd, hard-to-reproduce problem with Apache using the prefork MPM. It seems to happen only inside Docker when using systemctl.

The two main issues I've observed are:

1) On first startup, Apache will not fork any new children beyond its StartServers + MinSpareServers setting. Sometimes it does a single fork event and then never forks any new children after that.
2) When Apache shuts down children via MaxRequestsPerChild to cycle through them, the children become zombies but still count as an idle slot. Eventually the zombies take up all of the slots and DoS the whole server.
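For reference, a zombie (`Z` / `<defunct>`) is a child that has exited but has not yet been reaped by its parent through `wait()`/`waitpid()`. The Linux-only Python 3 sketch below illustrates just that mechanism. It is not taken from Apache or systemctl.py, the helper names `proc_state` and `zombie_demo` are made up for illustration, and it makes no claim about which process fails to reap in this report.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The comm field is wrapped in parentheses and may itself contain spaces
    # or ')', so take the first field after the LAST closing parenthesis.
    return data[data.rfind(")") + 2:].split()[0]


def zombie_demo(timeout=2.0):
    """Fork a child that exits at once; return its state before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately
    # Parent: deliberately do NOT wait yet, so the exited child lingers as a zombie.
    deadline = time.monotonic() + timeout
    before = proc_state(pid)
    while before != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        before = proc_state(pid)
    os.waitpid(pid, 0)  # reaping removes the entry from the process table
    after = proc_state(pid)
    return before, after
```

Calling `zombie_demo()` returns the child's state while it is unreaped and again after `waitpid()`; the point is that the entry only disappears once the parent collects it.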

Both problems are intermittent, which makes them really frustrating to isolate and debug. My best attempt at a reproduction is:

Dockerfile

# syntax=docker/dockerfile:1.3-labs
FROM centos:centos7
RUN yum install -y httpd
RUN curl https://raw.githubusercontent.com/gdraheim/docker-systemctl-replacement/master/files/docker/systemctl.py > /usr/bin/systemctl \
  && systemctl enable httpd

COPY --chmod=755 <<EOF /var/www/cgi-bin/sleeper.cgi
#!/bin/bash
/bin/sleep 0.2
echo Content-Type: text/plain
echo
echo Hello World
EOF

COPY <<EOF /etc/httpd/conf.d/extra-config.conf
ExtendedStatus on
<Location /server-status>
 SetHandler server-status
 Order allow,deny
 Deny from none
 Allow from all
</Location>

StartServers       2
MinSpareServers    5
MaxSpareServers 20
ServerLimit     2048
MaxClients      2048
MaxRequestWorkers 2048
MaxRequestsPerChild  10
EOF

# Default: this CMD recreates the issue
CMD ["/usr/bin/systemctl", "-vvv"]
# To see it work fine, comment out the CMD above and uncomment these two lines
#STOPSIGNAL SIGWINCH
#CMD ["/usr/sbin/httpd", "-DFOREGROUND"]

On the client side, this is what gave me the best chance of recreating the problem:

ab -n 1000000 -c 64 http://localhost:8081/cgi-bin/sleeper.cgi

No keep-alive requests; just hammer on it right after startup. It also happens, more slowly, at a concurrency of 8, where it takes a few minutes before the zombies build up and DoS the server.

After things are locked up, the process table looks like this:

  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:00 /usr/bin/python2 /usr/bin/systemctl -vvv
    8 ?        Ss     0:00 /usr/sbin/httpd -DFOREGROUND
  625 ?        Z      0:00 [httpd] <defunct>
 1808 ?        Z      0:00 [httpd] <defunct>
 1809 ?        Z      0:00 [httpd] <defunct>
 1811 ?        Z      0:00 [httpd] <defunct>
 1814 ?        Z      0:00 [httpd] <defunct>
 1815 ?        Z      0:00 [httpd] <defunct>
 1821 ?        Z      0:00 [httpd] <defunct>
 1822 ?        Z      0:00 [httpd] <defunct>
 1823 ?        Z      0:00 [httpd] <defunct>
 1828 ?        Z      0:00 [httpd] <defunct>
 1832 ?        Z      0:00 [httpd] <defunct>
 1836 ?        Z      0:00 [httpd] <defunct>
 1840 ?        Z      0:00 [httpd] <defunct>
 1842 ?        Z      0:00 [httpd] <defunct>
. . . 
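The listing above does not show which process is the parent of each zombie, which is what determines who is expected to reap it. `ps -eo pid,ppid,stat,cmd` adds that column; as an alternative, here is a rough Python 3 sketch that groups zombies by parent PID by reading `/proc` directly. It assumes a Linux `/proc` and a Python 3 interpreter (the container in this report runs systemctl under python2, so this would have to be run from wherever Python 3 is available). The function names are hypothetical.

```python
import os
from collections import Counter


def parse_stat(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state, ppid)."""
    lpar = line.index("(")
    rpar = line.rindex(")")  # comm may contain spaces or parentheses
    pid = int(line[:lpar])
    comm = line[lpar + 1:rpar]
    rest = line[rpar + 1:].split()
    return pid, comm, rest[0], int(rest[1])


def zombies_by_parent(proc_root="/proc"):
    """Count zombie processes per parent PID by scanning proc_root."""
    counts = Counter()
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return counts  # no procfs available here
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                _pid, _comm, state, ppid = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process vanished or the line was not parseable
        if state == "Z":
            counts[ppid] += 1
    return counts
```

Iterating over `zombies_by_parent().most_common()` gives the parents holding the most unreaped children first.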

And if you can catch the server-status page in time, it looks something like this:

[image: screenshot of the server-status page]
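Since catching the page by hand is hit or miss, one option is to poll the machine-readable variant, `/server-status?auto`, which mod_status serves as plain `Key: value` lines including a `Scoreboard` line. The Python 3 sketch below parses that text and tallies the scoreboard characters. The default URL combines the port from the `ab` command with the `<Location /server-status>` from the config above and is an assumption about how the container is published; the function names are hypothetical.

```python
from collections import Counter
from urllib.request import urlopen

# Scoreboard key as documented by mod_status.
SCOREBOARD_KEYS = {
    "_": "waiting for connection",
    "S": "starting up",
    "R": "reading request",
    "W": "sending reply",
    "K": "keepalive (read)",
    "D": "DNS lookup",
    "C": "closing connection",
    "L": "logging",
    "G": "gracefully finishing",
    "I": "idle cleanup of worker",
    ".": "open slot with no current process",
}


def fetch_auto_status(url="http://localhost:8081/server-status?auto", timeout=5):
    """Download the machine-readable status text (URL is an assumption)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii", errors="replace")


def parse_auto_status(text):
    """Turn 'Key: value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value.strip()
    return fields


def scoreboard_counts(text):
    """Count how many slots are in each scoreboard state."""
    return Counter(parse_auto_status(text).get("Scoreboard", ""))
```

Logging `scoreboard_counts(fetch_auto_status())` once a second during the `ab` run would give a record of how the slots change over time, even if the server stops responding shortly afterwards.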

I've been testing this for around a week now, and have gone through many permutations. Nothing has worked so far, but some of my tests at least delayed the inevitable for a while.

Some of the things I've tried:

I'm out of ideas on what to try next. Watching the children die via strace suggests it has something to do with waiting on file descriptors to close, but it's difficult to get more information out of a zombie process.

Any advice or help would be appreciated on this!