This MR is a follow-up to https://github.com/benoitc/gunicorn/pull/2938.
Why this MR: so that we can use the latest release while still meeting our requirement of load balancing across workers.
As previously noted in issues https://github.com/benoitc/gunicorn/issues/2100, https://github.com/benoitc/gunicorn/issues/1984, https://github.com/benoitc/gunicorn/issues/2465, https://github.com/benoitc/gunicorn/issues/2095, https://github.com/benoitc/gunicorn/issues/2192, and https://github.com/benoitc/gunicorn/issues/2168, Gunicorn doesn't balance requests evenly across worker processes, even with reuse_port enabled.
An aside to really spell out what the problem is:
As @tilgovi noted in https://github.com/benoitc/gunicorn/pull/2101 and https://github.com/benoitc/gunicorn/pull/2102, the root of the problem is that when the workers fork, the socket data structure from the master process is reused for each worker. Fundamentally, this means SO_REUSEPORT doesn't work as designed, since there's really just one socket object being bound to the address.
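To illustrate the distinction, here is a minimal sketch (not Gunicorn's code) of what SO_REUSEPORT expects: each listener is a separately created and separately bound socket on the same address. The helper name `make_reuseport_listener` is hypothetical, and the example assumes a platform that exposes `socket.SO_REUSEPORT`, such as Linux.

```python
import socket


def make_reuseport_listener(host="127.0.0.1", port=0):
    """Create an independent listening socket with SO_REUSEPORT set.

    Hypothetical helper for illustration; assumes socket.SO_REUSEPORT exists.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option must be set before bind() on every socket sharing the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock


# First listener picks a free port; the second binds to the very same port.
first = make_reuseport_listener()
port = first.getsockname()[1]
second = make_reuseport_listener(port=port)

same_port = first.getsockname()[1] == second.getsockname()[1]
distinct_fds = first.fileno() != second.fileno()

first.close()
second.close()
```

With two distinct sockets bound this way, the kernel can distribute incoming connections between them. A socket that is merely inherited across fork is still a single kernel socket, so there is nothing for SO_REUSEPORT to balance.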
We tried https://github.com/benoitc/gunicorn/pull/2101 as a fix for this, but soon ran into a variety of bizarre crashes including “socket operation on non-socket” and “Bad file descriptor”. After quite a bit of debugging, we found a use-after-free in Gunicorn's gevent worker: https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/ggevent.py#L41-L45.
In this code, the original objects in self.sockets (before line 45) are eventually garbage collected and thus closed. That closes the underlying file descriptor while the new socket objects still refer to it, leading to "Bad file descriptor". If the file descriptor number has been reused for something else in the meantime, it can lead to "socket operation on non-socket". If you call socket.detach as in this patch, the original object gives up ownership of the file descriptor, so the socket stays open in the process and it doesn't matter if the original object gets garbage collected.
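The failure mode can be reproduced outside Gunicorn with plain sockets. The sketch below is an illustration under stated assumptions, not the worker's actual code: `rewrap_sharing_fd` and `rewrap_with_detach` are hypothetical helpers, and an explicit `close()` stands in for garbage collection of the original object.

```python
import errno
import socket


def new_bound_socket():
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    return sock


def rewrap_sharing_fd(sock):
    # Two Python objects now both believe they own the same descriptor.
    return socket.socket(sock.family, sock.type, fileno=sock.fileno())


def rewrap_with_detach(sock):
    # detach() hands the descriptor over; the old object will never close it.
    return socket.socket(sock.family, sock.type, fileno=sock.detach())


# Unsafe: closing the original (as garbage collection would) kills the wrapper.
original = new_bound_socket()
wrapper = rewrap_sharing_fd(original)
original.close()
try:
    wrapper.getsockname()
    unsafe_errno = None
except OSError as exc:
    unsafe_errno = exc.errno
wrapper.detach()  # drop the stale descriptor so it is not closed twice

# Safe: after detach(), closing the original leaves the wrapper usable.
original = new_bound_socket()
wrapper = rewrap_with_detach(original)
original.close()
safe_host = wrapper.getsockname()[0]
wrapper.close()
```

In the unsafe case `unsafe_errno` is `errno.EBADF`, which matches the "Bad file descriptor" crash described above. In the detach case the wrapper keeps working after the original is closed.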
I ran some experiments to quantify the effects of this patch - here are some results.
All these processes are running gunicorn --workers=8 -k gevent --reuse-port test:app.
I collate the number of requests handled by each PID: