javiertejero opened this issue 5 months ago
@JorisOnGithub could you please confirm that this patch does not break the improvements needed to solve #2917?
PS: I couldn't reproduce the issues you described, but I don't have WebView2, and I couldn't reproduce with Microsoft Edge for macOS.
Hi @benoitc and team - First off, thank you all for what you do! I tried to contact security@gunicorn.org about this issue, but the email bounced.
Since this issue is already public, it seemed appropriate to post here instead. This issue is a denial-of-service (DoS) vulnerability, although it is not labeled as such in the original description.
While the default of 1000 for `worker_connections` should mean that deployments with reasonable rate limits in upstream reverse proxies/WAFs are protected, any deployment with a per-IP rate limit above 1000 would be trivial to exploit. Additionally, an effective DDoS using a relatively small number of IP addresses is plausible.
I see that gunicorn has not historically used GitHub's security advisory feature. I just wanted to highlight that it makes it very easy to request a CVE in case you are not familiar with this feature.
I've debugged the issue, and I think the smallest diff for fixing it (while maintaining the functionality of #2917) is:
```diff
diff --git a/gunicorn/workers/gthread.py b/gunicorn/workers/gthread.py
index c9c42345..9c1a92ad 100644
--- a/gunicorn/workers/gthread.py
+++ b/gunicorn/workers/gthread.py
@@ -119,6 +119,9 @@ class ThreadWorker(base.Worker):
     def accept(self, server, listener):
         try:
+            if self.nr_conns >= self.worker_connections:
+                return
+
             sock, client = listener.accept()
             # initialize the connection object
             conn = TConn(self.cfg, sock, client, server)
@@ -218,6 +221,10 @@ class ThreadWorker(base.Worker):
                                       return_when=futures.FIRST_COMPLETED)
             else:
                 # wait for a request to finish
+                events = self.poller.select(0.0)
+                for key, _ in events:
+                    callback = key.data
+                    callback(key.fileobj)
                 result = futures.wait(self.futures, timeout=1.0,
                                       return_when=futures.FIRST_COMPLETED)
```
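For readers unfamiliar with gthread's event-queue mechanics, here is a minimal sketch (not gunicorn code; the socket setup and handler name are mine) of the selector/callback pattern the patch relies on: each socket is registered with its handler stored in `key.data`, and `select(0.0)` polls for ready sockets without blocking.

```python
import selectors
import socket

# Stand-ins for gthread's poller and sockets (illustrative only).
poller = selectors.DefaultSelector()
server_side, client_side = socket.socketpair()
received = []

def on_readable(sock):
    # In gthread this role is played by self.accept or a connection handler.
    received.append(sock.recv(16))

# The callback travels in key.data, exactly as the patched loop expects.
poller.register(server_side, selectors.EVENT_READ, on_readable)
client_side.sendall(b"ping")

# select(0.0) returns immediately with whatever is ready right now.
for key, _events in poller.select(0.0):
    callback = key.data
    callback(key.fileobj)

server_side.close()
client_side.close()
```

This is why the four added lines in the second hunk are enough to drain pending work: polling with a zero timeout dispatches any already-ready connections without ever blocking the loop.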
For those interested, I think the root cause of the bug is the following:

1. The worker idles in `run()`, specifically in `self.poller.select()`, waiting for a connection to arrive to us.
2. `self.poller.select()` returns our `self.accept()` function, and we enter it and accept the connection. The resulting socket returned from `accept()` is added to the event queue.
3. `self.nr_conns` hits the limit (`self.worker_connections`).
4. Back in `run()`: since we've hit our connection limit, the code avoids executing `self.poller.select()`, since we shouldn't accept any new connections. Unfortunately, since our accept()ed sockets are also on the event queue, we won't process their requests either! Oops...
5. `futures.wait(self.futures)` with `self.futures == []` returns immediately, and then it's back to the beginning of the loop again, turning it into a busy loop.

About the patch: by running `self.poller.select()` even when we want to avoid accepting connections, we always serve our already-accepted connections, thereby fixing the bug. However, we need an extra `if` statement in `self.accept()` to protect us from accepting too many connections.
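The starvation described above can be seen in a toy model (a deliberately simplified sketch of the loop's control flow, not gunicorn's actual code): once the limit is hit, the branch that would drain the event queue is never taken, and waiting on an empty futures list gives the loop nothing to block on.

```python
# Toy model of the buggy loop (simplified; names and values are illustrative).
worker_connections = 1
nr_conns = 1                 # the connection limit has already been reached
event_queue = ["conn-1"]     # an accepted socket with a pending request
futures = []                 # no request is currently executing
served = []
iterations = 0

for _ in range(1000):        # stand-in for "while self.alive"
    iterations += 1
    if nr_conns < worker_connections:
        # Only this branch polls the event queue -- and it is skipped,
        # so conn-1's request is never dispatched.
        while event_queue:
            served.append(event_queue.pop())
    # futures.wait([]) would return immediately here: nothing blocks,
    # nothing progresses, and the loop just spins.

# After 1000 iterations the accepted connection is still unserved.
```

The patch breaks this cycle by polling the event queue unconditionally, so the pending connection is dispatched on the next iteration regardless of the limit.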
I had some other clean-ups in mind which would fix this bug, but I'm just posting this minimal version early on in case @benoitc would like to incorporate it.
@benoitc I've just experienced the same issue (gunicorn gthread deadlock) when trying to upgrade 20.1.0 -> 22.0.0. Currently using the in-place fix suggested in https://github.com/benoitc/gunicorn/pull/3147
Problem
On receiving multiple connections at the same time I observe a deadlock that causes gunicorn worker to completely freeze forever (the worker timeout does not kill the frozen worker).
This happens when the number of `worker_connections` is smaller than the number of concurrent connections received.

Affected versions:
How to reproduce
Run a gunicorn app with gthread worker and these options:
Then run apache bench with enough concurrency (higher than 4):
```
ab -n1000 -c100 http://0.0.0.0:8080/
```
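If Apache Bench isn't available, a rough equivalent (helper names and shape are mine, not from the issue) is to hold open more simultaneous TCP connections than `worker_connections`:

```python
import socket

def open_connections(host, port, count, timeout=5.0):
    """Open `count` TCP connections to (host, port) and keep them alive."""
    socks = []
    for _ in range(count):
        socks.append(socket.create_connection((host, port), timeout=timeout))
    return socks

def close_connections(socks):
    for s in socks:
        s.close()
```

Holding open, say, 100 connections against the server described above (more than the configured `worker_connections`) and then issuing requests on them should exhibit the same freeze.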
I am opening a PR soon with the fix and the script to reproduce this easily (UPDATE: PR #3147).
Proposed solution
Partial revert of #2918