benoitc / gunicorn

gunicorn 'Green Unicorn' is a WSGI HTTP Server for UNIX, fast clients and sleepy applications.
http://www.gunicorn.org

Gunicorn gthread deadlock #3146

Open javiertejero opened 5 months ago

javiertejero commented 5 months ago

Problem

On receiving multiple connections at the same time, I observe a deadlock that causes the gunicorn worker to freeze forever (the worker timeout does not kill the frozen worker).

This happens when the number of worker_connections is smaller than the number of concurrent connections received.

Affected versions:

How to reproduce

Run a gunicorn app with gthread worker and these options:

    options = {
        # ...
        'workers': 1,
        'threads': 3,
        'worker_connections': 4,
    }
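For completeness, a minimal WSGI app to run under these settings might look like this (a hypothetical `app.py`; the module and handler names are placeholders, not from the original report, and the command-line flags below are the CLI equivalent of the options dict above):

```python
# app.py -- hypothetical minimal WSGI application for reproducing the issue.
# Run it under the gthread worker with the settings from the report, e.g.:
#   gunicorn -k gthread -w 1 --threads 3 --worker-connections 4 \
#       -b 0.0.0.0:8080 app:app

def app(environ, start_response):
    """Trivial WSGI handler: always answers 200 OK with a short body."""
    body = b"Hello, world!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```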

Then run apache bench with enough concurrency (higher than 4): ab -n1000 -c100 http://0.0.0.0:8080/

I am opening a PR soon with the fix and the script to reproduce this easily (UPDATE: PR #3147).

Proposed solution

Partial revert of #2918

javiertejero commented 5 months ago

@JorisOnGithub could you please confirm that this patch does not break the improvements made to solve #2917?

PS: I couldn't reproduce the issues you described, but I don't have WebView2, and I couldn't reproduce them with Microsoft Edge for macOS either.

chris-griffin commented 5 months ago

Hi @benoitc and team - First off, thank you all for what you do! I tried to contact security@gunicorn.org about this issue, but the email bounced.


Since this issue is already public, it seemed appropriate to post here instead. This issue is a denial-of-service (DoS) vulnerability, although it is not labeled as such in the original description.

While the default worker_connections value of 1000 should hopefully mean that implementors with reasonable rate limits in upstream reverse proxies/WAFs are protected, for anyone allowing more than 1000 requests per IP address this would be trivial to exploit. Additionally, an effective DDoS with a relatively low number of IP addresses seems plausible.

I see that gunicorn has not historically used GitHub's security advisory feature. In case you are not familiar with it, I just wanted to highlight that it makes requesting a CVE very easy.

sylt commented 5 months ago

I've debugged the issue, and I think the smallest diff for fixing it (while maintaining the functionality of #2917) is:

diff --git a/gunicorn/workers/gthread.py b/gunicorn/workers/gthread.py
index c9c42345..9c1a92ad 100644
--- a/gunicorn/workers/gthread.py
+++ b/gunicorn/workers/gthread.py
@@ -119,6 +119,9 @@ class ThreadWorker(base.Worker):

     def accept(self, server, listener):
         try:
+            if self.nr_conns >= self.worker_connections:
+                return
+
             sock, client = listener.accept()
             # initialize the connection object
             conn = TConn(self.cfg, sock, client, server)
@@ -218,6 +221,10 @@ class ThreadWorker(base.Worker):
                                       return_when=futures.FIRST_COMPLETED)
             else:
                 # wait for a request to finish
+                events = self.poller.select(0.0)
+                for key, _ in events:
+                    callback = key.data
+                    callback(key.fileobj)
                 result = futures.wait(self.futures, timeout=1.0,
                                       return_when=futures.FIRST_COMPLETED)

For those interested, I think the root cause of the bug is the following:

  1. All is good. There are no connections. We're running in run(), specifically in self.poller.select(), waiting for a connection to arrive to us.
  2. All of a sudden, someone wants to connect. self.poller.select() returns our self.accept() callback, which we invoke to accept the connection. The resulting socket returned from accept() is added to the event queue.
  3. Step 2 is repeated until we can't accept any more connections (i.e., self.nr_conns hits the limit self.worker_connections).
  4. Now, it's time for another loop in run(). Since we've hit our connection limit, the code will avoid executing self.poller.select(), since we shouldn't accept any new connections. Unfortunately, since our accept()ed sockets are also on the event queue, we won't process their requests either! Oops...
  5. So we loop ad infinitum. Since the newly added sockets haven't gotten their initial handlers run, there won't be any futures to process either. And doing futures.wait(self.futures) with self.futures = [] returns immediately, and then it's back to the beginning again of the loop, turning the loop into a busy loop.
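Step 5's claim that futures.wait() on an empty collection returns immediately, ignoring the timeout, is easy to verify in isolation (a standalone sketch, independent of gunicorn):

```python
import time
from concurrent import futures

# futures.wait() with an empty collection does not block for the timeout:
# every future in the (empty) set is trivially done, so it returns
# (done=set(), not_done=set()) right away. This is what turns the worker's
# main loop into a busy loop once self.futures is empty.
start = time.monotonic()
done, not_done = futures.wait([], timeout=1.0,
                              return_when=futures.FIRST_COMPLETED)
elapsed = time.monotonic() - start

print(done, not_done)   # empty sets, returned in well under the 1.0s timeout
```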

About the patch: by running self.poller.select() even in the case when we want to avoid accepting connections, we'll always serve our already accepted connections, thereby fixing the bug. However, we need an extra if statement in self.accept() to protect us from accepting too many connections.
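The key idea of the fix, polling with a zero timeout so that ready sockets are serviced without blocking, can be illustrated with the stdlib selectors module (a standalone sketch, not gunicorn's actual code; the "handle_request" label stands in for the callback gunicorn registers):

```python
import selectors
import socket

# A socketpair stands in for an already-accepted client connection.
server_side, client_side = socket.socketpair()

sel = selectors.DefaultSelector()
# gunicorn registers a callback as the data; here it is just a label.
sel.register(server_side, selectors.EVENT_READ, data="handle_request")

# select(0.0) with nothing readable returns [] immediately instead of blocking,
# so it is safe to call on every loop iteration, even when at the connection limit.
assert sel.select(0.0) == []

# Once the peer writes, the same zero-timeout poll reports the socket as ready,
# so the main loop can dispatch its handler even while accept() is suppressed.
client_side.send(b"GET / HTTP/1.1\r\n\r\n")
events = sel.select(0.0)
for key, _mask in events:
    print(key.data)  # the label registered for this socket

sel.close()
server_side.close()
client_side.close()
```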

I had some other clean-ups in mind which would fix this bug, but I'm just posting this minimal version early on in case @benoitc would like to incorporate it.

staaam commented 2 weeks ago

@benoitc I've just experienced the same issue (gunicorn gthread deadlock) when trying to upgrade 20.1.0 -> 22.0.0. Currently using the in-place fix suggested in https://github.com/benoitc/gunicorn/pull/3147