Rare bug - gunicorn worker stuck during TCP socket sendall

ldkv commented 8 months ago

Hi,

I ran into this issue while running a Django server under gunicorn. The server handles occasional requests from Docker services running on remote servers.

Bug description

On some remote servers, the connection always times out once the response exceeds a certain size (1 KB or more, for example). When this happens, the gunicorn worker gets stuck in the sock.sendall(data) call until the worker timeout fires (sync worker), or indefinitely (other worker types).

Here is the stack trace with the sync worker:

[2024-02-16 00:19:18 +0000] [448] [ERROR] Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/sync.py", line 184, in handle_request
    resp.write(item)
  File "/usr/local/lib/python3.11/site-packages/gunicorn/http/wsgi.py", line 346, in write
    util.write(self.sock, arg, self.chunked)
  File "/usr/local/lib/python3.11/site-packages/gunicorn/util.py", line 299, in write
    sock.sendall(data)
  File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/base.py", line 202, in handle_abort
    self.cfg.worker_abort(self)
  File "/app/config/gunicorn_configs.py", line 58, in worker_abort
    raise Exception(f"Gunicorn worker aborted: {worker}")

When I switch to gthread, the thread is consistently stuck for around 952 seconds and then fails with the following error (same stack trace):

TimeoutError: [Errno 110] Connection timed out
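
The roughly 952 seconds is close to what the Linux TCP stack would produce on its own, assuming the server runs Linux with the default net.ipv4.tcp_retries2 = 15. The kernel documentation gives 924.6 seconds as a hypothetical lower bound for that setting, after which the pending send fails with ETIMEDOUT (errno 110). Below is a minimal sketch of that arithmetic, assuming a 200 ms minimum and a 120 s maximum retransmission timeout; the function name is illustrative and not part of gunicorn or the kernel.

```python
def hypothetical_tcp_timeout(retries=15, rto_min=0.2, rto_max=120.0):
    """Sum the exponentially backed-off retransmission timeouts.

    The timeout starts at rto_min and doubles up to rto_max. There is one
    wait after the original transmission plus one after each retry,
    hence retries + 1 terms.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += min(rto, rto_max)
        rto *= 2
    return total


print(round(hypothetical_tcp_timeout(), 1))  # 924.6
```

The real timeout depends on the measured round-trip time, so a somewhat larger observed value is consistent with this lower bound.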

The cause

After some research and testing, I narrowed the cause of this bug down to a combination of two separate problems:

1. Docker network problem: I can't pinpoint the exact cause, since I have thousands of remote servers with supposedly identical configurations and only 5 of them trigger this bug. My best guess is a combination of the VPN and a corrupted Docker network.
2. gunicorn socket handling: there is no timeout on socket operations, which can leave a worker blocked indefinitely (effectively a deadlock) in cases like this one, where the client has already timed out but gunicorn keeps trying to send the response.
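
The second problem can be illustrated locally without the network fault. This is only an analogue, not a reproduction of the original bug: it assumes a peer that simply never reads, created with socket.socketpair(), and a payload larger than the kernel socket buffers. The function name is made up for the example.

```python
import socket


def send_with_no_reader(timeout, payload_size=16 * 1024 * 1024):
    """Try to send a payload to a peer that never reads.

    Returns "timed out" if sendall gives up, "sent" otherwise.
    """
    sender, receiver = socket.socketpair()
    try:
        sender.settimeout(timeout)
        try:
            sender.sendall(b"x" * payload_size)
        except socket.timeout:  # alias of TimeoutError on Python 3.10+
            return "timed out"
        return "sent"
    finally:
        sender.close()
        receiver.close()


print(send_with_no_reader(timeout=0.5))  # timed out
```

With timeout=None, the same call would block until the peer reads or the process is killed, which is the situation the worker ends up in.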

I can reproduce this bug consistently by sending requests from the 5 servers mentioned above, but honestly I don't know how to reproduce it on other systems.

Regardless of the root cause of the first problem, the second problem, where a gunicorn worker blocks on the socket, is real and should be dealt with. Anyone who figures out how to trigger it reliably could easily time out or block all workers.

Workaround

To deal with this problem on my systems, where I use gthread, I decided to monkey-patch the util.write function:

from gunicorn import util

SOCKET_TIMEOUT = 10  # seconds


def patch_gunicorn_util_write(sock, data, chunked=False):
    # Apply a temporary timeout so the send cannot block indefinitely.
    original_timeout = sock.gettimeout()
    sock.settimeout(SOCKET_TIMEOUT)
    try:
        if chunked:
            return util.write_chunk(sock, data)
        sock.sendall(data)
    finally:
        # Runs on success and on error, so the original timeout is always restored.
        sock.settimeout(original_timeout)


# Replace gunicorn's write helper with the patched version.
util.write = patch_gunicorn_util_write

This sets the socket timeout temporarily for the duration of the call and resets it to the original value at the end, so there is no impact on the sock object outside this function.

It works well: the worker thread now always times out correctly instead of getting stuck.

Proposed long term solution

For all workers other than sync, I suggest applying the worker timeout setting to the socket object with settimeout. This makes sense, since a socket operation should never exceed the worker timeout anyway.

If that is not feasible due to other constraints, it should at least be possible to call util.write with a timeout as an additional parameter.
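
As a rough sketch of the second option, a write helper that accepts an optional timeout could look like the following. This is a hypothetical standalone function, not gunicorn's current API, and it leaves out chunked encoding for brevity.

```python
import socket


def write_with_timeout(sock, data, timeout=None):
    """Send all of data, bounding the whole operation by timeout seconds.

    timeout=None leaves the socket's existing timeout untouched.
    """
    if timeout is None:
        sock.sendall(data)
        return
    original_timeout = sock.gettimeout()
    sock.settimeout(timeout)
    try:
        # Since Python 3.5 the socket timeout covers the total duration
        # of sendall, not each individual send.
        sock.sendall(data)
    finally:
        sock.settimeout(original_timeout)


# Usage: a small payload goes through and the timeout is restored afterwards.
a, b = socket.socketpair()
write_with_timeout(a, b"hello", timeout=1.0)
print(b.recv(5))       # b'hello'
print(a.gettimeout())  # None
a.close()
b.close()
```

Passing the timeout per call keeps the default behavior unchanged for callers that do not opt in.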

I can make a PR to address this issue. What do you think?

ldkv commented 7 months ago

I found the root cause and how to reproduce the issue reliably.

My application is connected to a WireGuard VPN tunnel and receives requests from remote servers through the VPN. If the MTU of the WireGuard interface does not match that of the remote server's network, the response from gunicorn never reaches the client.

However, gunicorn keeps trying to send data on the socket and is effectively stuck, even after the client has timed out.
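
A plausible reading of the size threshold from the original report is that a small response fits in a single packet below the smallest MTU on the path, while a larger one is sent as full-size segments that the smaller-MTU hop cannot carry. Here is a rough sketch of the packet-size arithmetic, assuming WireGuard's usual per-packet overhead and IPv4/TCP headers without options; the function names are illustrative.

```python
WIREGUARD_FRAMING = 32  # type (4) + receiver index (4) + counter (8) + auth tag (16)
UDP_HEADER = 8


def tunnel_mtu(link_mtu, outer_ipv6=False):
    """Largest inner packet that fits in one outer packet of link_mtu bytes."""
    outer_ip_header = 40 if outer_ipv6 else 20
    return link_mtu - outer_ip_header - UDP_HEADER - WIREGUARD_FRAMING


def max_tcp_payload(mtu):
    """TCP payload per packet for IPv4 with no IP or TCP options."""
    return mtu - 20 - 20


print(tunnel_mtu(1500, outer_ipv6=True))  # 1420
print(tunnel_mtu(1500))                   # 1440
print(max_tcp_payload(1420))              # 1380
```

If one side assumes a larger MTU than the path supports, any response bigger than the smaller payload limit is at risk, which would match small responses succeeding and larger ones hanging.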

ldkv commented 7 months ago

My suggestion is to set the socket timeout to the configured worker timeout, instead of having no timeout at all as is the case now.

It makes sense since a socket operation should never exceed the worker timeout anyway.

What do you think?
