Open ldkv opened 8 months ago
I found the root cause and how to reproduce the issue reliably.
My application is connected to a WireGuard VPN tunnel and receives requests from remote servers over the VPN. If the MTU of the WireGuard interface does not match the MTU of the remote server's network, the response from gunicorn is never received by the client.
However, gunicorn keeps trying to send data on the socket and is effectively stuck, even after the client has already timed out.
My suggestion is to set the socket timeout to the configured worker timeout, instead of leaving it without any timeout as it is now.
This makes sense since a socket operation should never exceed the worker timeout anyway.
What do you think?
Hi,
I encountered this issue while running a Django server behind gunicorn. It serves a number of remote servers whose Docker services occasionally send requests to gunicorn.
Bug description
On some remote servers, the connection always times out once the response exceeds a certain size (1 KB or more, for example). When this happens, the gunicorn worker gets stuck in the `sock.sendall(data)` loop until it times out (for the sync worker), or forever (for other workers).

Here is the stack trace with the sync worker:
When I switch to `gthread`, the thread is stuck for around 952 seconds consistently with the following error (same stack trace):

The cause
After some research and testing, I can narrow down the cause of this bug as a combination of 2 separate problems:
Docker network problem: I can't pinpoint the exact cause, since I have thousands of remote servers with supposedly identical configurations, and only 5 of them trigger this bug. My best guess is a combination of the VPN and a corrupted Docker network.
gunicorn socket handling: there is no timeout on socket operations, which can lead to a deadlock in specific cases such as this one, where the client has already timed out but gunicorn keeps trying to send the response.
I can reproduce this bug consistently by sending requests from the 5 servers mentioned above, but honestly I don't know how to reproduce it on other systems.
Anyway, regardless of the root cause of the first problem, the second problem (the deadlock on the gunicorn socket) is real and should be dealt with. If anyone figures out a way to trigger it consistently, they could time out or block all workers easily.
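The symptom itself (not the VPN/MTU cause) can be approximated locally with a peer that never reads. The sketch below is only an illustration under that assumption: `send_to_silent_peer` is a hypothetical helper, and a local `socket.socketpair()` stands in for the real client connection. Once the kernel send buffer is full, `sendall` blocks; with a timeout set it raises `socket.timeout` instead of waiting indefinitely.

```python
import socket


def send_to_silent_peer(payload_size, timeout):
    """Send payload_size bytes to a peer that never reads.

    Returns True if sendall() hit the socket timeout, False if all
    data was accepted by the kernel send buffer.
    """
    sender, silent_peer = socket.socketpair()
    try:
        # With timeout=None this call could block forever once the
        # send buffer fills up, which is the situation described above.
        sender.settimeout(timeout)
        try:
            sender.sendall(b"x" * payload_size)
        except socket.timeout:
            return True
        return False
    finally:
        sender.close()
        silent_peer.close()


if __name__ == "__main__":
    # A large payload cannot fit in the send buffer of an unread socket.
    print("large payload timed out:", send_to_silent_peer(16 * 1024 * 1024, 0.2))
    # A tiny payload fits in the buffer, so sendall() returns immediately.
    print("small payload timed out:", send_to_silent_peer(16, 0.2))
```

This also matches the "certain threshold" observation: small responses fit into the send buffer and return immediately, while larger ones need the peer to acknowledge data before the call can finish.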
Workaround
To deal with this problem on my systems, where I use `gthread`, I decided to monkey patch the `util.write` method:

This sets the socket timeout temporarily for the duration of this method and resets it to the original value at the end. This way, there is zero impact on the `sock` object outside of this method.

It works well, and the worker thread always times out correctly instead of getting stuck.
Proposed long term solution
For all workers other than sync, I suggest imposing the timeout setting on the socket object with `settimeout`. This makes sense since socket operations should never exceed the worker timeout anyway.

If that is not feasible due to other constraints, we should be able to call the `util.write` method with `timeout` as an additional parameter.

I can make a PR to address this issue. What do you think?
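As a rough illustration of the first option, the timeout could be applied once, right after the connection is accepted. This is only a sketch with plain sockets, not gunicorn's actual accept path; `accept_with_timeout` and `worker_timeout` are hypothetical names.

```python
import socket


def accept_with_timeout(listener, worker_timeout):
    """Accept one connection and bound all its blocking operations."""
    conn, addr = listener.accept()
    # From here on, send/sendall/recv on conn raise socket.timeout
    # instead of blocking for longer than worker_timeout seconds.
    conn.settimeout(worker_timeout)
    return conn, addr
```

One trade-off to keep in mind: `settimeout` applies to every blocking operation on the socket, including `recv`, so an idle keep-alive connection would also be subject to it. That may be one of the constraints that makes the per-call `timeout` parameter on `util.write` the safer variant.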