Open · AaronOpfer opened 2 years ago
Is this still reproducible? I'm not able to reproduce it in the current release.
I no longer have access to the source repository where I reproduced this originally, and in any case, with two years of no response, the workaround was implemented a long time ago.
Did you reproduce the bug on the package versions I originally quoted?
I noticed that aiohttp actually calls sendto twice, once with the response headers and again with the response body.

At least with the current version, the two sendto calls should only happen if the message size is larger than 2**14.
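As an illustration of the behavior being described, a hypothetical sketch of that kind of write-coalescing logic (this is not aiohttp's actual code; the function name and transport API are assumptions):

# Hypothetical sketch of the coalescing behavior described above;
# NOT aiohttp's actual implementation.
LIMIT = 2 ** 14  # the threshold mentioned above (16 KiB)

def write_response(transport, headers: bytes, body: bytes) -> None:
    if len(headers) + len(body) < LIMIT:
        # Small responses: headers and body leave in a single write,
        # so strace shows one sendto call.
        transport.write(headers + body)
    else:
        # Large responses: two separate writes, hence two sendto calls.
        transport.write(headers)
        transport.write(body)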
That's nice to hear, although I would underscore here that the problem was not in the aiohttp server behavior but in the aiohttp client behavior, where it stopped watching the HTTP socket for reading when it really should not have. It seems like an edge condition around buffer management, and so I was highlighting that there were subtle differences in the buffer management of the aiohttp and tornado servers. However, both of those servers are still compliant with HTTP.
Describe the bug
A websocket connection will hang forever instead of opening successfully in some unknown circumstances, seemingly related to stream management.
Under these circumstances, strace -e trace=sendto,epoll_ctl shows that aiohttp removes an active socket from the event loop and then reuses that socket for the websocket connection.

To Reproduce
Use Tornado 6.1 to create the following server.
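The original server code is not shown in this rendering; a minimal sketch of what such a server might look like (the handler names, the /ws route, and the 20 KiB index body size are assumptions):

import tornado.ioloop
import tornado.web
import tornado.websocket


class IndexHandler(tornado.web.RequestHandler):
    def get(self):
        # A sizable index body; tweak this size if the bug will not reproduce.
        self.write(b"x" * 20000)


class EchoWebSocket(tornado.websocket.WebSocketHandler):
    def on_message(self, message):
        self.write_message(message)


def make_app():
    return tornado.web.Application([
        (r"/", IndexHandler),
        (r"/ws", EchoWebSocket),
    ])


if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()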
Then, use the following client to attempt to connect to it using aiohttp 3.8.1.
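The original client code is likewise not shown; a hypothetical sketch of it (the URL, port, and retry-loop shape are assumptions; the loop retries until the server answers, matching the note below about starting the client before the server):

import asyncio
import aiohttp


async def main():
    async with aiohttp.ClientSession() as session:
        # Keep requesting the index until the server responds.
        while True:
            try:
                async with session.get("http://localhost:8888/") as resp:
                    # Workaround described later in this issue: draining the
                    # body here appears to avoid the hang.
                    # await resp.read()
                    pass
                break
            except aiohttp.ClientConnectorError:
                await asyncio.sleep(1)

        # When the bug triggers, this handshake hangs and eventually
        # raises a TimeoutError instead of completing.
        async with session.ws_connect("ws://localhost:8888/ws", timeout=10) as ws:
            print("Websocket connection established")
            await ws.close()


asyncio.run(main())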
For some reason, reproducing the problem is pseudo-random. The while loop in the client was kept intact because reproduction seems to happen much more often when the client is started first and the server is started afterward; it's not clear why this is. I recommend running
while true; do python repro_client.py; sleep 1; done
and waiting for a reproduction. Feel free to start and stop the server to your heart's content during this time to see whether it triggers the problem.

You'll know the problem occurred when you see output that looks like the following: no new "Websocket connection established" line, and a TimeoutError instead.
In addition, if you run strace -e trace=connect,epoll_ctl,sendto python repro_client.py when the bug occurs, you'll see the following output (at the bottom). A hang at this point implies that FD 7 should not have been removed from the epoll FD.
Expected behavior
I expect the index download to be followed by a complete websocket handshake each time without any timeouts.
Logs/tracebacks
client-side
aiohttp Version
3.8.1
multidict Version
yarl Version
OS
Related component
Client
Additional context
I detected this in a test suite where a web application was launched, requests were sent to it continuously until its index page came up, and then a websocket was opened. In some circumstances, this connection would hang forever. It appears to have to do with filling just enough of the client's read buffer to trigger some kind of watermarking behavior. As such, the reproduction steps seem to be highly sensitive to the environment. If there is trouble reproducing, I would recommend tweaking the response size of the "Index" request; mine is currently mimicking the real-world scenario where I detected this problem.
If I use aiohttp as the server software, reproduction of the problem appears impossible. I noticed that aiohttp actually calls sendto twice, once with the response headers and again with the response body, and as such it may be triggering substantially different buffer management code in the client than the Tornado server's response, which is sent with just one sendto call.

aiohttp server, NOT producing the bug:

Tornado server, reproducing the bug:
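For comparison, a minimal aiohttp server in the spirit of the Tornado sketch above might look like the following (a hypothetical sketch, not the issue's original code; the routes and body size carry over the same assumptions):

from aiohttp import web


async def index(request):
    # Same assumed 20 KiB index body as in the Tornado sketch.
    return web.Response(body=b"x" * 20000)


async def ws_handler(request):
    ws = web.WebSocketResponse()
    await ws.prepare(request)
    async for msg in ws:
        await ws.send_str(msg.data)
    return ws


app = web.Application()
app.add_routes([web.get("/", index), web.get("/ws", ws_handler)])

if __name__ == "__main__":
    web.run_app(app, port=8888)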
I find that, if I make my HTTP request for the index also fully download the body from the HTTP server (await resp.read()) instead of simply letting the context manager lapse, then the problem seems not to trigger. This code can be found commented out in the client repro code. This is quite surprising and doesn't seem like something clients should be forced to do, especially since control flow could change in some exception circumstances, and developers expect context managers to take care of this sort of issue.
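A minimal sketch of that workaround in isolation (URLs assumed as in the earlier sketches):

import asyncio
import aiohttp


async def fetch_index_then_ws():
    async with aiohttp.ClientSession() as session:
        async with session.get("http://localhost:8888/") as resp:
            # Fully consume the body before the connection is released back
            # to the pool; with this line present, the hang no longer occurs.
            await resp.read()
        async with session.ws_connect("ws://localhost:8888/ws") as ws:
            print("Websocket connection established")


asyncio.run(fetch_index_then_ws())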