Websocket Request can hang due to fd being removed from event loop

AaronOpfer commented 2 years ago

Describe the bug

A websocket connection will hang forever instead of opening successfully in some unknown circumstances, seemingly related to stream management.

Under these circumstances, strace -e trace=sendto,epoll_ctl shows that aiohttp removes an active socket from the event loop and then reuses that socket for the websocket connection.

sendto(7, "GET / HTTP/1.1\r\nHost: localhost:"..., 124, 0, NULL, 0) = 124
epoll_ctl(4, EPOLL_CTL_DEL, 7, 0x7ffc10132a4c) = 0
Downloaded index page successfully
sendto(7, "GET /w HTTP/1.1\r\nHost: localhost"..., 238, 0, NULL, 0) = 238
[ .. then it hangs .. ]

To Reproduce

Use Tornado 6.1 to create the following server.

import tornado.web, tornado.ioloop, tornado.websocket
import asyncio

class Index(tornado.web.RequestHandler):
    def get(self):
        self.write("A" * 164500)

class WebSocket(tornado.websocket.WebSocketHandler):
    def on_connect(self):
        asyncio.get_event_loop().call_later(3, self.close)

    def on_message(self, msg):
        pass

def main():
    app = tornado.web.Application(
        [
            ("/", Index),
            ("/w", WebSocket),
        ]
    )
    app.listen(12345)
    tornado.ioloop.IOLoop.current().start()

if __name__ == "__main__":
    main()

Then, use the following client to attempt to connect to it using aiohttp 3.8.1.

import aiohttp
import asyncio

async def do_the_test():
    async with aiohttp.ClientSession() as session:
        for count in range(1000):
            try:
                async with session.get("http://localhost:12345/") as index_resp:
                    index_resp.raise_for_status()
                    # await index_resp.read()  # REQUIRED! or WS connect will hang
                print("Downloaded index page successfully")

                ws = await session.ws_connect("ws://localhost:12345/w")
                print("Websocket connection established")
                await ws.close()
                return
            except aiohttp.ClientConnectorError:
                await asyncio.sleep(1)
                if count < 4:
                    print("wait...")
                if count == 5:
                    print("start repro_server.py now")

async def amain():
    await asyncio.wait_for(do_the_test(), timeout=30)

def main():
    import asyncio

    asyncio.get_event_loop().run_until_complete(amain())

if __name__ == "__main__":
    main()

For some reason, reproducing the problem is psuedo-random. The while loop in the client was kept intact because it seems like reproduction happens much more often when the client is started first and the server is started afterward. It's not clear why this is. I recommend doing while true; python repro_client.py; sleep 1; done and to wait for reproduction. Feel free to start and stop the server to your content during this time to see if it causes reproduction of the problem.

You'll know the problem occurred when you see output that looks like:

Downloaded index page successfully

without any new output "Websocket connection established", and instead a TimeoutError.

In addition, if you do strace -e trace=connect,epoll_ctl,sendto python repro_client.py, when the bug occurs, you'll see the following output (at the bottom)

sendto(7, "GET / HTTP/1.1\r\nHost: localhost:"..., 124, 0, NULL, 0) = 124
epoll_ctl(4, EPOLL_CTL_DEL, 7, 0x7ffc10132a4c) = 0
Downloaded index page successfully
sendto(7, "GET /w HTTP/1.1\r\nHost: localhost"..., 238, 0, NULL, 0) = 238

A hang at this point implies that the FD 7 should have not been removed from the epoll FD.

Expected behavior

I expect the index download to be followed by a complete websocket handshake each time without any timeouts.

Logs/tracebacks

server-side

$ strace -e trace=sendto,accept python repro_server_tornado.py
sendto(6, {{len=20, type=RTM_GETADDR, flags=NLM_F_REQUEST|NLM_F_DUMP, seq=1644472784, pid=0}, {ifa_family=AF_UNSPEC, ...}}, 20, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 20
sendto(9, "\0", 1, 0, NULL, 0)          = 1
sendto(10, "HTTP/1.1 200 OK\r\nServer: Tornado"..., 161197, 0, NULL, 0) = 161197
sendto(10, "HTTP/1.1 101 Switching Protocols"..., 193, 0, NULL, 0) = 193

client-side

$ strace -e trace=connect,sendto,epoll_ctl python repro_client.py
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=18158, si_uid=71820, si_status=0, si_utime=0, si_stime=0} ---
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=18164, si_uid=71820, si_status=0, si_utime=0, si_stime=0} ---
epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=5, u64=94493575479301}}) = 0
connect(7, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLOUT, {u32=7, u64=94493575479303}}) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 7, 0x7fffd1fcf9dc) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=94493575479303}}) = 0
sendto(7, "GET / HTTP/1.1\r\nHost: localhost:"..., 124, 0, NULL, 0) = 124
epoll_ctl(4, EPOLL_CTL_DEL, 7, 0x7fffd1fcef8c) = 0
Downloaded index page successfully
sendto(7, "GET /w HTTP/1.1\r\nHost: localhost"..., 238, 0, NULL, 0) = 238
Traceback (most recent call last):
  File "repro_client.py", line 33, in <module>
    main()
  File "repro_client.py", line 29, in main
    asyncio.get_event_loop().run_until_complete(amain())
  File "<CENSORED>/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "repro_client.py", line 23, in amain
    await asyncio.wait_for(do_the_test(), timeout=30)
  File "<CENSORED>/python3.6/asyncio/tasks.py", line 362, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError
+++ exited with 1 +++


### Python Version

```console
$ python --version
Python 3.6.8 :: Anaconda, Inc.

aiohttp Version

$ python -m pip show aiohttp
documentation for alternative uses
  import imp
Name: aiohttp
Version: 3.8.1
Summary: Async http client/server framework (asyncio)
Home-page: https://github.com/aio-libs/aiohttp
Author:
Author-email:
License: Apache 2
Location: <CENSORED>
Requires: aiosignal, async-timeout, asynctest, attrs, charset-normalizer, frozenlist, idna-ssl, multidict, typing-extensions, yarl
Required-by: aiohttp-security, aiohttp-session, aioinflux, jupyter-server-proxy, kubernetes-asyncio

multidict Version

$ python -m pip show multidict
Name: multidict
Version: 5.2.0
Summary: multidict implementation
Home-page: https://github.com/aio-libs/multidict
Author: Andrew Svetlov
Author-email: andrew.svetlov@gmail.com
License: Apache 2
Location: <CENSORED>
Requires:
Required-by: aiohttp, yarl

yarl Version

$ python -m pip show yarl
Name: yarl
Version: 1.7.2
Summary: Yet another URL library
Home-page: https://github.com/aio-libs/yarl/
Author: Andrew Svetlov
Author-email: andrew.svetlov@gmail.com
License: Apache 2
Location: <CENSORED>
Requires: idna, multidict, typing-extensions
Required-by: aiohttp

OS

$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 7.9 (Maipo)
Release:        7.9
Codename:       Maipo

$ uname -a
Linux XXXXXXXXXX 3.10.0-1160.53.1.el7.x86_64 #1 SMP Thu Dec 16 10:19:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Related component

Client

Additional context

I detected this in a test suite where a web application was launched, it would be have requests sent continuously until its index page came up, and then a websocket would be opened. In some circumstances, this connection would hang forever. It appears to have to do with filling just enough of the client's read buffer to trigger some kind of watermarking behavior. As such, reproduction steps seem to be highly sensitive to the environment. If there is trouble reproducing, I would recommend tweaking the response size of the "Index" request. Mine is currently mimicking the real-world scenario where I detected this problem.

If I use aiohttp as the server software, reproduction of the problem appears impossible. I noticed that aiohttp actually calls sendto twice, once with the response headers and again with the response body, and as such it may be triggering substantially different buffer management code in the client than the tornado server's response, which responds with just one sendto call.

aiohttp server, NOT producing the bug

sendto(8, "HTTP/1.1 200 OK\r\nContent-Type: t"..., 154, 0, NULL, 0) = 154
sendto(8, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 165000, 0, NULL, 0) = 165000
sendto(8, "HTTP/1.1 101 Switching Protocols"..., 240, 0, NULL, 0) = 240

Tornado server. reproducing the bug

sendto(10, "HTTP/1.1 200 OK\r\nServer: Tornado"..., 161197, 0, NULL, 0) = 161197
sendto(10, "HTTP/1.1 101 Switching Protocols"..., 193, 0, NULL, 0) = 193

I find that, if I make my HTTP request to download the index also fully download from the HTTP server (await resp.read()) instead of simply letting the context manager lapse, then the problem seems to not trigger. This code can be found commented out in the client repro code. This is quite surprising and doesn't seem like something that clients should be forced to do, especially since control flow could change in some exception circumstances and developers expect context managers to take care of this sort of issue.

Code of Conduct

[X] I agree to follow the aio-libs Code of Conduct

Dreamsorcerer commented 3 weeks ago

Is this still reproducible? I'm not able to reproduce in current release.

AaronOpfer commented 3 weeks ago

I no longer have access to the source repository where I reproduced this originally, and in any case, with two years of no response, the workaround was implemented a long time ago.

Did you reproduce the bug on the package versions I originally quoted?

bdraco commented 3 weeks ago

I noticed that aiohttp actually calls sendto twice, once with the response headers and again with the response body.

At least with the current version, the two sendto calls should only happen if the message size is larger than 2**14

AaronOpfer commented 3 weeks ago

I noticed that aiohttp actually calls sendto twice, once with the response headers and again with the response body.

At least with the current version, the two sendto calls should only happen if the message size is larger than 2**14

That's nice to hear, although I would underscore here that the problem was not in the aiohttp server behavior but in the aiohttp client behavior, where it stopped watching the HTTP socket for reading when it really should not have. It seems like an edge condition around buffer management, and so I was highlighting that there were subtle differences in the buffer management of the aiohttp and tornado servers. However, both of those servers are still being compliant with HTTP.

aio-libs / aiohttp