aio-libs / aiohttp

Asynchronous HTTP client/server framework for asyncio and Python
https://docs.aiohttp.org
Other
15.13k stars 2.02k forks source link

Technical CancelledError bubbling up where it should not #8175

Closed dmoklaf closed 1 month ago

dmoklaf commented 8 months ago

Describe the bug

I am using a large-scale crawler based on aiohttp client. Some old web servers break connections randomly when under load (hence this bug is difficult to reproduce). When they do, in rare cases, the aiohttp client emits a CancelledError. Investigating the code, I found where this is coming from: https://github.com/aio-libs/aiohttp/blob/006fbe03fede4eaa1eeba7b8393cbf4d63cb44b6/aiohttp/client_reqrep.py#L1020 https://github.com/aio-libs/aiohttp/blob/006fbe03fede4eaa1eeba7b8393cbf4d63cb44b6/aiohttp/client_reqrep.py#L1037

These methods do not suppress the CancelledError that may be raised internally by the _writer (a task wrapping the write_bytes method). This task emits a CancelledError in very rare cases, when it is cancelled by the aiohttp internal machinery before having been truly started by the asyncio scheduler. In this case, none of the internal CancelledError try...except mechanisms it has are called, and the CancelledError, entirely technical to serve the internal aiohttp logic, bubbles up to the library caller.

To Reproduce

Difficult to reproduce (a large scale crawler is necessary to reach this error)

Expected behavior

The CancelledError, being purely internal, shall not bubble up to the library caller. It shall be intercepted in the 2 methods above.

To do so, a robust CancelledError suppression mechanism, as the one posted in #8174, can be used.

However, I am not familiar with the internals of the aiohttp libray, I am not sure if both methods need to be patched, one only or none (different fix needed).

Logs/tracebacks

None

Python Version

$ python --version
Python 3.11.6

aiohttp Version

$ python -m pip show aiohttp
Name: aiohttp
Version: 3.9.3
Summary: Async http client/server framework (asyncio)
Home-page: https://github.com/aio-libs/aiohttp
Author: 
Author-email: 
License: Apache 2
Location: /opt/homebrew/Caskroom/miniforge/base/envs/work/lib/python3.11/site-packages
Requires: aiosignal, attrs, frozenlist, multidict, yarl
Required-by: aiohttp-jinja2, asyncpraw, asyncprawcore, datasets, httpstan, pystan

multidict Version

$ python -m pip show multidict
Name: multidict
Version: 6.0.5
Summary: multidict implementation
Home-page: https://github.com/aio-libs/multidict
Author: Andrew Svetlov
Author-email: andrew.svetlov@gmail.com
License: Apache 2
Location: /opt/homebrew/Caskroom/miniforge/base/envs/work/lib/python3.11/site-packages
Requires: 
Required-by: aiohttp, yarl

yarl Version

$ python -m pip show yarl
Name: yarl
Version: 1.9.4
Summary: Yet another URL library
Home-page: https://github.com/aio-libs/yarl
Author: Andrew Svetlov
Author-email: andrew.svetlov@gmail.com
License: Apache-2.0
Location: /opt/homebrew/Caskroom/miniforge/base/envs/work/lib/python3.11/site-packages
Requires: idna, multidict
Required-by: aiohttp, asyncprawcore

OS

MacOS Sonoma 14.2.1

Related component

Client

Additional context

No response

Code of Conduct

bdraco commented 8 months ago

However, I am not familiar with the internals of the aiohttp libray, I am not sure if both methods need to be patched, one only or none (different fix needed).

Its likely both since the cancellation of the writer shouldn't be leaking upward

dmoklaf commented 8 months ago

It looks like this refactoring would also allow to simplify the code of the write_bytes method underlying the _writer task: https://github.com/aio-libs/aiohttp/blob/006fbe03fede4eaa1eeba7b8393cbf4d63cb44b6/aiohttp/client_reqrep.py#L557

This method catches CancelledError which would be useless with the new mechanism described above. Also, this method seems to catch other errors, but not completely. In some cases, it exposes itself to new exception being raised, e.g. in

https://github.com/aio-libs/aiohttp/blob/006fbe03fede4eaa1eeba7b8393cbf4d63cb44b6/aiohttp/client_reqrep.py#L584

and

https://github.com/aio-libs/aiohttp/blob/006fbe03fede4eaa1eeba7b8393cbf4d63cb44b6/aiohttp/client_reqrep.py#L594

So there seems to be an unconsistent mix of behaviors regarding exceptions handling/raising. But maybe connection management (ie detecting broken connections and bubbling up these I/O errors) make it necessary.

Reading the entire class, it looks like there is a lot of complexity around the _writer task to work around the various cases. Maybe all this _writer code could be simplified by using a clean single callback connected to the task, together with a Future, to handle properly the various task exit cases.

I am definitely not versed enough in this repository to guess what would be the correct change to perform to align all this on a single, simple behavior.