Open w0rmr1d3r opened 2 years ago
Initial investigation of the exception not retrieved: The task is created here: https://github.com/aio-libs/aiohttp/blob/master/aiohttp/web_protocol.py#L305 Then it looks like it should get awaited at: https://github.com/aio-libs/aiohttp/blob/master/aiohttp/web_protocol.py#L288
If an exception is raised, then .done()
would be True
, so the await
doesn't happen. My initial thought is that there should not be a .done()
check on that code.
Not investigated anything about the assertion error though.
I've also just recently seen the exact error described by @w0rmr1d3r pop up, seemingly at random. Will report back if we're able to get any more clarity or fixes on our end. Thanks everyone for looking :)
Same issue after updating to v3.8.3
Same issue after updating to v3.8.3
Indeed @doublex . This started appearing in our services when upgrading to v3.8.3 It doesn't happen with v3.8.1
There aren't too many changes (if you ignore formatting/typing changes), so if anyone can take a look and figure out which change is the issue, that'd be great.
These 2 removed lines are the only ones that stand out to me, so maybe try adding them back in first: https://github.com/aio-libs/aiohttp/compare/v3.8.1...v3.8.3#diff-2126b277e07e3fdd05e7a81da456accf24e5515a46c78c48a79db4eb166043e4L302-L303
Hi @Dreamsorcerer ,
Thank you for providing the diff between the affected versions.
I'm reading through the code and trying to understand. But you will have more context and knowledge on how this library works. My guessings:
File web_protocol.py on line 505, will always pass the handler as None. Since it's set on line 472 to be the self._task_handler. Which on line 305 is set as None, and the change on the highlighted lines never cancels the task_handler.
Still, I lack complete context of the code, just writing what I see.
Then, in the web_app.py file, line 437 -> 454, protocol will be a RequestHandler, but is received as None, since as seen in this line: https://github.com/aio-libs/aiohttp/blob/8cf01adc8c8dbf706e4cd33bf89fd5195f638715/aiohttp/web_protocol.py#L51 it can be "asyncio.Task[None]".
I guess, that is setting it to None, somehow?
Again, just tried to read through the code, but have 0 context on how the library works.
My guesses might be completely wrong, apologies for that.
File web_protocol.py on line 505, will always pass the handler as None. Since it's set on line 472 to be the self._task_handler.
Line 473 asserts it to be not None, so that's not possible.
My suggestion was to try adding back in the lines 302-303 that were deleted and see if the problem is still reproducible. If it is, then maybe checkout the code at different commits until you find the commit that introduces the problem (you could use git bisect
for this). Also remember to do this on the 3.9 branch, not master.
Hello @Dreamsorcerer , Sorry for my late response.
So, what you mean is, adding back those lines in the next release and see if the issue is fixed? I understand that to be version 3.9.0? And if those lines make the issue go away, re-review the commits on branch 3.9?
Just wanted to clarify and get right the steps on this.
Thank you for providing an answer and maintain this project 🚀
Yes, just clone the 3.9 branch, add those lines in, then test if the issue is fixed (also check the issue is reproducible before adding the line in). If so, make a PR and we'll look at releasing it in 3.9.
Hello @Dreamsorcerer ,
I haven't had time to put effort on this.
We received the dependabot bump PR for 3.8.4
and saw the same error.
Hopefully in the future I might be able to put some effort on this to debug it or something.
3.8.4 only added a moderate severity fix. If you can figure out this issue, we'll put it into the 3.9 release.
Just here to confirm that we can see this issue happening in python 3.9, 3.10 and 3.11 with aiohttp 3.8.4
I can also confirm that this issue occurs in our project with aiohttp 3.8.3 and Python 3.11.
There aren't too many changes (if you ignore formatting/typing changes), so if anyone can take a look and figure out which change is the issue, that'd be great. These 2 removed lines are the only ones that stand out to me, so maybe try adding them back in first: https://github.com/aio-libs/aiohttp/compare/v3.8.1...v3.8.3#diff-2126b277e07e3fdd05e7a81da456accf24e5515a46c78c48a79db4eb166043e4L302-L303
@Dreamsorcerer I tried reverting these lines on top of v3.8.3 and this does not fix the issue. They were also the only lines standing out to me in the git history, so I'll need to take a deeper look.
Looking at the logic, it seems like this originates from the connection_lost
path: the transport is set to None early in BaseProtocol.connection_lost()
. I could not reproduce this issue locally though so no way to tell for sure. I didn't see anything out of the ordinary in the code, so maybe the issue is somewhere else, ex: some level of concurrency happening between connection_lost()
and the request handler?
My setup:
@Dreamsorcerer I tried reverting these lines on top of v3.8.3 and this does not fix the issue. They were also the only lines standing out to me in the git history, so I'll need to take a deeper look.
If you can run it from the git repo, then a git bisect
or something could help find the commit which introduced the issue.
@Dreamsorcerer I tried reverting these lines on top of v3.8.3 and this does not fix the issue. They were also the only lines standing out to me in the git history, so I'll need to take a deeper look.
If you can run it from the git repo, then a
git bisect
or something could help find the commit which introduced the issue.
I'd love to, but I have no idea how to build the package properly to test inside a Docker image (only solution I have to reproduce the issue at the moment). I've looked at the docs but I couldn't find a tip, I always end up with "aiohttp/_websocket.c: File not found". Any idea?
In the meantime, I realized I hadn't installed my patched version correctly. It appears that applying the patch below as suggested fixes the issue.
diff --git a/aiohttp/web_protocol.py b/aiohttp/web_protocol.py
index 10a96080..f103b167 100644
--- a/aiohttp/web_protocol.py
+++ b/aiohttp/web_protocol.py
@@ -299,6 +299,8 @@ class RequestHandler(BaseProtocol):
exc = ConnectionResetError("Connection lost")
self._current_request._cancel(exc)
+ if self._task_handler is not None:
+ self._task_handler.cancel()
if self._waiter is not None:
self._waiter.cancel()
Now, I saw in the commit history that this patch has already been applied, removed and reapplied (cf. the discussion here: https://github.com/aio-libs/aiohttp/issues/6719). What's the best course of action? I don't see cancellation exceptions in my setup, maybe the whole issue discussed in #6719 has been fixed in the meantime.
Now, I saw in the commit history that this patch has already been applied, removed and reapplied (cf. the discussion here: #6719). What's the best course of action?
OK, I've got a vague idea on what the cause may be now. Can you try removing the and not self._task_handler.done()
(not on a production server though, I expect it to break):
https://github.com/aio-libs/aiohttp/blob/master/aiohttp/web_protocol.py#L287
My theory is that on connection_lost on older releases (and optionally in 3.9) we cancel the handler. Now we are not cancelling, I suspect there may be an exception occurring on self.transport.close()
or something similar, because the connection was already closed:
https://github.com/aio-libs/aiohttp/blob/master/aiohttp/web_protocol.py#L610
So, I'm hoping that by making that change, instead of an exception was not retrieved warning, you'll get the actual exception and then we can catch that error in future.
This also suggests that we may be able to build an actual test, if anyone wants to give that a go? Hopefully, it just means we need an endpoint that takes a couple of seconds and then have a client disconnect prematurely to trigger the warning.
I've tried replicating this issue in a test, but it passes just fine on the 3.8 branch. Am I approaching this in a wrong way?
async def test_client_connection_lost(aiohttp_client: Any) -> None:
async def handler(_):
await asyncio.sleep(2)
return web.Response()
app = web.Application()
app.router.add_route("GET", "/", handler)
client = await aiohttp_client(app)
with pytest.raises(TimeoutError):
await asyncio.wait_for(
client.get("/"),
timeout=1,
)
await client.close()
I've set a breakpoint in BaseProtocol.connection_lost
and saw it being called twice, once for the client, then another time for the server.
I've never managed to reproduce it, which is why I keep asking others to test/debug.
I am currently facing this issue, I do not think it's easily reproduced in a test as you need to send some sort of client/server disconnection into the request/response handler.
Task exception was never retrieved
future: <Task finished name='Task-70887' coro=<RequestHandler.start() done, defined at /opt/venv/lib/python3.10/site-packages/aiohttp/web_protocol.py:462> exception=AssertionError()>
I will go down back to 3.8.1
.
If my theory is right, then we should be able to write a test. But, I'm still waiting on someone to confirm the cause by hopefully getting the traceback as described in https://github.com/aio-libs/aiohttp/issues/6978#issuecomment-1520627194
I have been pretty busy since April but I'll try to reproduce the issue in the weeks to come.
OK, I played around with the code, manually adding in exceptions etc. and I couldn't figure out any way to reproduce this. Hopefully, #7333 will help, but I don't think there's anything else I can do unless someone figures something else out.
Once #7333 is released I can definitely give it a try.
Still happening for v3.8.5 which includes #7333
Continuing the conversation that started here.
I think this ticket is the correct one for my case given that the traceback is the same here.
self._force_close, self._close
were always False
and nothing broke from doing this comment neither.Is there any way that I can add additional logs to test this in the current version of the library that I use in prod (3.8.5)? Maybe call some aiohttp method and print something upon any end point being hit?
Appreciate your help with trying to figure this out.
2. To test @Dreamsorcerer your suggestions, I need to deploy a custom version of aiohttp to the cloud (GAE), but whatever I tried I couldn't make GAE to use my modified version of the library and it either continued to use the version from PyPI or fail to run.
I'm not failiar with GAE, but can you just drop aiohttp into the project directory, where it loads your app from?
Is there any way that I can add additional logs to test this in the current version of the library that I use in prod (3.8.5)? Maybe call some aiohttp method and print something upon any end point being hit?
I can't think of an easy way. You could try and monkey patch the library, but you'd probably have to copy the entire function (e.g. .start() method), add the print in, and then assign it to the aiohttp class.
I'm not failiar with GAE, but can you just drop aiohttp into the project directory, where it loads your app from?
Or failing that, copy into the app's directory as a module, and then change the imports to from myapp import aiohttp
, if necessary (assuming that doesn't break aiohttp's imports).
Good news, I made it work. The issue was that I didn't change the path in the app's yml file:
entrypoint: gunicorn --timeout 60 -b :$PORT main:app_factory --worker-class vendor.aiohttp.GunicornUVLoopWebWorker
With it and with from vendor import aiohttp
changed in 50 files haha, it finally worked.
I deployed with both changes you proposed.
if self._force_close or self._close:
print(f"Line waiter.set_result(None): self._force_close = {self._force_close}, "
f"self._close = {self._close}")
Will keep posted when it happens.
So the Task exception was never retrieved future
continued happening without any prints or additional exceptions around it. You can see them below. Note that the trace has my custom version of aiohttp
with the changes that @Dreamsorcerer proposed.
python.exc_info: "Traceback (most recent call last):
File "/workspace/vendor/aiohttp/web_protocol.py", line 512, in start
request = self._request_factory(message, payload, self, writer, handler)
File "/workspace/vendor/aiohttp/web_app.py", line 446, in _make_request
return _cls(
File "/workspace/vendor/aiohttp/web_request.py", line 811, in __init__
super().__init__(*args, **kwargs)
File "/workspace/vendor/aiohttp/web_request.py", line 188, in __init__
assert transport is not None
AssertionError
"
python.filename: "web_request.py"
python.funcName: "__init__"
python.lineno: 189
python.logger_name: "asyncio"
python.module: "web_request"
python.pid: 15
python.thread: "MainThread"
python.exc_info: "Traceback (most recent call last):
File "/workspace/vendor/aiohttp/web_protocol.py", line 512, in start
request = self._request_factory(message, payload, self, writer, handler)
File "/workspace/vendor/aiohttp/web_app.py", line 446, in _make_request
return _cls(
File "/workspace/vendor/aiohttp/web_request.py", line 811, in __init__
super().__init__(*args, **kwargs)
File "/workspace/vendor/aiohttp/web_request.py", line 188, in __init__
assert transport is not None
AssertionError
"
python.filename: "adapters.py"
python.funcName: "__init__"
python.lineno: 75
python.logger_name: "asyncio"
python.module: "adapters"
python.pid: 15
python.thread: "MainThread"
python.exc_info: "Traceback (most recent call last):
File "/workspace/vendor/aiohttp/web_protocol.py", line 512, in start
request = self._request_factory(message, payload, self, writer, handler)
File "/workspace/vendor/aiohttp/web_app.py", line 446, in _make_request
return _cls(
File "/workspace/vendor/aiohttp/web_request.py", line 811, in __init__
super().__init__(*args, **kwargs)
File "/workspace/vendor/aiohttp/web_request.py", line 188, in __init__
assert transport is not None
AssertionError
"
python.filename: "poolmanager.py"
python.funcName: "__init__"
python.lineno: 174
python.logger_name: "asyncio"
python.module: "poolmanager"
python.pid: 14
python.thread: "MainThread"
This line below got printed a few times:
if self._force_close or self._close:
print(f"Line - request = self._request_factory(message, payload, self, writer, handler): "
f"self._force_close = {self._force_close}, self._close = {self._close}")
Result of the print every time:
Line - request = self._request_factory(message, payload, self, writer, handler): self._force_close = True, self._close = False
But I think that happened every time when there was another issue: 2023/08/04 09:00:09 [error] 16#16: *195 upstream prematurely closed connection while reading response header from upstream, client: ***, server: _, request: "GET /v3/user/parameters HTTP/1.1", upstream: "http://127.0.0.1:8081/v3/user/parameters", host: "****.appspot.com"
.
That's some other issue that I haven't figured out yet, but I don't think it's related to aiohttp
.
This line below got printed a few times:
That's right before the self._request_factory(...)
, right?
Did you add a print at waiter.set_result(...)
as well? (I'm guessing, if you did, it didn't print anything because you added the if
round it, in which case it'd confirm my thought that a race condition is occurring).
Yes, this never got printed:
if messages and waiter is not None and not waiter.done():
# don't set result twice
if self._force_close or self._close:
print(f"Line waiter.set_result(None): self._force_close = {self._force_close}, "
f"self._close = {self._close}")
waiter.set_result(None)
this got printed ~10 times in different places from Task exception was never retrieved
if self._force_close or self._close:
print(f"Line - request = self._request_factory(message, payload, self, writer, handler): "
f"self._force_close = {self._force_close}, self._close = {self._close}")
request = self._request_factory(message, payload, self, writer, handler)
and every time self._force_close = True, self._close = False
Yep, so seems like it's definitely some kind of race condition. Not sure on the exact solution, will look at it later, but I suspect we can probably just add another check in the latter place. I'm a little short on time, so would be great if someone can take a look. A test would be good too, hopefully it's possible to write a test that closes it as soon as the waiter is set, and before the second bit of code runs (probably, waiter.add_done_callback(...)
would work).
Hi, any chance we can get a fix on this?
Information is above if you want to try and create a test and fix.
I've just tested it in 3.9.0 and it's still present.
Well, that's because nobody has fixed it yet and why the issue is still open...
There have been changes to the transport code so it could have been fixed. I'd love to help debug it, but I haven't been able to reproduce locally, just in production, so it's quite hard for us to debug using custom versions.
I still haven't had time to look, but what I mentioned before might be possible to reproduce in a test. Last sentence: https://github.com/aio-libs/aiohttp/issues/6978#issuecomment-1666026186
Refamiliarising myself with the code, I think we need the test to get into the .start()
method until it is waiting at:
https://github.com/aio-libs/aiohttp/blob/122597fce2cf9a1d4c0b2ebaa0a8be567f19816c/aiohttp/web_protocol.py#L515
Then we can do something like r._waiter.add_done_callback(lambda f: r.force_close())
(or maybe shutdown() or connection_lost()).
At worst, that can probably be done by subclassing and changing the property setter to add another add_done_callback().
The objective is to get it to resume running in .start() and reach this code while _force_close == True
:
https://github.com/aio-libs/aiohttp/blob/122597fce2cf9a1d4c0b2ebaa0a8be567f19816c/aiohttp/web_protocol.py#L534
I'm still not clear how it's possible to get into this situation, but I can force it with a test like this:
async def test_race_condition_exception(aiohttp_client: Any) -> None:
async def handler(request):
data = await request.text()
return web.Response(text=data)
orig_data_received = RequestHandler.data_received
def data_received(self, data: bytes) -> None:
self.force_close()
orig_data_received(self, data)
with mock.patch.object(RequestHandler, "data_received", data_received):
app = web.Application()
app.router.add_get("/", handler)
client = await aiohttp_client(app)
size = 1000000
with pytest.raises(ServerDisconnectedError):
async with client.get("/", data=io.BytesIO(b"1" * size)) as resp:
assert resp.status == 200
However, the exception or warning is not actually being logged, so there's no indication that it's happened. From adding print statements, I know it's hitting that assertion error though...
Describe the bug
We bumped from v3.8.1 to v3.8.3 this morning and started seeing the following errors. No other changes were done to our project:
And
To Reproduce
No other changes introduced from our side, those errors started appearing.
Expected behavior
For that exception to be handled.
Logs/tracebacks
And
Python Version
aiohttp Version
multidict Version
yarl Version
OS
Docker image
Related component
Server
Additional context
I have added as much context as I had, if I left out anything relevant, happy to look for it and add it to the issue.
Code of Conduct