aio-libs / aiohttp

Asynchronous HTTP client/server framework for asyncio and Python
https://docs.aiohttp.org
Other
15.14k stars 2.02k forks source link

web_ws - Can't close tcp socket when receiving a close message from client. #8184

Closed spikefishjohn closed 7 months ago

spikefishjohn commented 8 months ago

Describe the bug

I'm working on a https://github.com/GNS3/gns3-server/issues/2320 for GNS3 that I think PR7978 lines up with. The GNS3 issue is that when a web socket client sends a close message the server never closes the tcp socket. I tried the patch above but it doesn't address the issue.

High level GNS3 is calling ws.receive() at which point aiohttp receive() the close message.

I believe the problem is because ws.receive is setting self._closing = True when it reiceves a close message. This then causes self.close() to return here which prevents if msg.type == WSMsgType.CLOSE: from being reached.

I'm almost thinking this should be removed but i'm not sure what the intent of that is so i'm unsure if that is the proper fix.

To Reproduce

I don't have a great way to reproduce this. I'm currently just using a fully installed GNS3 instance. I can work on making a reproduction now that I understand the issue.

Expected behavior

aiohttp.web_ws should close the socket when it receives a close message.

Logs/tracebacks

I'll add these later in the week. I wasn't prepared for this.

Python Version

$ python --version
Python 3.10.12

aiohttp Version

$ python -m pip show aiohttp
(gns3-server) root@ubuntu:~# source /usr/share/gns3/gns3-server/bin/activate
(gns3-server) root@ubuntu:~# python3 -m pip show aiohttp
Name: aiohttp
Version: 3.9.1
Summary: Async http client/server framework (asyncio)
Home-page: https://github.com/aio-libs/aiohttp
Author:
Author-email:
License: Apache 2
Location: /usr/share/gns3/gns3-server/lib/python3.10/site-packages
Requires: aiosignal, async-timeout, attrs, frozenlist, multidict, yarl
Required-by: aiohttp-cors, gns3-server
(gns3-server) root@ubuntu:~#

multidict Version

$ python -m pip show multidict
(gns3-server) root@ubuntu:~# python3 -m pip show multidict
Name: multidict
Version: 6.0.5
Summary: multidict implementation
Home-page: https://github.com/aio-libs/multidict
Author: Andrew Svetlov
Author-email: andrew.svetlov@gmail.com
License: Apache 2
Location: /usr/share/gns3/gns3-server/lib/python3.10/site-packages
Requires:
Required-by: aiohttp, yarl
(gns3-server) root@ubuntu:~#

yarl Version

$ python -m pip show yarl
(gns3-server) root@ubuntu:~# python3 -m pip show yarl
Name: yarl
Version: 1.9.4
Summary: Yet another URL library
Home-page: https://github.com/aio-libs/yarl
Author: Andrew Svetlov
Author-email: andrew.svetlov@gmail.com
License: Apache-2.0
Location: /usr/share/gns3/gns3-server/lib/python3.10/site-packages
Requires: idna, multidict
Required-by: aiohttp
(gns3-server) root@ubuntu:~#
### OS

(gns3-server) root@ubuntu:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy
(gns3-server) root@ubuntu:~#

Related component

Server

Additional context

No response

Code of Conduct

EDIT: Update PR request in description.

spikefishjohn commented 8 months ago

Not sure what I did to strike through the OS but that is all correct.

Dreamsorcerer commented 8 months ago

It does look a little odd, and seems a little obvious which makes me wonder why it hasn't been noticed before.

Do you think you could write a test in a PR to reproduce it? Then we could also try your suggestion of removing that check and seeing if any other tests break, which would be a good indication if it's the wrong change to make.

bdraco commented 8 months ago

I think the code is expecting that the client will close the connection, but we can't always rely on that even if they say they're going to close it, they may not.

spikefishjohn commented 8 months ago

yeah, I forked aiohttp last night. I'll work on a small replication. I did try removing the call. This results in the next read request in close() failing with timeout error and then the socket does close but with a .. uh.. 1006 message I think.

I also started looking out how to pass message to close() so I could add a condition for the read statement. Didn't get very far but it was kind of late.

Dreamsorcerer commented 8 months ago

https://datatracker.ietf.org/doc/html/rfc6455#section-7

Once an endpoint has both sent and received a Close control frame, that endpoint SHOULD Close the WebSocket Connection

So, this is what it is trying to achieve when close is initiated by us.

As such, when a server is instructed to Close the WebSocket Connection it SHOULD initiate a TCP Close immediately, and when a client is instructed to do the same, it SHOULD wait for a TCP Close from the server.

Except as indicated above or as specified by the application layer (e.g., a script using the WebSocket API), clients SHOULD NOT close the connection.

I think we just need to tweak that code so that the server does not wait to receive a close code (as it has already received one), but it should continue to close the transport. i.e. Rather than removing that check we want to add self._set_code_close_transport(...) in there (but, we might need to avoid calling it twice?).

spikefishjohn commented 8 months ago

so something like this instead of deleting? I looked through the code and it seems like _close_code is always set but i'm not sure if its safe to assume it is? Default is None.

        if self._closing:
            self._set_code_close_transport(self._close_code)
            return True
Dreamsorcerer commented 8 months ago

but i'm not sure if its safe to assume it is?

Not sure, I'd have to look more closely. If we get the test up first, then we can see how it works.

spikefishjohn commented 8 months ago

Boiling this down has been harder then I expected. So far I'm not recreating the issue. More research and foul language will be required.

bdraco commented 8 months ago

I'm pretty convinced there is a bug here where we don't close the transport if the client holds it open forever

bdraco commented 8 months ago

@spikefishjohn Can you give https://github.com/aio-libs/aiohttp/pull/8200 a shot? I'm running it on my production HA systems without any unexpected side effects

I'll come up with a test for it if it fixes your issue

spikefishjohn commented 8 months ago

@bdraco yeah i'll give it a test. I've been trying to reproduce the issue with a smaller code base but haven't been able to so far which has been pretty frustrating.

I'll add this in today and see if it fixes the issue. If not i'll add a pdb trace of the accept to show what is happening if that helps.

spikefishjohn commented 8 months ago

Ok that seems to fix the issue. Here is the client I'm using to talk to gns3.

import asyncio
import base64
import json
import websockets
import logging

logger = logging.getLogger('websockets')
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
log = logging.getLogger(__name__)

# GNS3 WebSocket server violates RFC 6455 so we have to be active closer
# Lets give Websockets a chance to get data.
WS_CLOSE_TIMEOUT = 10

RECONNECT_TIMEOUT = 1.618

CONTROLLER_WS_API = '/v2/notifications/ws'
COMPUTE_WS        = '/v2/notifications/ws'
SERVER            = '127.0.0.1:3080'
USER              = 'XXX'
PASS              = 'XXX'
CREDS             = f'{USER}:{PASS}'
ENCODED_CREDS     = base64.b64encode(CREDS.encode()).decode()
CONTROLLER_URI    = f'ws://{SERVER}{CONTROLLER_WS_API}'
COMPUTE_URI       = f'ws://{SERVER}{COMPUTE_WS}'

async def main() -> None:
    async with asyncio.TaskGroup() as tasks:
        tasks.create_task(websocket_logger(CONTROLLER_URI))

async def websocket_logger(endpoint: str) -> None:
    headers = {
        'Authorization': f'Basic {ENCODED_CREDS}'
    }
    try:
        async with websockets.connect(endpoint, close_timeout=WS_CLOSE_TIMEOUT, extra_headers=headers) as websocket:
            print("Call close")
            await websocket.close()
            print("close complete")
    except ConnectionRefusedError:
        log.info(f'Connection to {endpoint!r} refused.')
        await asyncio.sleep(RECONNECT_TIMEOUT)

if __name__ == '__main__':
    asyncio.run(main())

This is what the client now reports.

john@compute01:~$ python3.11 ws-client.py
2024-03-02 11:40:39,775 - DEBUG - = connection is CONNECTING
2024-03-02 11:40:39,775 - DEBUG - > GET /v2/notifications/ws HTTP/1.1
2024-03-02 11:40:39,775 - DEBUG - > Host: 127.0.0.1:3080
2024-03-02 11:40:39,775 - DEBUG - > Upgrade: websocket
2024-03-02 11:40:39,775 - DEBUG - > Connection: Upgrade
2024-03-02 11:40:39,775 - DEBUG - > Sec-WebSocket-Key: XXX
2024-03-02 11:40:39,775 - DEBUG - > Sec-WebSocket-Version: 13
2024-03-02 11:40:39,775 - DEBUG - > Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
2024-03-02 11:40:39,776 - DEBUG - > Authorization: Basic XXX
2024-03-02 11:40:39,776 - DEBUG - > User-Agent: Python/3.11 websockets/12.0
2024-03-02 11:40:39,779 - DEBUG - < HTTP/1.1 101 Switching Protocols
2024-03-02 11:40:39,779 - DEBUG - < Upgrade: websocket
2024-03-02 11:40:39,779 - DEBUG - < Connection: upgrade
2024-03-02 11:40:39,779 - DEBUG - < Sec-WebSocket-Accept: XXX
2024-03-02 11:40:39,779 - DEBUG - < Sec-WebSocket-Extensions: permessage-deflate
2024-03-02 11:40:39,779 - DEBUG - < Date: Sat, 02 Mar 2024 16:40:39 GMT
2024-03-02 11:40:39,779 - DEBUG - < Server: Python/3.10 aiohttp/3.9.3
2024-03-02 11:40:39,779 - DEBUG - = connection is OPEN
Call close
2024-03-02 11:40:39,779 - DEBUG - = connection is CLOSING
2024-03-02 11:40:39,779 - DEBUG - > CLOSE 1000 (OK) [2 bytes]
2024-03-02 11:40:39,780 - DEBUG - < TEXT '{"action": "ping", "event": {"cpu_usage_percent...y_usage_percent": 3.4}}' [84 bytes]
2024-03-02 11:40:39,781 - DEBUG - < CLOSE 1000 (OK) [2 bytes]
2024-03-02 11:40:39,781 - DEBUG - = connection is CLOSED
close complete
john@compute01:~$

I'll pass this on the original poster of the bug and have them test as well.

bdraco commented 8 months ago

I'll pass this on the original poster of the bug and have them test as well.

Thanks. Please keep us updated.

spikefishjohn commented 8 months ago

FYI the bug reporter indicated they won't be able to test for a week or so.

bdraco commented 8 months ago

@spikefishjohn Was the reporter able to test the linked PR? Thanks

spikefishjohn commented 8 months ago

@bdraco I'm asking for an update.

bdraco commented 8 months ago

Thanks!

spikefishjohn commented 8 months ago

@bdraco The original poster of the GNS3 issues has indicated they will not be able to test this and asked that the original GNS3 bug be closed. Your patch fixed the client I made above for GNS3. I think that is as much as a reply as this will get.

That being said if you think there is something else I can do to help by all means let me know.