Open spikefishjohn opened 5 months ago
I'm really starting to question if this is the reason the new console breaks. I will say that once it is in this loop, nothing ever deletes the connection that is stuck in a failed state.
ok I think I have this dialed in now. So many more debugs..
This shows a bad connection isn't being dealt with correctly.
2024-02-02 23:09:13 DEBUG telnet_server.py:284 Start async _get_reader
2024-02-02 23:09:13 DEBUG telnet_server.py:290 Start async _get_reader - self._reader_process is not None
2024-02-02 23:09:13 DEBUG telnet_server.py:291 self._reader_process == network_reader
2024-02-02 23:09:13 DEBUG telnet_server.py:292 <StreamReader exception=TimeoutError(110, 'Connection timed out') transport=<_SelectorSocketTransport closed fd=16>> == <StreamReader waiter=<Future pending cb=[Task.task_wakeup()] created at /usr/lib/python3.10/asyncio/base_events.py:429> transport=<_SelectorSocketTransport fd=13 read=polling write=<idle, bufsize=0>>>
2024-02-02 23:09:13 DEBUG telnet_server.py:298 _get_reader Returning None
As you can see self._reader_process has a StreamReader that has an exception thrown with a dead socket, but it hasn't been removed.
I'm going to add a wait_for on a drain call. If it fails after, say, 30 seconds, then we haven't been able to get data out to a low water mark on the socket and it's most likely dead. Then yank it from the connection dictionary and, in theory, this problem goes away.
Crossing fingers...
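For what it's worth, the drain-timeout idea could be sketched like this. This is a minimal sketch, not the actual gns3-server patch: `connections`, its dict shape, and the 30-second limit are all assumptions here.

```python
import asyncio

async def drain_or_drop(writer, connections, timeout=30):
    """Flush the writer, treating a stalled drain as a dead peer.

    `connections` is a hypothetical dict mapping StreamReaders to
    StreamWriters, standing in for the server's connection table.
    Returns True if the drain completed, False if the connection
    was judged dead and removed.
    """
    try:
        # drain() only returns once the transport buffer falls below
        # the low water mark; a peer that stopped ACKing hangs here,
        # so cap the wait instead of blocking the whole loop.
        await asyncio.wait_for(writer.drain(), timeout=timeout)
        return True
    except (asyncio.TimeoutError, ConnectionError):
        # Most likely a dead socket: close it and yank it from the
        # connection dictionary so it can't wedge _process again.
        writer.close()
        for reader, w in list(connections.items()):
            if w is writer:
                del connections[reader]
        return False
```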
Fix located here.
Thanks, I am going to check everything soon (need to catch up after holidays) 👍
Awesome, thanks for the fix, I hope we have finally squashed that annoying bug 👍
It's.. uh.. not.
@grossmj Can you re-open this while I yell into the void in an overly dramatic manner?
Here is the exception. I don't know how long it was hung on this call, but you can see it does time out after a while. My guess is 2 hours. Oh, I also feel like maybe the keepalives aren't always being set for some reason, but I have no evidence.
Knee jerk reaction is asyncio.wait_for() again. I'll put some time into it this weekend.
2024-03-01 02:36:54 ERROR base_events.py:1758 Task exception was never retrieved
future: <Task finished name='Task-1745' coro=<AsyncioTelnetServer.run() done, defined at /usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/utils/asyncio/telnet_server.py:188> exception=TimeoutError(110, 'Connection timed out') created at /usr/lib/python3.10/asyncio/streams.py:242>
source_traceback: Object created at (most recent call last):
File "/usr/bin/gns3server", line 8, in <module>
sys.exit(main())
File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/main.py", line 87, in main
run()
File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/run.py", line 267, in run
server.run()
File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/web/web_server.py", line 352, in run
self._loop.run_forever()
File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/usr/lib/python3.10/asyncio/base_events.py", line 1901, in _run_once
handle._run()
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/lib/python3.10/asyncio/streams.py", line 242, in connection_made
self._task = self._loop.create_task(res)
Traceback (most recent call last):
File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/utils/asyncio/telnet_server.py", line 207, in run
await self._process(network_reader, network_writer, connection)
File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/utils/asyncio/telnet_server.py", line 269, in _process
data = coro.result()
File "/usr/lib/python3.10/asyncio/streams.py", line 669, in read
await self._wait_for_data('read')
File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data
await self._waiter
File "/usr/lib/python3.10/asyncio/selector_events.py", line 862, in _read_ready__data_received
data = self._sock.recv(self.max_size)
TimeoutError: [Errno 110] Connection timed out
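If the wait_for route pans out, it might look roughly like the sketch below. Caveats: `READ_SIZE` and the timeout value are assumptions, and an idle-but-healthy console would also trip the timeout, so the caller would have to retry on None rather than immediately dropping the connection.

```python
import asyncio

READ_SIZE = 1024    # assumed chunk size, not the server's real value
READ_TIMEOUT = 60   # assumed; the point is just "not 2 hours"

async def read_with_timeout(reader, timeout=READ_TIMEOUT):
    """Read from a StreamReader without blocking forever.

    Returns the data (b'' at EOF), or None if nothing arrived within
    the timeout, so a dead socket can't hang the task for hours.
    """
    try:
        return await asyncio.wait_for(reader.read(READ_SIZE), timeout)
    except asyncio.TimeoutError:
        return None
```

Note that asyncio.wait_for cancels the inner read on timeout; since no data had arrived yet, retrying the read afterwards doesn't lose anything.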
@grossmj
Since there isn't really an open issue to track the original death of the telnet console I'm opening this. I've narrowed down where the issue is. I haven't figured out what to do about it just yet.
Here is a link to the telnet_server.py with debugs showing the issue. telnet_server.py
Here is the debug logs when the death loop happens.
So.. the interesting thing here is that _get_reader is always returning None. This then causes an infinite loop in _process.
As you can see based on the debugs above, _get_reader is making it all the way to that final if statement, which doesn't match and then returns None. I've looked at this a few times and I don't understand the intention of that compare.
BTW ignore the "__process reader_read is still 2nd None" log. I didn't have that in the right place and didn't feel like redoing everything.
Some other interesting things to point out: self._reader_process has a 0 byte buffer, but network_reader has a buffer of 65550. I'll have to poke around in pdb to see what I can find, but if I recall from looking at this, the socket on self._reader_process is technically closed, so it's impossible to write data to it. I think this is why the reader's buffer isn't getting drained. I'm not sure if calling a future on a dead socket is what is causing this.
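For the comparison that never matches: one thing that might help is that asyncio.StreamReader exposes the failure directly. StreamReader.exception() returns the exception that killed the stream (the TimeoutError(110, ...) visible in the repr in the debug logs), so _get_reader could notice a dead self._reader_process before ever comparing it against network_reader. A sketch of the check, not the actual gns3-server logic:

```python
import asyncio

def reader_is_dead(reader: asyncio.StreamReader) -> bool:
    """True once the stream can no longer produce data.

    exception() is set when the transport dies underneath the reader
    (e.g. the TimeoutError(110, 'Connection timed out') in the logs),
    and at_eof() covers a clean remote close.
    """
    return reader.exception() is not None or reader.at_eof()
```

With that, _get_reader could drop or replace self._reader_process when the check is true, instead of falling through to the compare and returning None forever.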
The way I'm personally triggering this is to set up a firewall rule and then delete a telnet console session while I'm running find / on 2 telnet consoles. The telnet consoles will wedge in 10 seconds or so.
I still plan on finishing up the proxy replacement, but it's not ready. I got to the point where I can start and stop a Qemu VM and the console works, but if I delete the devices I'm getting exceptions. It also doesn't support all the options the original telnet_server supports (such as passing in a reader/writer).