GNS3 / gns3-server

GNS3 server
GNU General Public License v3.0

Telnet console death issue #2344

Open spikefishjohn opened 5 months ago

spikefishjohn commented 5 months ago

@grossmj

Since there isn't really an open issue to track the original death of the telnet console, I'm opening this. I've narrowed down where the issue is. I haven't figured out what to do about it just yet.

Here is a link to telnet_server.py with debugs showing the issue: telnet_server.py

Here are the debug logs from when the death loop happens.

2024-01-30 23:13:41 DEBUG telnet_server.py:271 __process True loop
2024-01-30 23:13:41 DEBUG telnet_server.py:273 __process  reader_read is None
2024-01-30 23:13:41 DEBUG telnet_server.py:251 Start async _get_reader
2024-01-30 23:13:41 DEBUG telnet_server.py:256 self._reader_process == network_reader
2024-01-30 23:13:41 DEBUG telnet_server.py:257 <StreamReader waiter=<Future pending cb=[Task.task_wakeup()] created at /usr/lib/python3.10/asyncio/base_events.py:429> transport=<_SelectorSocketTransport fd=11 read=polling write=<idle, bufsize=0>>> == <StreamReader waiter=<Future pending cb=[Task.task_wakeup()] created at /usr/lib/python3.10/asyncio/base_events.py:429> transport=<_SelectorSocketTransport fd=13 read=polling write=<polling, bufsize=65550>>>
2024-01-30 23:13:41 DEBUG telnet_server.py:262 _get_reader Returning None
2024-01-30 23:13:41 DEBUG telnet_server.py:275 __process reader_read is still 2nd None
2024-01-30 23:13:41 DEBUG telnet_server.py:277 __process reader_read is still 3rd None
2024-01-30 23:13:42 DEBUG telnet_server.py:292 __process just before coro done check
2024-01-30 23:13:42 DEBUG telnet_server.py:293
2024-01-30 23:13:42 DEBUG telnet_server.py:294
2024-01-30 23:13:42 DEBUG telnet_server.py:295
2024-01-30 23:13:42 DEBUG telnet_server.py:271 __process True loop
2024-01-30 23:13:42 DEBUG telnet_server.py:273 __process  reader_read is None
2024-01-30 23:13:42 DEBUG telnet_server.py:251 Start async _get_reader
2024-01-30 23:13:42 DEBUG telnet_server.py:256 self._reader_process == network_reader
2024-01-30 23:13:42 DEBUG telnet_server.py:257 <StreamReader waiter=<Future pending cb=[Task.task_wakeup()] created at /usr/lib/python3.10/asyncio/base_events.py:429> transport=<_SelectorSocketTransport fd=11 read=polling write=<idle, bufsize=0>>> == <StreamReader waiter=<Future pending cb=[Task.task_wakeup()] created at /usr/lib/python3.10/asyncio/base_events.py:429> transport=<_SelectorSocketTransport fd=13 read=polling write=<polling, bufsize=65550>>>
2024-01-30 23:13:42 DEBUG telnet_server.py:262 _get_reader Returning None
2024-01-30 23:13:42 DEBUG telnet_server.py:275 __process reader_read is still 2nd None
2024-01-30 23:13:42 DEBUG telnet_server.py:277 __process reader_read is still 3rd None
2024-01-30 23:13:43 DEBUG telnet_server.py:292 __process just before coro done check
2024-01-30 23:13:43 DEBUG telnet_server.py:293
2024-01-30 23:13:43 DEBUG telnet_server.py:294
2024-01-30 23:13:43 DEBUG telnet_server.py:295
2024-01-30 23:13:43 DEBUG telnet_server.py:271 __process True loop
2024-01-30 23:13:43 DEBUG telnet_server.py:273 __process  reader_read is None
2024-01-30 23:13:43 DEBUG telnet_server.py:251 Start async _get_reader
2024-01-30 23:13:43 DEBUG telnet_server.py:256 self._reader_process == network_reader
2024-01-30 23:13:43 DEBUG telnet_server.py:257 <StreamReader waiter=<Future pending cb=[Task.task_wakeup()] created at /usr/lib/python3.10/asyncio/base_events.py:429> transport=<_SelectorSocketTransport fd=11 read=polling write=<idle, bufsize=0>>> == <StreamReader waiter=<Future pending cb=[Task.task_wakeup()] created at /usr/lib/python3.10/asyncio/base_events.py:429> transport=<_SelectorSocketTransport fd=13 read=polling write=<polling, bufsize=65550>>>
2024-01-30 23:13:43 DEBUG telnet_server.py:262 _get_reader Returning None
2024-01-30 23:13:43 DEBUG telnet_server.py:275 __process reader_read is still 2nd None
2024-01-30 23:13:43 DEBUG telnet_server.py:277 __process reader_read is still 3rd None
2024-01-30 23:13:44 DEBUG telnet_server.py:292 __process just before coro done check
2024-01-30 23:13:44 DEBUG telnet_server.py:293
2024-01-30 23:13:44 DEBUG telnet_server.py:294
2024-01-30 23:13:44 DEBUG telnet_server.py:295
etc etc etc

So.. the interesting thing here is that _get_reader is always returning None. This then causes an infinite loop in __process.

As you can see from the debugs above, _get_reader is making it all the way to that final if statement, which doesn't match and then returns None. I've looked at this a few times and I don't understand the intention of that comparison.
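To make the wedge easier to see, here is a toy model of what the logs show (illustrative stand-ins only, not the real __process/_get_reader code): as long as the reader lookup keeps answering None, the processing loop just spins once a second without ever reading anything.

import asyncio

async def get_reader_stub():
    # Stands in for _get_reader hitting "Returning None" (the line 262 debug).
    return None

async def process_stub():
    reader_read = None
    for _ in range(3):                      # the real loop is `while True`
        if reader_read is None:
            reader_read = await get_reader_stub()
        await asyncio.sleep(1)              # roughly the one-second cadence in the logs
        print("reader_read is still None, looping")

asyncio.run(process_stub())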

BTW ignore the "__process reader_read is still 2nd None" log. I didn't have that in the right place and didn't feel like redoing everything.

Some other interesting things to point out: self._reader_process has a 0 byte buffer, but network_reader has a buffer of 65550. I'll have to poke around in pdb to see what I can find, but if I recall from looking at this, the socket on self._reader_process is technically closed, so it's impossible to write data to it. I think this is why the reader's buffer isn't getting drained. I'm not sure if calling a future on a dead socket is what is causing this.

The way I'm personally triggering this is to set up a firewall rule and then delete a telnet console session while I'm running find / on 2 telnet consoles. The telnet consoles will wedge within 10 seconds or so.

I still plan on finishing up the proxy replacement, but it's not ready. I got to the point where I can start and stop a qemu VM and the console works, but if I delete the devices I'm getting exceptions. It also doesn't support all the options the original telnet_server supports (such as passing in a reader/writer).

spikefishjohn commented 5 months ago

I'm really starting to question if this is the reason the new console breaks. I will say that once it's in this loop, there is nothing to remove the connection that is in a failed state.

spikefishjohn commented 5 months ago

ok I think I have this dialed in now. So many more debugs..

This shows that a bad connection isn't being dealt with correctly.

2024-02-02 23:09:13 DEBUG telnet_server.py:284 Start async _get_reader
2024-02-02 23:09:13 DEBUG telnet_server.py:290 Start async _get_reader - self._reader_process is not None
2024-02-02 23:09:13 DEBUG telnet_server.py:291 self._reader_process == network_reader
2024-02-02 23:09:13 DEBUG telnet_server.py:292 <StreamReader exception=TimeoutError(110, 'Connection timed out') transport=<_SelectorSocketTransport closed fd=16>> == <StreamReader waiter=<Future pending cb=[Task.task_wakeup()] created at /usr/lib/python3.10/asyncio/base_events.py:429> transport=<_SelectorSocketTransport fd=13 read=polling write=<idle, bufsize=0>>>
2024-02-02 23:09:13 DEBUG telnet_server.py:298 _get_reader Returning None

As you can see, self._reader_process has a StreamReader with a thrown exception and a dead socket, but it hasn't been removed.

I'm going to add a wait_for on the drain call. If it fails after, say, 30 seconds, then we haven't been able to get data out down to the socket's low-water mark and it's most likely dead. Then yank it from the connection dictionary, and in theory this problem goes away. Rough sketch below.
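Something like this is the idea (a rough sketch, not the actual PR code; the helper name, the connections dict handling, and the 30 second timeout are placeholders):

import asyncio

async def drain_or_drop(writer, connections, key, timeout=30.0):
    # Sketch: bound drain() so a peer that never ACKs can't block us forever.
    try:
        await asyncio.wait_for(writer.drain(), timeout=timeout)
    except (asyncio.TimeoutError, ConnectionError):
        # We never got below the low-water mark: assume the socket is dead,
        # yank the connection out of the dictionary and close the transport.
        connections.pop(key, None)
        writer.close()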

Crossing fingers...

spikefishjohn commented 5 months ago

Fix located here.

PR2347

grossmj commented 4 months ago

Thanks, I am going to check everything soon (need to catch up after holidays) 👍

grossmj commented 4 months ago

Awesome, thanks for the fix, I hope we have finally squashed that annoying bug 👍

spikefishjohn commented 4 months ago

It's.. uh.. not.

@grossmj Can you re-open this while I yell into the void in an overly dramatic manner?

spikefishjohn commented 4 months ago

Here is the exception. I don't know how long it was hung on this call, but you can see it does time out after a while. My guess is 2 hours. Oh, I also feel like maybe the keepalives aren't always being set for some reason, but I have no evidence.
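For reference, this is the kind of per-connection keepalive setup I mean (sketch only; the timing values are made up and enable_keepalive is not an existing helper in the server):

import socket

def enable_keepalive(writer, idle=60, interval=10, count=3):
    # Sketch: turn on TCP keepalives for one console connection.
    # 'writer' is the asyncio StreamWriter for that connection.
    sock = writer.get_extra_info("socket")
    if sock is None:
        return
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific knobs; skip them where the constants don't exist.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)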

Knee-jerk reaction is asyncio.wait_for() again, roughly as sketched below. I'll put some time into it this weekend.
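Roughly this shape, i.e. bound the network read the same way the drain got bounded (sketch only; the names, read size, and timeout are placeholders, and an idle-but-healthy console would need to be told apart from a dead one before doing this for real):

import asyncio

async def read_or_drop(reader, writer, connections, key, timeout=3600):
    # Sketch: don't let a read on a dead socket hang until errno 110 finally fires.
    try:
        return await asyncio.wait_for(reader.read(1024), timeout=timeout)
    except (asyncio.TimeoutError, OSError):
        connections.pop(key, None)
        writer.close()
        return b""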

2024-03-01 02:36:54 ERROR base_events.py:1758 Task exception was never retrieved
future: <Task finished name='Task-1745' coro=<AsyncioTelnetServer.run() done, defined at /usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/utils/asyncio/telnet_server.py:188> exception=TimeoutError(110, 'Connection timed out') created at /usr/lib/python3.10/asyncio/streams.py:242>
source_traceback: Object created at (most recent call last):
  File "/usr/bin/gns3server", line 8, in <module>
    sys.exit(main())
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/main.py", line 87, in main
    run()
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/run.py", line 267, in run
    server.run()
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/web/web_server.py", line 352, in run
    self._loop.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1901, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/lib/python3.10/asyncio/streams.py", line 242, in connection_made
    self._task = self._loop.create_task(res)
Traceback (most recent call last):
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/utils/asyncio/telnet_server.py", line 207, in run
    await self._process(network_reader, network_writer, connection)
  File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/utils/asyncio/telnet_server.py", line 269, in _process
    data = coro.result()
  File "/usr/lib/python3.10/asyncio/streams.py", line 669, in read
    await self._wait_for_data('read')
  File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data
    await self._waiter
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 862, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
TimeoutError: [Errno 110] Connection timed out