GNS3 / gns3-server

GNS3 server
GNU General Public License v3.0

Address the telnet console death bug. #2347

Closed spikefishjohn closed 4 months ago

spikefishjohn commented 5 months ago

Add a wait_for() around the drain() call with a 5 second timeout. If we're stuck on drain() then the buffer isn't getting emptied. Five seconds after drain() blocks, an exception will be thrown and the client will be removed from the connection table, so it will no longer be a problem. There will be a pause in console output during this 5 second wait; after that the console will return to normal.
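A minimal sketch of the idea (the function and names below are illustrative, not the actual gns3-server code): wrap drain() in asyncio.wait_for() so a client that never empties its receive buffer raises a timeout instead of blocking the server forever.

```python
import asyncio


async def send_to_client(writer, data, timeout=5):
    """Hypothetical sketch: write to one telnet client, but give up if
    its buffer stays full for `timeout` seconds."""
    writer.write(data)
    try:
        # Bound drain() so a stuck client can't block the send loop.
        await asyncio.wait_for(writer.drain(), timeout=timeout)
    except asyncio.TimeoutError:
        # Client is stuck: close it so the caller can drop it from
        # the connection table.
        writer.close()
        raise
```

On timeout, wait_for() cancels the pending drain() and raises asyncio.TimeoutError, which the caller can use to remove the dead client.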

When the exception is triggered, the following debug message will be logged.

2024-02-03 03:07:15 DEBUG telnet_server.py:309 Timeout while sending data to client: ('REMOTE_IP', REMOTE_PORT), closing and removing from connection table.

It just dawned on me that peerinfo() is a list, so if you want that reformatted it would be easy enough to do.

Fixes Issue 2344

spikefishjohn commented 5 months ago

oops, just noticed it's a 10 second timeout and not 5.

spikefishjohn commented 4 months ago

One additional note about the fix. I switched to looking up items by key instead of by value, and to iterating over a list of the keys, so that we can safely remove items from the dictionary during traversal.
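The key-snapshot pattern mentioned above can be illustrated with a plain dict (this is a toy example, not the actual connection-table code):

```python
# Taking a list() snapshot of the keys lets us delete entries while
# looping; iterating the dict directly while deleting would raise
# "RuntimeError: dictionary changed size during iteration".
connections = {"client_a": 1, "client_b": 2, "client_c": 3}
for key in list(connections.keys()):
    if connections[key] == 2:
        del connections[key]
```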

cristian-ciobanu commented 4 months ago

@spikefishjohn so merging this pull request will completely fix the console freeze bug when using telnet?

spikefishjohn commented 4 months ago

It does for me. We may want to lower the timeout, as 10 seconds feels a little long. That being said, the patch could use additional testing. Do you have a way to replicate the issue, btw?

The way I recreate the issue is to clear the firewall's connection table for the telnet session (I'm accessing GNS3 through a firewall, to be clear). This causes the firewall to silently drop the TCP session. Then open a second telnet session and do something that creates a large amount of output on the console. I use find / on a Linux node, for example.

Without the patch, the console will output data for about 10 seconds or so, then lock up. You'll be able to open new sessions, get connected, and even send data, but you won't see anything happening on the console because the echo of your data doesn't get sent back to you.

With the patch, there will be a pause while GNS3 is trying to communicate with the dead telnet session. After 10 seconds it will give up and delete the session, and then session 2 will start working again, as will any new session.

Also note that the next version of GNS3 will have TCP keepalive timers lowered so that it sends keepalives more often than every 2 hours. This will prevent a firewall from killing a session that has been idle for too long, but won't fix the broken console issue.
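For reference, lowering TCP keepalive timers on a socket looks roughly like this. The TCP_KEEP* constants are Linux-specific, and the values below are illustrative; this is not the actual GNS3 change.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Enable keepalive probes on the connection.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# On Linux: start probing after 60s of idle time instead of the 2-hour
# default, probe every 10s, and give up after 5 failed probes.
# (Illustrative values, not the ones GNS3 will actually ship.)
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
```

With settings like these, a stateful firewall sees traffic on the session at least every minute, so it won't expire the connection as idle.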

cristian-ciobanu commented 4 months ago

Do you have a way to replicate the issue btw?

I can only replicate this by leaving the telnet console session open for a while (no activity for several hours, I think) until I receive the message connection refused.

spikefishjohn commented 4 months ago

Hmmmm, connection refused. OK, I can't say for sure this will address your issue, but only because I don't know what happens if the telnet console is left in a bad state for long enough. I personally have never seen it throw connection refused, which would mean the listening socket closed. That being said, if the GNS3 node continues to try to push data out and can't write it, at some point I would expect the service to do something bad (like crash)... so it's possible.

If you can give me more details I'll see if I can replicate it, or by all means give the update a try and report back.

For me to replicate it, can you tell me how you're accessing the telnet console, network-topology wise? That will give me an idea of the bullet points I'm looking for.

cristian-ciobanu commented 4 months ago

Sorry I got a bit confused. I have two different issues related to telnet console connections.

  1. The telnet console session freezes if I leave it open for a while (no activity for several hours, I think); no message is received.
  2. When I use the auto-start console option for devices, the console of the devices automatically pops up with the message connection refused (I'm using SecureCRT to access the device consoles). If I close the console and reopen it, it starts to work fine.

I have two different setups for GNS3.

  1. GNS3 server 2.2.44 running on bare metal Ubuntu 22.04 machine and GNS3 client 2.2.44 running remotely on Windows.
  2. GNS3 3.0.0 beta (server and client) running on bare metal on Linux (same machine Archlinux)

The first issue, the telnet console freeze, happens only with the remote server setup, not when running both server and client on the same machine. The second issue happens randomly, so it is not easy to replicate. I think it may occur when I start multiple devices at once and they all open their consoles, but I need to do some more extensive testing.

spikefishjohn commented 4 months ago

ah ok.

For sure, this will fix issue number 1. I'll bet my burrito on it. Please give the patch a try and see if I owe you one.

Issue 2 sounds like a completely different thing, like a timing issue where it hasn't waited for the device to report running, or maybe for the console to have started. I would open a new bug report for that.

cristian-ciobanu commented 4 months ago

@grossmj does the 3.0 branch not need this patch applied, or should the other pull request using the telnetlib3 rewrite in theory fix this annoying issue?

grossmj commented 4 months ago

@grossmj does the 3.0 branch not need this patch applied, or should the other pull request using the telnetlib3 rewrite in theory fix this annoying issue?

The patch will be merged into 3.0 and the other pull request using telnetlib3 should make it there as well once it has been sufficiently tested.

spikefishjohn commented 4 months ago

Sooo... uh... loaded up 2.2.46. Opened a telnet console. Slept the laptop, went to bed. Woke up. Double-clicked the vyos box.

... nothing...

.....

[screenshot]
spikefishjohn commented 4 months ago

EDIT: This is completely wrong. Doh! I need to stop the late night bug research.

I did notice something: this hits the except clause if the write fails, which calls close(), which then does this.

    async def close(self):
        for writer, connection in self._connections.items():
            try:
                writer.write_eof()
                await writer.drain()
                writer.close()
                # await writer.wait_closed()  # this doesn't work in Python 3.6
            except (AttributeError, ConnectionError):
                continue

So my theory is that the drain() call here needs a wait_for as well. I'm going to see if I can recreate the problem and then take a swing at it again.
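That theory could be sketched roughly as follows, applying the same wait_for bound (plus key-snapshot iteration) inside close(). The class, attribute names, and timeout are illustrative stand-ins, not the actual gns3-server code.

```python
import asyncio


class TelnetConnectionTable:
    """Hypothetical stand-in for the server's connection table."""

    def __init__(self):
        self._connections = {}  # writer -> connection

    async def close(self, timeout=10):
        # Snapshot the keys so removing entries while looping is safe,
        # and bound drain() so one stuck client can't hang shutdown.
        for writer in list(self._connections.keys()):
            try:
                writer.write_eof()
                await asyncio.wait_for(writer.drain(), timeout=timeout)
                writer.close()
            except (AttributeError, ConnectionError, asyncio.TimeoutError):
                continue
            finally:
                del self._connections[writer]
```

The key difference from the quoted code is that a client stuck in drain() now raises asyncio.TimeoutError after `timeout` seconds instead of blocking close() indefinitely.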