TurboVNC / turbovnc

Main TurboVNC repository
https://TurboVNC.org
GNU General Public License v2.0

TurboVnc Server/Session freeze after network interruption #359

Closed techdude closed 1 year ago

techdude commented 1 year ago

I've been seeing a problem with VNC sessions crashing while I'm using them, possibly related to network interruptions. I'm still investigating the network issues to determine the specific cause, but I've noticed that the TurboVNC Server doesn't recover gracefully after a viewer disconnects. Sometimes after the crash, I can reconnect right away. Sometimes, if I start the viewer and click Connect, the viewer disappears (although the javaw process is still running) for 20-30 minutes, and then the password box suddenly pops back up, presumably after some sort of timeout. Rarely, but much more frustratingly, it doesn't come back after 20-30 minutes, and the session stays stuck, accepting no connections indefinitely (at least on the order of days: I've left it over the weekend to see whether I could reconnect on Monday).

As for where the viewer gets stuck, it is before the authentication box pops up. I tried running with loglevel 150 one time when it was permanently frozen, and this is what is shown:

C:\Program Files (x86)\TurboVNC>vncviewer.bat ################:1 -loglevel 150
jawt.dll path: C:\Program Files (x86)\TurboVNC\java\jre\bin
Log setting: 150
main: start called
CConn: connected to host ################ port 5901
CConnection: reading protocol version

And that's it. It gets stuck on reading the protocol version and never advances. The only way out of this state is to SSH into the server and kill the process with kill -9 (running Xvnc --kill :1 doesn't work, and a plain kill is ignored).
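For context, "reading protocol version" is the first step of the RFB handshake: the server is expected to send a 12-byte ProtocolVersion greeting such as `RFB 003.008\n` as soon as the TCP connection is accepted. The following Python sketch checks whether a session still sends that greeting, using a timeout so the check itself cannot hang. It is only an illustration, not a TurboVNC tool; the function name `probe_rfb_banner` and the 5-second default timeout are assumptions made for this example.

```python
import socket


def probe_rfb_banner(host, port, timeout=5.0):
    """Return the server's RFB ProtocolVersion string, or None if it never arrives.

    Hypothetical helper for illustration; not part of TurboVNC.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = b""
            # The ProtocolVersion message is exactly 12 bytes, e.g. b"RFB 003.008\n".
            while len(data) < 12:
                chunk = sock.recv(12 - len(data))
                if not chunk:
                    return None  # peer closed the connection before greeting us
                data += chunk
    except OSError:
        # Covers connection refused, unreachable host, and receive timeouts.
        return None
    if not data.startswith(b"RFB "):
        return None
    return data.decode("ascii", errors="replace").strip()
```

Display :1 corresponds to port 5901 (5900 plus the display number), as in the log above. A call such as `probe_rfb_banner("myserver", 5901)` (with `myserver` as a placeholder host name) returning `None` even though the TCP connection succeeds would be consistent with the frozen state described here.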

I'm running TurboVNC 3.0.2 for both the client (on Windows) and the server (on Linux).

So my question is:

dcommander commented 1 year ago

The only time I have ever seen anything remotely like this was with a client of mine who was using a stateful firewall that held onto the TurboVNC connection like a dog with a bone for some reason. TurboVNC, in and of itself, should never behave as you describe.

To answer your specific questions:

techdude commented 1 year ago

It could be related to a stateful firewall, but since I can still reach the server after the connection has failed (by starting another session on the server), it seems to me that the firewall may not be the issue. When the client wait timer times out and the server drops the connection, does it reuse TCP sequence numbers for new connections? Has this client wait timer been tested to confirm that it works and hasn't regressed in newer versions? I don't recall having these issues on the 2.x versions of TurboVNC, and that was back when I had a much worse network connection.

I've tried tcpkill, but that doesn't really help: it only kills incoming connections and does nothing to tell the server that it should stop waiting for the previous client to disconnect. It really does seem like the server gets stuck in an infinite waiting loop of some sort. During these crashes it sits at more than 90% CPU usage (as reported by top), which is higher than under normal operation (typically 20-40% CPU usage).

dcommander commented 1 year ago

I'll try to reproduce the problem. I assume that you are connecting manually to a specific TurboVNC session, as opposed to using the Session Manager? Which security type are you using? Which Linux distribution?

The only thing that changed between 2.2.x and 3.0.x that might explain this is the adoption of new RFB flow control algorithms from TigerVNC. If you're comfortable getting your hands dirty with the code, here is a patch that reverts to the older algorithms used in TurboVNC 2.2.x. If that eliminates the problem, then at least we know where the issue is, but I still need to be able to reproduce the issue in order to fix it.

0001-Restore-TurboVNC-2.2.x-RFB-flow-control-algorithms.patch

techdude commented 1 year ago

I don't have control over the server installation so that will be a bit difficult, but I'll see what I can do.

One more bit of data: when I try to kill the session using vncserver -kill :1, I get the following message, which makes it seem more likely that it is vncserver that is getting stuck in an infinite loop, and not just a firewall or some other piece of external networking:

Xvnc seems to be deadlocked.  Kill the process manually and then re-run
    /opt/TurboVNC/bin/vncserver -kill :1
to clean up the socket files.
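The manual step that this message asks for (kill the process yourself when a normal termination request is ignored) can be sketched in Python as an escalation from SIGTERM to SIGKILL. This is a generic illustration under stated assumptions, not something shipped with TurboVNC: `terminate_with_escalation`, its `grace` and `poll` parameters, and its return values are hypothetical names chosen for this example, and the PID of the stuck Xvnc process has to be looked up separately (for example with ps).

```python
import os
import signal
import time


def _alive(pid):
    """Return True if a process with this PID exists (an unreaped zombie counts)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def _send(pid, sig):
    """Send a signal; return False if the process has already exited."""
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        return False
    return True


def terminate_with_escalation(pid, grace=5.0, poll=0.1):
    """Send SIGTERM, then SIGKILL if the process is still there after `grace` seconds.

    Returns "gone" if no such process existed, "SIGTERM" if the polite signal
    was enough, or "SIGKILL" if the forced kill had to be sent.
    """
    if not _send(pid, signal.SIGTERM):
        return "gone"
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if not _alive(pid):
            return "SIGTERM"
        time.sleep(poll)
    if not _send(pid, signal.SIGKILL):
        return "SIGTERM"  # it exited just as the grace period ended
    return "SIGKILL"
```

Once the process is gone, the message above says to re-run /opt/TurboVNC/bin/vncserver -kill :1 to clean up the socket files.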

The config file looks like everything is commented out, so I assume it's the default security types with simple password. If there is a way to check which security type is used, let me know.

The Linux distro is RedHat 7 Workstation. If there are particular package versions you need, let me know.

dcommander commented 1 year ago

I actually was able to duplicate the problem, but only when using a Windows client. I'm investigating.

techdude commented 1 year ago

That's awesome news!

I do see that there is a -noflowcontrol option for the server, so I might try running with that as well if that disables the new RFB flow control changes.

dcommander commented 1 year ago

I can't seem to reproduce the problem with the older flow control algorithms, so unfortunately that does seem to be the source of it. :( Since that's not my code, it may take me a while to diagnose.

dcommander commented 1 year ago

I was able to reproduce the issue with a macOS client as well. When the network connection drops, the TurboVNC Viewer will eventually time out and ask if you want to reconnect, but the TurboVNC Server hangs onto the connection for some reason. That is true regardless of whether the 2.2.x or 3.0.x flow control algorithms are used. It's just that the 3.0.x flow control algorithms apparently have a bug that causes an infinite loop when that happens, which ties up the TurboVNC X server and prevents the viewer from reconnecting. With the 2.2.x algorithms, the viewer can reconnect, but the original connection is still active for some reason. (So really there are two bugs at work here.)

Note that, to make this occur reliably, I have to run an application (such as a video player, or vglrun -sp /opt/VirtualGL/bin/glxspheres64 if you have VirtualGL installed) that constantly draws something to the TurboVNC X server. This is because the viewer and the server can only detect network interruptions if they are trying to send or receive something.

I'll keep you posted as I learn more about the issue.
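The point above, that neither side of an idle connection can notice a dropped network, is general TCP behavior rather than something specific to TurboVNC. One common general-purpose mitigation is TCP keepalive, which makes the kernel probe an otherwise idle connection. The Python sketch below shows the idea; it is not the fix that was applied to TurboVNC, and `enable_keepalive` and its default timings are assumptions chosen for illustration.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Ask the kernel to probe an idle TCP connection so a dead peer is eventually noticed.

    Hypothetical helper for illustration; the default timings are arbitrary.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained options below are platform-specific (available on Linux),
    # so each one is applied only if this Python build exposes it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Number of unanswered probes before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

With these example values, on a platform that honors all three options, a peer that silently vanishes would be noticed after roughly idle + interval * count seconds (60 + 10 * 5 = 110 seconds), after which the next send or receive on the socket fails with an error instead of waiting indefinitely.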

dcommander commented 1 year ago

New data points:

techdude commented 1 year ago

Yeah, that matches what I've been seeing as well. I suppose I've come to think that if it pops up within 1-5 minutes, it's working fine, so I've only considered it truly frozen when it takes much longer than that to unfreeze.

dcommander commented 1 year ago

Should be fixed now in the latest pre-release builds. Please test it.

techdude commented 1 year ago

Thanks, I'll try out the pre-release build.

techdude commented 1 year ago

I've tried it out for several days now without issues. It looks like it is resolved now. Thanks for looking into this!

brandonbiggs commented 1 year ago

Did this make it into the 3.0.3 release or is it planned for a future release?

dcommander commented 1 year ago

@brandonbiggs Yes, it's in 3.0.3.