erebe / wstunnel

Tunnel all your traffic over Websocket or HTTP2 - Bypass firewalls/DPI - Static binary available

Multiple WireGuard connections through the same wstunnel client<>server #179

Closed. yu-james closed this issue 6 months ago.

yu-james commented 6 months ago

Describe the bug

This issue is a bit tricky. It occurs when multiple WireGuard clients connect through the same tunnel to the WireGuard server.

What is working: 2 WireGuard clients -> 2 wstunnel clients -> wstunnel server -> WireGuard server

Basically, each WG client connects to its own wstunnel client

What is not working: 2 WireGuard clients -> 1 wstunnel client -> wstunnel server -> WireGuard server

When the 2 WG clients connect to the same wstunnel client, both initially handshake successfully. Within seconds, however, both WG connections suffer very high packet drop rates and become almost unusable. Ping packet loss is about 75% or more.

There is hardly any useful information in the trace (with RUST_LOG=trace).

No error is reported at wstunnel client or server.

yu-james commented 6 months ago

I understand this is a tricky issue that might indicate some thread/connection interference related to UDP connections. If any other information would be helpful, I am happy to provide it.

Also note that, as per the 'What is working' section above, there is a known workaround.

erebe commented 6 months ago

Thanks for reporting it :) Would you mind trying release https://github.com/erebe/wstunnel/releases/tag/v7.9.0? It should fix your issue! Let me know if it does.

yu-james commented 6 months ago

Tested and confirmed working. I also tested throughput and can confirm there are no side-effects.

Man, I can't imagine how you were able to pinpoint and fix the issue with such limited information! Big thank you!

yu-james commented 6 months ago

Sorry, I have to re-open this issue.

I have tested the original scenario: 2 WireGuard clients -> 1 wstunnel client -> wstunnel server -> WireGuard server

It worked for a while. Then, after about 20 minutes and a total of 300 MB transferred across both connections, the wstunnel client froze.

By frozen I mean there is no 'Opening TCP connection to "server name"' message when WG is manually reactivated, existing connections hang, and pings time out.

The wstunnel server seems to be intact: I immediately launched a new instance of the wstunnel client and, again, both WG connections worked at first and then froze after a while.

I checked the memory footprint of the executable: dead processes sit at around the 4 MB mark, and the active one at around 6 MB. Again, no error messages.

erebe commented 6 months ago

Ok, this one is weird; I don't see any obvious reason why it behaves like this. On the client side, are you correctly disabling the timeout with -L udp://...?timeout_sec=0 ?

yu-james commented 6 months ago

Ok, this one is weird; I don't see any obvious reason why it behaves like this. On the client side, are you correctly disabling the timeout with -L udp://...?timeout_sec=0 ?

Thanks for the reply. Let me try this.

Also, after many tries, it seems more likely to happen when both clients open a good number of connections, such as opening and loading several webpages at the same time. This doesn't make sense to me, though: as far as I am aware, WG doesn't fan its traffic out into many UDP connections. Plus, it is a bit random; sometimes it takes a long time without freezing, sometimes it happens very quickly.

Anyway, I will experiment with your recommendation. Let's see.

Is there any past issue I could read for background on disabling the timeout with -L udp://...?timeout_sec=0 ?

And as usual, thank you very much!

erebe commented 6 months ago

Not really any past issue. It is just that, as UDP is connectionless, there is no way to know whether the UDP peer has stopped sending data. So, to avoid leaking connections, wstunnel has a default timeout of 30 seconds, after which it force-closes the connection/tunnel.

In the case of VPN traffic, the peer never stops sending traffic, so disabling this forced timeout is a must-do: otherwise you spend your time destroying and re-creating a tunnel for the same UDP flow, which causes havoc server side.

(I have added a mention of this in the 'Wireguard with wstunnel' section.)
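For reference, a minimal sketch of what such a client invocation could look like. The hostname and ports below are placeholders (not taken from this thread), and the v7 client/server subcommand syntax is assumed:

```shell
# Forward local UDP port 51820 (WireGuard's default, assumed here)
# through the tunnel to the WireGuard server sitting behind the
# wstunnel server. timeout_sec=0 disables the 30s idle timeout so
# the long-lived VPN flow is never force-closed.
wstunnel client \
  -L 'udp://51820:localhost:51820?timeout_sec=0' \
  wss://my-wstunnel-server.example.com:443
```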

yu-james commented 6 months ago

Thanks for the feedback. After many more tries, it works slightly better with "-L udp://...?timeout_sec=0". The symptom has changed, and I am now able to reproduce it consistently.

The environment:
System A: WG server (Docker in Linux)
System B: wstunnel server (Docker in CloudFoundry)
System C: wstunnel client + WG ClientA (Windows 11)
System D: WG ClientB, connecting to the wstunnel client on System C (Android WG client)

Put both WG clients to streaming video (at about 1 MB/s). WG ClientA works flawlessly, whereas WG ClientB soon freezes, in about a minute. It can be recovered by reconnecting WG, but it will freeze again within a minute. It is always ClientB, regardless of which client connects first.

I have fallen back to version 6 to test, and can confirm the old version does not have the same issue.

As I am again not able to capture any error message at the wstunnel server, the wstunnel client, or the WG client, I imagine this will be a bit difficult to trace. I can provide you with a testing instance, including Systems A & B, if that would help. In that case, let me know and let's find a secure way to get you the config files.

erebe commented 6 months ago

I haven't re-done the whole setup, but I tested with iperf3 and found that the client can miss some packets because the host drops them when the receive buffer is too small.

Would you mind trying the new release https://github.com/erebe/wstunnel/releases/tag/v7.9.1 and letting me know if it helps?

If not, I will revert to each UDP stream having its own buffer; it is more resilient, but throughput is not as good.
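The receive buffer ceiling is a kernel-level setting rather than a wstunnel flag. A diagnostic sketch of how one might inspect and raise it on Linux (the 8 MiB value is an arbitrary example, not a recommendation from this thread):

```shell
# Show the current default and maximum socket receive buffer
# sizes the kernel will allow (values in bytes).
sysctl net.core.rmem_default net.core.rmem_max
# To raise the ceiling so a single shared UDP socket can absorb
# bursts from several clients (requires root):
#   sysctl -w net.core.rmem_max=8388608
```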

yu-james commented 6 months ago

Will do the testing and report back later. Again, thanks a lot!

yu-james commented 6 months ago

Well... This might sound weird, but after I updated both server and client to 7.9.1, the WG client stopped working completely. I can see the 'sent' number growing, but 'received' stays at 0.

I reverted the client back to 7.9.0 and the WG connection is restored. I am trying everything else now; just mentioning this in case it rings any bells.

erebe commented 6 months ago

Ok, sorry, I left some debug code in the release. Doing a new 7.9.1 right now; you can re-test in ~10 min.

erebe commented 6 months ago

Should be good, sorry about that.

yu-james commented 6 months ago

No worries!

Good news: the connection is now back. I will do the throughput testing later. Just a thought regarding your earlier comment about an increased shared buffer vs. independent per-stream buffers: does the former imply a limit on the number of UDP streams supported?

P.S. Looks like you need some coffees. Just sending across a couple more... ;-) Thank you again.

erebe commented 6 months ago

It is not so much a limitation on the number of clients as a limitation on throughput.

With UDP, the server has only a single socket receiving the data of all clients, unlike TCP, where you have a socket (and its associated buffers) per client.

So the rule of the game is to dequeue messages fast enough, before your UDP kernel buffer gets full and the kernel starts dropping packets. The kernel can do that without warning, because UDP makes no delivery guarantees.

The first implementation (like the Haskell one) had an internal buffer per UDP client. That makes dequeuing UDP packets from the kernel buffer fast, but it is more CPU-intensive and puts a ceiling on throughput, because the data has to be copied twice.

With the current implementation there is no internal buffer any more; each client drives the server socket directly. But that makes dequeuing packets slower, because pushing data into the tunnel takes comparatively more wait time than copying data. So there can be some starvation if there are too many clients and one is slow to push data; once the kernel buffer fills up, packets start being dropped.

There are some ways around that, mainly the tricks described in this article: https://blog.cloudflare.com/everything-you-ever-wanted-to-know-about-udp-sockets-but-were-afraid-to-ask-part-1/ . Sadly, they are Linux-only :x
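These silent kernel-side drops can be observed on Linux without any wstunnel involvement. A diagnostic sketch (the relevant counter is RcvbufErrors in the kernel's UDP statistics):

```shell
# The Udp: lines in /proc/net/snmp include a RcvbufErrors column,
# which counts datagrams dropped because a socket's receive buffer
# was full -- exactly the silent drops described above.
grep '^Udp:' /proc/net/snmp
```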

If you confirm that your issue is solved, I will see whether I can rework the code to alleviate the issue.

P.s: Thank you for the coffee :)

yu-james commented 6 months ago

Hey, thanks for fixing it and taking the time to explain what's under the hood! It makes much more sense after reading paragraph 5.

I ran a stress test at the top of my currently available bandwidth for 30 minutes: 3x 4K video streams over 2 WG clients, about 10 MB/s. It was perfect! Thank you!

I am leaving the issue open so you can see it in the list, but please feel free to close it.

yu-james commented 6 months ago

FYI only. There are some occurrences of this error message, about 1 per hour:

ERROR tunnel{id="018b9390-2883-735d-a728-2ce2698e5810" remote="remote server name"}: wstunnel::tunnel::io: error while reading from websocket rx Reserved bits are not zero

Nothing seems to be broken though.

erebe commented 6 months ago

Awesome :) I will close the issue then

Thank you for having taken the time to report those issues; I would have missed them otherwise!

Also, thank you again for all those coffees !

P.s: regarding 'while reading from websocket rx Reserved bits are not zero', it should be ok; wstunnel just re-creates the tunnel.