Ysurac / openmptcprouter

OpenMPTCProuter is an open source solution to aggregate multiple internet connections using Multipath TCP (MPTCP) on OpenWrt
https://www.openmptcprouter.com/
GNU General Public License v3.0
1.71k stars 252 forks

clients randomly lose connection when master or other WANs drop connection #3392

Open Fsoc1337 opened 3 weeks ago

Fsoc1337 commented 3 weeks ago

Expected Behavior :

4 WANs (Starlink, 2 LTE, ADSL) using any of the available proxies, with Glorytun UDP for ICMP and UDP traffic.

Expected behavior is failover and redundancy: when one or more WANs lose their internet connection, clients should stay connected without timeouts (in TCP games like WoW).

Current Behavior :

It does aggregate bandwidth (TCP speedtest), but when I start testing by disconnecting WANs from the internet, some connections drop randomly. There are 9 clients connected to the OMR router, all playing the same game on the same server. When I intentionally disconnect the master WAN (or sometimes other WANs) from the internet, some clients get disconnected while others stay connected.
The clients that lose their connection have to close that specific app and run it again to regain internet access, or they won't be able to connect (I remember the same thing happening back in the old days with apps like Proxifier ...).

It works perfectly when I set the proxy to none and use Glorytun (or other VPNs) for all traffic: I get instant failover with no disconnects as long as at least one connection is up. The problem with the VPN is that the bandwidth is extremely low, around 15 Mb download and 2 Mb upload.

I just need bandwidth like the proxies give, plus redundancy and instant failover like the VPNs give, both at the same time 😅

Did you check the log? In Status->System log

I did, but didn't catch anything (not sure if I understand the logs correctly).

Specifications :

Fsoc1337 commented 3 weeks ago

The motherboard's main Ethernet port is connected to a switch, and the WANs use physical Ethernet ports attached to the motherboard's PCI slot.
Not using VLANs; the WANs are in Normal mode.

Ysurac commented 3 weeks ago

I would need a screenshot of System->OpenMPTCProuter, "Status" page

Fsoc1337 commented 3 weeks ago

Right now I'm using Glorytun UDP. It works flawlessly, but the bandwidth is very low: speedtest shows 18 Mb download and 2 Mb upload, while Starlink alone can achieve more than 200 Mb.

vempire-ghost commented 3 weeks ago

I am using OMR exclusively for gaming, especially for playing WoW for the past year. Various specific configurations were needed to achieve a seamless failover. I don't know if it will work for you, but I will share my considerations.

The best scheduler for gaming is the redundant one. It ensures that each packet goes through all WANs simultaneously, and the one that arrives first is delivered to the game server. This way, you get the best possible latency and a guarantee of no disconnection. However, this scheduler is currently only available in the 5.4 kernel. Using the redundant scheduler also removes speed aggregation.

Regarding proxy and VPN, using the proxy resulted in much more stable latency compared to VPN, specifically using xray vless/vmess. For VPN, the best result for me was with Glorytun TCP. The UDP version, when there was packet loss on one of the WANs, performed poorly and caused many latency spikes.

Using these settings and other more specific configurations, I achieved a very high level of seamless failover. Even in the most aggressive tests where I disconnected all WANs except one, and kept connecting and disconnecting them one by one, I didn't experience a single latency spike, lag, or disconnection.

I hope you can achieve this goal too :)

Fsoc1337 commented 3 weeks ago

Hi vempire, thanks for sharing your experience with me.

I'm confused about something: you said you are using version 5.4 with the redundant scheduler, but you also mentioned using xray vless/vmess for better latency. As far as I know there is no xray vless/vmess in version 5.4, am I right?

vempire-ghost commented 3 weeks ago

Version 0.60 5.4 legacy of OMR has all the same proxy and VPN options as version 0.60 6.1.

Fsoc1337 commented 3 weeks ago

Oh! Thanks, I must have confused it with an older version.

I will give it a try, thanks.

Fsoc1337 commented 3 weeks ago

You mentioned some more specific settings and configurations earlier. May I ask what those are? What other suggestions do you have for achieving my goal? Thanks in advance.

vempire-ghost commented 3 weeks ago

Here are a few details. For example, if you use the proxy, OMR doesn't handle the loss of connection on the master WAN well. To avoid this, I created a special WAN behind a failover router to ensure that the master connection always has internet access available.

Another issue is that in kernel 5.4 there is a bug in MPTCP where sometimes the subflow can expire or terminate due to timeout, and when the internet returns on that WAN, it is not recreated unless the multipath is turned off and on again in the settings for that WAN. So, I created a script that runs every minute to turn off and on the multipath to ensure that if a subflow has expired, it will be recreated. Example of the script I use in OMR's cronjob:

*/1 * * * * multipath eth2 off ; sleep 1 ; multipath eth2 on ; sleep 5 ; multipath eth3 off ; sleep 1 ; multipath eth3 backup ; sleep 5 ; multipath eth4 off ; sleep 1 ; multipath eth4 on ; sleep 5 ; multipath eth3 on ; multipath eth5 off ; sleep 1 ; multipath eth5 on ; multipath eth3 backup

I don't include the master WAN in this script because if a connection is initiated during the interval when the multipath is off, it will be created as a simple TCP connection and will not have the capability for subflows even after the multipath is turned back on.
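For readability, the same idea can be written as a small script instead of a one-line cron entry. This is only a sketch under assumptions, not the exact script used here: the interface names and modes in `WANS` are placeholders, `MULTIPATH_CMD` is a hypothetical variable that allows a dry run with `echo`, and the 1-second and 5-second pauses are taken from the cron line above. It also cycles each WAN once, which is simpler than the eth3 handling in the cron line.

```shell
#!/bin/sh
# Sketch: switch multipath off and back to its normal mode, one WAN at a time.
# The master WAN is deliberately left out of the list, as explained above.

# Placeholder list of "interface:mode" pairs; adapt it to your own WANs.
WANS="${WANS:-eth2:on eth3:backup eth4:on eth5:on}"
# Command that changes the multipath state (override it for a dry run).
MULTIPATH_CMD="${MULTIPATH_CMD:-multipath}"
OFF_WAIT="${OFF_WAIT:-1}"    # seconds to stay off before re-enabling
NEXT_WAIT="${NEXT_WAIT:-5}"  # pause before touching the next WAN

# Turn one interface off, wait, then restore the given mode.
cycle_wan() {
    iface="$1"
    mode="$2"
    $MULTIPATH_CMD "$iface" off
    sleep "$OFF_WAIT"
    $MULTIPATH_CMD "$iface" "$mode"
}

# Run cycle_wan for every entry in WANS, in order.
cycle_all() {
    for entry in $WANS; do
        cycle_wan "${entry%%:*}" "${entry##*:}"
        sleep "$NEXT_WAIT"
    done
}
```

To use it as a script, add a final `cycle_all` line. Setting `MULTIPATH_CMD="echo multipath"` with both waits at 0 prints the commands instead of running them, which is a cheap way to check the order before touching a live router.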

I also changed the congestion control to wvegas, which worked best for me.
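As a hedged aside: on Linux, the congestion control algorithms a kernel offers are listed in the standard `net.ipv4.tcp_available_congestion_control` sysctl. The sketch below only checks whether a name appears in such a list. `cc_in_list` is a hypothetical helper, and whether `wvegas` is offered depends on the kernel build.

```shell
#!/bin/sh
# Return 0 if algorithm $1 appears as a whole word in the space-separated list $2.
cc_in_list() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on the router (reads the standard Linux sysctl file):
# avail="$(cat /proc/sys/net/ipv4/tcp_available_congestion_control)"
# cc_in_list wvegas "$avail" && echo "wvegas is available"
```

The padding with spaces makes the match exact, so `vegas` does not match a list that only contains `wvegas`.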

I noticed that having my VPS physically close to the game server is better as it ensures a multipath connection closer to the game server. If your VPS is near your home, the path between it and the game server will be all in regular TCP, and if there is a problem on this path, there will be a disconnection.

Because of the redundant scheduler, I avoid routing any non-game traffic through the OMR to prevent overloading the WANs with unnecessary data. To achieve this, I use Proxifier to direct only the traffic from the game executables to the OMR WAN, while all other traffic goes through a normal WAN.

Yessine1200 commented 3 weeks ago

@vempire-ghost Hello vempire, how exactly did you activate the redundant scheduler in OMR v0.60 5.4? Did you have to compile another image of OMR with your modifications, even when you wanted to write a script?

vempire-ghost commented 3 weeks ago

In version 5.4, the redundant scheduler is a standard option. You just need to download that version from the OMR site and install it normally. In the MPTCP options, you will be able to choose "redundant", as shown in the figure below. [image]
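To confirm from a shell that the choice took effect, something like the following may help. It is a sketch that assumes the 5.4 image runs the out-of-tree MPTCP kernel, which exposes the active scheduler as the `net.mptcp.mptcp_scheduler` sysctl. The `SCHED_FILE` variable and both helper names are hypothetical, and the path may differ on other builds.

```shell
#!/bin/sh
# Assumed location of the scheduler sysctl on the out-of-tree MPTCP kernel.
SCHED_FILE="${SCHED_FILE:-/proc/sys/net/mptcp/mptcp_scheduler}"

# Print the active scheduler name, or "unknown" if the file cannot be read.
current_scheduler() {
    if [ -r "$SCHED_FILE" ]; then
        cat "$SCHED_FILE"
    else
        echo "unknown"
    fi
}

# Succeed only when the active scheduler is "redundant".
is_redundant() {
    [ "$(current_scheduler)" = "redundant" ]
}
```

For example, `is_redundant && echo "redundant scheduler active"` prints the message only when the file exists and contains `redundant`.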

I didn't understand the question about the script.

Yessine1200 commented 3 weeks ago

@vempire-ghost Thank you for your response. You mentioned that you wrote a script for when the subflow can expire or terminate due to timeout. How did you do that?

vempire-ghost commented 3 weeks ago

When I conducted tests similar to those by the creator of this thread, intentionally disconnecting WANs, I noticed that interrupted data flows were not automatically resumed when the WAN was reconnected. Through trial and error, I discovered that turning multipath off and on again would resume the flow and recreate the subflow.

Later, I found out about the problem in this issue: https://github.com/multipath-tcp/mptcp/issues/153, where they reached the same conclusion. Although it appears to have been fixed, the problem persists in the latest version.

The script itself is simple. It just turns off the multipath in an orderly manner, one WAN at a time, waits 1 second, and turns it back on, thus recreating the subflows that might have been interrupted. This 1-second wait between turning off and on is crucial for the subflow to be recreated.

Unfortunately, I can't determine if a subflow has broken and needs to be recreated, so the script runs every minute.

Fsoc1337 commented 2 weeks ago

Where and how can I implement the script? Can you guide me through it, please?

vempire-ghost commented 2 weeks ago

You just need to place it here on this screen, but first you have to adapt it to your own setup: the WANs you have and the configuration you chose for them. Copying and pasting mine will not yield a good result. [image]
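The screenshot is not reproduced here. On stock OpenWrt, the LuCI "Scheduled Tasks" page edits root's crontab at `/etc/crontabs/root`. Assuming OMR keeps that behavior, an entry could also be added from a shell, roughly as sketched below. The helper name `add_cron_line` and the script path `/root/cycle-multipath.sh` are hypothetical.

```shell
#!/bin/sh
# Append a cron line to a crontab file unless the exact line is already there.
add_cron_line() {
    file="$1"
    line="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (file path and script name are assumptions, adapt them):
# add_cron_line /etc/crontabs/root '* * * * * /root/cycle-multipath.sh'
# /etc/init.d/cron restart
```

Checking for the exact line first keeps repeated runs from piling up duplicate entries, which would otherwise toggle the WANs several times per minute.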

Brazzo978 commented 2 weeks ago

Hi, could you explain more about that bug? It sounds like what I have been trying to fix since 0.60 came out. Thanks.

vempire-ghost commented 2 weeks ago

The bug is described in this issue: https://github.com/multipath-tcp/mptcp/issues/153. It only affects kernel 5.4. Although it is marked as fixed, it still occurs frequently.

In summary, when a WAN loses internet connectivity, the related subflow is terminated. However, when connectivity is restored, the subflow should be recreated, but sometimes it isn't, so the flow is not reestablished.

What was discovered is that by deactivating and reactivating the multipath on the WAN, the subflow is reestablished. If this is the issue, the script works as a workaround.

Brazzo978 commented 2 weeks ago

Oh OK, I get the problem, but killing each connection at a fixed interval is not optimal; I think there is a risk of creating problems.

vempire-ghost commented 2 weeks ago

Yes, unfortunately, it is a solution that may not be ideal. I have been using it daily for several months, and so far, I haven't had any problems caused by it in my specific case.

In my tests using OMR for games, this solution caused small latency spikes when used with the default scheduler. However, when used with the redundant scheduler, I didn't notice any latency fluctuations, and every time there was a subflow break, it was promptly reestablished, achieving what I was aiming for.

Yessine1200 commented 2 weeks ago

@vempire-ghost And with the redundant scheduler, do you still have the full aggregated bandwidth?

vempire-ghost commented 2 weeks ago

No, the redundant scheduler does not aggregate speed. Since it sends each packet over all WANs simultaneously, it is designed for resilience. The maximum speed will be the highest speed of one of the WANs. If the goal is to increase speed, the redundant scheduler will not be useful.
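As a rough illustration of that difference, the sketch below compares idealized upper bounds for a set of link speeds: aggregation can at best approach the sum of the links, while the redundant scheduler is capped at the fastest single link. The function names are hypothetical and protocol overhead is ignored.

```shell
#!/bin/sh
# Idealized best case with aggregation: the sum of all link speeds.
aggregate_max() {
    total=0
    for speed in "$@"; do
        total=$((total + speed))
    done
    echo "$total"
}

# Idealized best case with the redundant scheduler: the fastest single link.
redundant_max() {
    best=0
    for speed in "$@"; do
        if [ "$speed" -gt "$best" ]; then
            best="$speed"
        fi
    done
    echo "$best"
}
```

With made-up speeds of 200, 30, 30 and 8, `aggregate_max 200 30 30 8` gives 268 and `redundant_max 200 30 30 8` gives 200.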

Brazzo978 commented 2 weeks ago

@vempire-ghost As pointed out by Ysurac, the kernel 6.6 snapshot does have the redundant scheduler back. I'll try it and report whether its performance is similar to 5.4.

vempire-ghost commented 2 weeks ago

That's great. I'll be looking forward to your observations about the redundant scheduler in 6.6. I'll wait for the stable release to test it.

Yessine1200 commented 1 week ago

@vempire-ghost Just had an idea: did you use OMR-DSCP to prioritize your traffic for your game? You can prioritize your traffic on the best link you have, use the default scheduler, and still achieve full-speed aggregation.

vempire-ghost commented 1 week ago

I don't need to aggregate speed because my main connection already has good speed. What I need is guaranteed resilience against drops, latency, or packet loss, and the redundant scheduler is perfect for this. It always ensures that the packet arrives via the best route at that millisecond. When I use the default scheduler and the best WAN route is poor, it reflects in the ping as the scheduler will have to resend the lost packet. With the redundant scheduler, the packet is sent through all WANs, ensuring it arrives on at least one of them.
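A back-of-the-envelope way to see why this helps: if each WAN drops packets independently (an assumption that real links sharing infrastructure may not satisfy), a packet sent on every WAN is lost only when all copies are lost, so the combined loss rate is the product of the per-WAN loss rates. The helper below is a hypothetical sketch of that arithmetic.

```shell
#!/bin/sh
# Probability that one packet is lost on every WAN, given per-WAN loss
# rates between 0 and 1 and assuming losses are independent between links.
all_paths_loss() {
    echo "$@" | awk '{ p = 1; for (i = 1; i <= NF; i++) p *= $i; printf "%.6f\n", p }'
}
```

For example, with made-up loss rates of 5%, 10% and 10%, `all_paths_loss 0.05 0.10 0.10` works out to 0.0005, so about 1 packet in 2000 would be lost on all three links at once.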