Failover doesn't recover router state properly: After VPS connection issue, OMR switches from direct output back to VPS, all traffic gets stuck on VPN.

Ysurac / openmptcprouter

OpenMPTCProuter is an open source solution to aggregate multiple internet connections using Multipath TCP (MPTCP) on OpenWrt

https://www.openmptcprouter.com/

GNU General Public License v3.0

1.86k stars 269 forks source link

Failover doesn't recover router state properly: After VPS connection issue, OMR switches from direct output back to VPS, all traffic gets stuck on VPN. #2958

Closed ioogithub closed 10 months ago

ioogithub commented 1 year ago

After VPS is rebooted and connection to VPS is restored OMR should properly restore the connection to use the proxy not the VPN.

Current Behavior

Currently, OMR will switch to VPN (glorytun) and get stuck on glorytun. It does not restore traffic flow to proxy. Only rebooting OMR or validating the wizard (System->Openmptcp->Setitngs wizard) will restore traffic flow to proxy.

Possible Solution

If VPS is restarted this will happen:

OMR detects there is a problem with the VPS
OMR switches to direct single connection (master MPTCP node)
When VPS is back online, OMR switches back to VPS but it doesn't use proxy.

OMR knows that the VPS has had a connection issue because it can change the path but when it restores back to VPS it uses the VPN (glorytun) and gets stuck on VPN.

Steps to Reproduce the Problem

Using v2ray as proxy and glorytun as VPN.
Have OMR and VPS in a known good state:
Verify all green checks on the status page.
Observe the VPN and proxy traffic values on status page
Run a speed test
Observe the VPN and proxy traffic values, verify that the speedtest used the proxy.
Restart VPS
Wait for OMR to show green again on status page.
Repeat steps 6-8
Result: traffic increases on VPN (90Mb), traffic on proxy stays very low (5k).
Confirm that v2ray is running, accepting connections but OMR will not send traffic to it.

@Ysurac can you tell me:

How can I run the same command as (Settings Wizard->Save And apply) from the command line?
The ideal solution is to have OMR fix itself and restore the proxy state properly after connection loss to VPS.
If it cannot be replicated or fixed, I will try to find a work around to detect VPS connection problem and run a command that simulates (Settings Wizard->Save And apply) from a script to fix the issue temporarily.

Context (Environment)

This is a major problem because glorytun performance is so bad it is unusable (200-400k). It is so painful to even login to VPS with SSH and try to troubleshoot the problem in this state. It would be much better to stay in direct connection mode until proxy traffic can be restored. Glorytun is so slow that it is very important that OMR can recover itself properly to proxy traffic.

Specifications

OpenMPTCProuter version: v0.59.1-5.4 r0+16594-ce92de8c8c
OpenMPTCProuter VPS version: VPS 0.1028
OpenMPTCProuter VPS provider:
OpenMPTCProuter platform: x86_64)

ioogithub commented 1 year ago

I first encountered the issue here : Originally posted by @ioogithub in https://github.com/Ysurac/openmptcprouter/issues/2931#issuecomment-1689176023

@Ysurac, you mentioned that this script should fix the issue: Originally posted by @Ysurac in https://github.com/Ysurac/openmptcprouter/issues/2931#issuecomment-1690148743

I will test it and see if it works. If it does, can this be run automatically when OMR switches from direct output back to vps output so that OMR can failover and recover itself properly without any manual user intervention?

Ysurac commented 1 year ago

I can't reproduce the issue in latest snapshot. The comment was on traffic from VPS to router side, if I understand correctly, here it's traffic from router to VPS side.

ioogithub commented 1 year ago

The comment was on traffic from VPS to router side, if I understand correctly, here it's traffic from router to VPS side.

I am not sure what you mean VPS to router or router to VPS. From the issue, it is when VPS is restarted or connection problem. on Router I see it switches to direct connection (wan1). After VPS comes back up it switched back to aggreate but using VPN (glorytun) only, not using proxy.

I will test it again now and try to recover with /etc/init.d/openmptcprouter-vps restart If this works then at least there is a manual way to recover from this issue. Which script is responsible for restoring the proxy connection after VPN comes back up?

ioogithub commented 1 year ago

I can reproduced this twice. Look at the proxy and VPN values after restarting VPS in steps 5,6,7 to see the problem:

OMR working normally, green on status page, TCP and UDP traffic uses proxy, ICMP other small stuff uses VPN:
sudo restart now on VPS
OMR detects VPS is down
OMR switches to direct mode
OMR detects VPS is back up. All traffic is now using VPN only, run a speed test, look at proxy and VPN values:
- VPN increases
- proxy does not increase
run /etc/init.d/openmptcprouter-vps restart Test with speedtest, look at proxy and VPN traffic:
- VPN increases
- proxy does not increase
- didn't work, traffic still stuck on proxy.
- No change, no messages in the log.
Save and Apply on wizard page, run speedtest, look at VPN and proxy values:
- Traffic is going out of proxy again.
- OMR is restored to normal operation.

I tested this twice, it is reproducible. It would be better for OMR to stay in Direct output mode rather than changing to VPN only because internet is unusable with glorytun, less than 1Mb/s. OMR will stay in this crippled state until the user manually restarts OMR or validates wizard.

What exact script does Save and Apply run. This does not work: /etc/init.d/openmptcprouter-vps restart I want to try to recover from this using a command to see if I can fix the issue automatically.

Ysurac commented 1 year ago

As I said, I can't reproduce this on snapshot, so this will be fixed in next release.

ioogithub commented 1 year ago

As I said, I can't reproduce this on snapshot, so this will be fixed in next release.

What about now, this issue exists in the latest stable release. I understand you only want to to develop the new alpha versions but the last stable release had this problem and its a big problem. OMR doesn't recover from a vps connection issue and leaves the router in an unusable state. The only solution is for the user to manually restart it.

I have time to troubleshoot the issue now. I can do tests and analyze logs but can you give me some basic information? You said "The openmptcprouter-vps script synchronize the router and VPS config. /etc/init.d/openmptcprouter-vps restart" in the other issue but it did not work. When I run this script I get no results in the log.

What script runs when I click Save and Apply on the wizard page? This is the only know way to fix the problem and it always works. If I know this, then I can find a solution to automatically fix the problem.

Ysurac commented 1 year ago

After Save and apply, all services are restarted. You can try a /etc/init.d/v2ray restart if this don't work, try a /etc/init.d/omr-tracker restart

ioogithub commented 1 year ago

I tried restarting v2ray already it did not work.

Save and apply = All services restarted
I guess all services are not restated after omr switches from direct connection back to VPS?

Maybe the problem is this: If VPN is down OMR switches all traffic to single wan. When VPS is restored, OMR switches all traffic to glorytun and does not switch back to proxy. OMR status shows proxy is running but does not want to use it.

During normal operation proxy and VPN are running at the same time. What tells OMR to use the proxy and not use the VPN or TCP and UDP traffic? Routing rules or something else?

This is what I have tried so far, did not work: /etc/init.d/openmptcprouter-vps restart /etc/init.d/openmptcprouter-vps restart /etc/init.d/v2ray restart

I will try this next: /etc/init.d/omr-tracker restart /etc/init.d/glorytun restart

Thank you.

Ysurac commented 1 year ago

A list of all restarted services can be found here: https://github.com/Ysurac/openmptcprouter-feeds/blob/v0.59.1/luci-app-openmptcprouter/luasrc/controller/openmptcprouter.lua#L1009

ioogithub commented 1 year ago

Thank you. This list is very helpful I will try to isolate the problem.

In your test on snapshot, did you wait for OMR to failover to direct connection? I think this is the key. I just did a vps restart and it restarted very quickly, OMR didn't actually switchover to single wan direct connection. On the status page I never saw direct connection. In this test when VPS came back up it was still using proxy.

In the other tests, VPS took longer to restart and OMR switches to direct connection. When it switches back to VPS from direct connection state that is when there is a problem with the proxy traffic.

ioogithub commented 1 year ago

First test. I ran this command: /etc/init.d/glorytun restart >/dev/null 2>/dev/null"

here is the log result:

Thu Sep 14 16:28:27 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Thu Sep 14 16:28:27 2023 daemon.err glorytun[10376]: vpsip.65001: shutdown
Thu Sep 14 16:28:27 2023 daemon.info glorytun[10376]: STOPPED tun0
Thu Sep 14 16:28:27 2023 daemon.notice netifd: Network device 'tun0' link is down
Thu Sep 14 16:28:27 2023 daemon.notice netifd: Interface 'omrvpn' has link connectivity loss
Thu Sep 14 16:28:27 2023 daemon.notice netifd: Interface 'omrvpn' is now down
Thu Sep 14 16:28:27 2023 daemon.notice netifd: Interface 'omrvpn' is disabled
Thu Sep 14 16:28:27 2023 daemon.info glorytun[5671]: INITIALIZED tun0
Thu Sep 14 16:28:27 2023 daemon.notice netifd: Interface 'omrvpn' is enabled
Thu Sep 14 16:28:27 2023 daemon.notice netifd: Network device 'tun0' link is up
Thu Sep 14 16:28:27 2023 daemon.notice netifd: Interface 'omrvpn' has link connectivity
Thu Sep 14 16:28:27 2023 daemon.notice netifd: Interface 'omrvpn' is setting up now
Thu Sep 14 16:28:27 2023 daemon.notice netifd: Interface 'omrvpn' is now up
Thu Sep 14 16:28:28 2023 daemon.warn [6974]: <warn>  [SQM_IFB_a12cb] invalid sysfs path read for net/SQM_IFB_a12cb
Thu Sep 14 16:28:28 2023 daemon.warn [6974]: <warn>  [base-manager] couldn't handle kernel event: device net/SQM_IFB_a12cb not found
Thu Sep 14 16:28:28 2023 daemon.warn [6974]: <warn>  [SQM_IFB_c6ec0] invalid sysfs path read for net/SQM_IFB_c6ec0
Thu Sep 14 16:28:28 2023 daemon.warn [6974]: <warn>  [base-manager] couldn't handle kernel event: device net/SQM_IFB_c6ec0 not found
Thu Sep 14 16:28:28 2023 user.notice firewall: Reloading firewall due to ifup of omrvpn (tun0)
Thu Sep 14 16:28:28 2023 user.notice firewall.omr-server: Firewall reload, set server part firewall reloading
Thu Sep 14 16:28:28 2023 user.notice mptcp: Reloading mptcp config due to ifup of omrvpn (tun0)
Thu Sep 14 16:28:28 2023 daemon.err glorytun[5671]: vpsip.65001: connected
Thu Sep 14 16:28:28 2023 daemon.info glorytun[5671]: STARTED tun0
Thu Sep 14 16:28:29 2023 user.notice MPTCP: Flush route cache
Thu Sep 14 16:28:29 2023 user.notice post-tracking-post-tracking: Set firewall on server vps
Thu Sep 14 16:28:30 2023 user.notice v2ray: Rules UP
Thu Sep 14 16:28:30 2023 user.notice v2ray: v2ray-rules -l 1897 -L 1897 -s vpsip --rule-name def --src-default forward --dst-default forward --local-default forward
Thu Sep 14 16:28:30 2023 user.notice v2ray: Reload omr-bypass rules
Thu Sep 14 16:28:30 2023 user.notice omr-bypass: Starting OMR-ByPass...
Thu Sep 14 16:28:31 2023 user.notice post-tracking-post-tracking: Tunnel up : Replace default route by 10.255.255.1 dev tun0
Thu Sep 14 16:28:33 2023 user.notice omr-bypass: Reload dnsmasq...
Thu Sep 14 16:28:33 2023 daemon.info dnsmasq[1]: read /etc/hosts - 4 addresses
Thu Sep 14 16:28:33 2023 daemon.info dnsmasq[1]: read /tmp/hosts/dhcp.cfg01411c - 3 addresses
Thu Sep 14 16:28:33 2023 daemon.info dnsmasq-dhcp[1]: read /etc/ethers - 0 addresses
Thu Sep 14 16:28:33 2023 user.notice omr-bypass: OMR-ByPass is running

I did not stop /etc/init.d/omr-tracker stop >/dev/null 2>/dev/null" like in the restart all code.

It seemed to work! After this command, traffic starts going out over the v2ray proxy again. I will test again a few times. Any idea why restarting glorytun again at the end will restore traffic flow over the v2ray proxy? I see that restarting glorytun also caused a few other events, are these triggered by omr-tracker?

I think this: v2ray: Rules UP and v2ray: v2ray-rules -l 1897 -L 1897 -s vpsip --rule-name def --src-default forward --dst-default forward --local-default forward. is a v2ray reload

I will try a few more tests, just v2ray reload and see if I can reproduce these results.

If I can reproduce it, something needs to trigger a final /etc/init.d/glorytun restart >/dev/null 2>/dev/null" at the end of the sequence where OMR switches from direct output back to VPS. Where would be the best place to do this automatically?

Ysurac commented 1 year ago

When it's not working, can you do a iptables -w -t nat -L -n 2>/dev/null | grep v2r and iptables-save | grep REDIRECT ? To check if iptables rules are here or not.

ioogithub commented 1 year ago

When it's not working, can you do a iptables -w -t nat -L -n 2>/dev/null | grep v2r and iptables-save | grep REDIRECT ? To check if iptables rules are here or not.

Sure I can test this next.

Yesterday I did a lot of testing. Restarting glorytun works every time. When I look at the log after glorytun restart sometimes v2ray reloads rules, sometimes it don't. Strange that restarting glorytun fixes the proxy! Is there a known issue with glorytun in stable version? It shouldn't be so broken, <1mb download when OMR uses glorytun? This is how I found the issue because internet doesn't work. It is so slow I can barely ssh into vps.

I tested these yesterday too, none of them work: /etc/init.d/glorytun restart >/dev/null 2>/dev/null

/etc/init.d/omr-tracker start >/dev/null 2>/dev/null

/etc/init.d/v2ray restart >/dev/null 2>/dev/null

/etc/init.d/v2ray reload>/dev/null 2>/dev/null

I had lots of logs to post after but kate crashed when I went to save, sorry.

Question: Would it be a good idea to have a script that runs once a day from crontab and restarts everything from your reset script:

env -i /bin/ubus call network reload >/dev/null 2>/dev/null"
ip addr flush dev tun0 >/dev/null 2>/dev/null"
/etc/init.d/omr-tracker stop >/dev/null 2>/dev/null"
/etc/init.d/mptcp restart >/dev/null 2>/dev/null"
/etc/init.d/glorytun restart >/dev/null 2>/dev/null"
/etc/init.d/glorytun-udp restart >/dev/null 2>/dev/null"
/etc/init.d/ubond restart >/dev/null 2>/dev/null"
/etc/init.d/openvpnbonding restart >/dev/null 2>/dev/null"
/etc/init.d/omr-tracker start >/dev/null 2>/dev/null"
/etc/init.d/v2ray restart >/dev/null 2>/dev/null"

so whatever state router is in, it will reset to known good state. Is there any disadvantage to doing something like this?

ioogithub commented 1 year ago

iptables -w -t nat -L -n 2>/dev/null | grep v2r and iptables-save | grep REDIRECT

When it is working normally:

-A v2r_def_forward -p tcp -j REDIRECT --to-ports 1897

v2r_def_pre_src  tcp  --  0.0.0.0/0            0.0.0.0/0           
v2r_def_local_out  tcp  --  0.0.0.0/0            0.0.0.0/0           
Chain v2r_def_dst (1 references)
v2r_def_forward  all  --  0.0.0.0/0            0.0.0.0/0            match-set ssr_def_dst_forward dst
v2r_def_forward  all  --  0.0.0.0/0            0.0.0.0/0            /* dst_default: forward */
Chain v2r_def_forward (5 references)
Chain v2r_def_local_out (1 references)
v2r_def_forward  tcp  --  0.0.0.0/0            0.0.0.0/0            /* local_default: forward */
Chain v2r_def_pre_src (1 references)
v2r_def_src  tcp  --  0.0.0.0/0            0.0.0.0/0           
Chain v2r_def_src (1 references)
v2r_def_forward  all  --  0.0.0.0/0            0.0.0.0/0            match-set ssr_def_src_forward src
v2r_def_dst  all  --  0.0.0.0/0            0.0.0.0/0            match-set ssr_def_src_checkdst src
v2r_def_forward  all  --  0.0.0.0/0            0.0.0.0/0            /* src_default: forward */
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:19223 /* !fw3: 19223 on v2ray */ to:192.168.200.167:19223

ioogithub commented 1 year ago

Okay I have restarted the vpn, the results are very different:

iptables-save | grep REDIRECT no output

iptables -w -t nat -L -n 2>/dev/null | grep v2r

output: DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:19223 /* !fw3: 19223 on v2ray */ to:192.168.200.167:19223

Router does not recover from this state until it is rebooted or user manually validates the wizard and restarts all the services. Internet is so bad in this state I can't even load a webpage or finish a speedtest here is from fast.com:

Your Internet 420 kbps *Could not reach our servers to perform the test. You may not be connected to the internet

After running /etc/init.d/glorytun restart:

-A v2r_def_forward -p tcp -j REDIRECT --to-ports 1897

v2r_def_pre_src  tcp  --  0.0.0.0/0            0.0.0.0/0           
v2r_def_local_out  tcp  --  0.0.0.0/0            0.0.0.0/0           
Chain v2r_def_dst (1 references)
v2r_def_forward  all  --  0.0.0.0/0            0.0.0.0/0            match-set ssr_def_dst_forward dst
v2r_def_forward  all  --  0.0.0.0/0            0.0.0.0/0            /* dst_default: forward */
Chain v2r_def_forward (5 references)
Chain v2r_def_local_out (1 references)
v2r_def_forward  tcp  --  0.0.0.0/0            0.0.0.0/0            /* local_default: forward */
Chain v2r_def_pre_src (1 references)
v2r_def_src  tcp  --  0.0.0.0/0            0.0.0.0/0           
Chain v2r_def_src (1 references)
v2r_def_forward  all  --  0.0.0.0/0            0.0.0.0/0            match-set ssr_def_src_forward src
v2r_def_dst  all  --  0.0.0.0/0            0.0.0.0/0            match-set ssr_def_src_checkdst src
v2r_def_forward  all  --  0.0.0.0/0            0.0.0.0/0            /* src_default: forward */
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:19223 /* !fw3: 19223 on v2ray */ to:192.168.200.167:19223

Everything is restored. It looks like you found the problem. This might be an old issue I think I saw it over a year ago when I was trying to get OMR working. I didn't have enough time to troubleshoot so I had to put OMR project on hold for a year.

Currently to fix this issue I added a check to my aggregate script, I look at the bandwidth on tun0, if it is increasing rapidly it means OMR is not using proxy then I restart it with this '/etc/init.d/glorytun restart'. This is not a great solution.

If you have a better solution that uses OMR code, I would be happy to test it.

Ysurac commented 1 year ago

It's omr-tracker-v2ray job to set firewall config (https://github.com/Ysurac/openmptcprouter-feeds/blob/develop/omr-tracker/files/bin/omr-tracker-v2ray) When proxy is working, it should do a /etc/init.d/v2ray rules_up.

ioogithub commented 1 year ago

When proxy is working, it should do a /etc/init.d/v2ray rules_up.

So what do you think is happening here? After VPS comes back online and OMR switches back to VPS, omr-tracker-v2ray don't set the firewall rules. How does omr-tracker-v2ray know proxy is working? Maybe it runs too soon before v2ray is restored or it thinks proxy is not working so it doesn't set the rules?

Should I test with /etc/init.d/v2ray rules_upto see if this fixes the problem?
/etc/init.d/v2ray restart and /etc/init.d/v2ray reload do not fix the problem but /etc/init.d/glorytun restart always fixes it, why? does omr-tracker-v2ray see glorytun is restarted and set the v2ray rules?

ioogithub commented 1 year ago

Actually I think it is the opposite, when I restart VPS and watch OMR logs I see: omr-tracker-v2ray: V2Ray is up (can contact via http 1.0.0.1). So maybe it thinks that v2ray is already up and it doesn't bother to run. So it didn't detect that it was down.

Ysurac commented 1 year ago

omr-tracker-v2ray run in a loop, so it test based on values defined in OMR-Tracker web interface. You should test v2ray rules_up. omr-tracker-v2ray only test v2ray, I think it's omr-tracker that run rules_up when glorytun restart during it's control.

omr-tracker-v2ray check if it can get omr-tracker defined website and if ok and the command iptables -w -t nat... doesn't give result then Reload V2Ray rules. What is the result of uci show omr-tracker ? Maybe a problem in this configuration.

ioogithub commented 1 year ago

Here is the results of uci show omr-tracker

omr-tracker.defaults=defaults
omr-tracker.defaults.enabled='1'
omr-tracker.defaults.hosts='4.2.2.1' '8.8.8.8' '80.67.169.12' '8.8.4.4' '9.9.9.9' '1.0.0.1' '114.114.115.115' '1.2.4.8' '80.67.169.40' '114.114.114.114' '1.1.1.1'
omr-tracker.defaults.hosts6='2606:4700:4700::1111' '2606:4700:4700::1001' '2620:fe::fe' '2620:fe::9' '2001:4860:4860::8888' '2001:4860:4860::8844'
omr-tracker.defaults.tries='3'
omr-tracker.defaults.interval='2'
omr-tracker.defaults.interval_tries='1'
omr-tracker.defaults.wait_test='0'
omr-tracker.defaults.server_http_test='0'
omr-tracker.defaults.type='ping'
omr-tracker.defaults.mail_alert='1'
omr-tracker.defaults.timeout='5'
omr-tracker.defaults.restart_down='1'
omr-tracker.proxy=proxy
omr-tracker.proxy.hosts='1.0.0.1' '212.27.48.10' '198.27.92.1' '151.101.129.164' '77.88.55.77' '1.1.1.1' '74.82.42.42' '198.41.212.162'
omr-tracker.proxy.timeout='10'
omr-tracker.proxy.interval_tries='1'
omr-tracker.proxy.interval='5'
omr-tracker.proxy.enabled='1'
omr-tracker.proxy.tries='3'
omr-tracker.proxy.wait_test='0'
omr-tracker.server=server
omr-tracker.server.enabled='1'
omr-tracker.server.tries='3'
omr-tracker.server.timeout='10'
omr-tracker.server.wait_test='0'
omr-tracker.server.interval='5'
omr-tracker.omrvpn=interface
omr-tracker.omrvpn.type='none'
omr-tracker.omrvpn.timeout='10'
omr-tracker.omrvpn.tries='3'
omr-tracker.omrvpn.interval='5'
omr-tracker.omrvpn.enabled='1'
omr-tracker.omrvpn.server_http_test='1'
omr-tracker.omrvpn.restart_down='0'
omr-tracker.omrvpn.hosts='4.2.2.1' '8.8.8.8'
omr-tracker.omrvpn.wait_test='0'
omr-tracker.omrvpn.mail_alert='1'

I will test by running v2ray rules_up next and see if it fixes the problem.

ioogithub commented 1 year ago

I made the change https://github.com/Ysurac/openmptcprouter-feeds/blob/develop/omr-tracker/files/bin/omr-tracker-v2ray#L103 in this file I changed "v2r" to "^v2r" and restarted.

It seemed to work, in the log I see this: Fri Sep 15 14:55:07 2023 daemon.info omr-tracker-v2ray: Reload V2Ray rules and I have never seen this in the log before, so this if statement has never been run before.

It created a new problem however. When OMR restarted and traffic is switched back to proxy, all of the clients connected to the network could not resolve DNS after this. I had to go to each client of the network and turn their internet connection on and off. I am using OMR DNS because I am using OMR-bypass which was working before this change.

Any ideas on why making this change would kill DNS? I left one client alone but it never recovered on its own so I have to manually restart their networks connection after this change.

I will do another few tests and report the results.

ioogithub commented 1 year ago

I have tested two more times and the results look good. I am not able to reproduce the problem with DNS. I believe what happened is my temporary fix script that runs every 15 minutes ran during the test and brought the v2ray rules up when v2ray was not available.

I disabled this script and it looks like this change is working. I can confirm that I see this lines in the log which I never saw before making the change:

Fri Sep 15 15:46:23 2023 daemon.info omr-tracker-v2ray: Reload V2Ray rules
Fri Sep 15 15:46:23 2023 user.notice v2ray: Rules UP

Thanks @Ysurac this fix will greatly improve OMR reliability for v2ray proxy users.

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days