Ysurac / openmptcprouter

OpenMPTCProuter is an open source solution to aggregate multiple internet connections using Multipath TCP (MPTCP) on OpenWrt
https://www.openmptcprouter.com/
GNU General Public License v3.0

Multipath not restored after interface is restarted, OMR becomes single path TCP router. #2936

Closed ioogithub closed 6 months ago

ioogithub commented 1 year ago

Expected Behavior

After omr-tracker stops and restarts an interface, multipath should be restored to its previous state, with all wan connections in use.

Current Behavior

Multipath is broken after omr-tracker restarts an interface. Traffic now only uses one wan, so OpenMPTCProuter becomes OpenSPTCProuter (single path TCP router).

Steps to Reproduce the Problem

  1. Port forwarding is using v2ray and is working as expected, traffic uses both wan1 and wan2 interfaces.

  2. Start a file upload.

  3. Observe on the bandwidth page that OMR is correctly using all wan connections (wan1 and wan2): https://ibb.co/2YZDzBB

  4. OMR-tracker switches wan2 interface off (due to ping or error):

    Thu Aug 24 17:54:56 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched off because check error and ping from 10.0.0.202 error (9.9.9.9,1.0.0.1,114.114.115.115)
    Thu Aug 24 17:54:56 2023 user.notice post-tracking-post-tracking: Delete default route to x.x.x.x via y.y.y.y dev eth2
  5. OMR-tracker switches wan2 interface back on:

    Thu Aug 24 18:01:45 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched up
    Thu Aug 24 18:01:47 2023 user.notice post-tracking-post-tracking: Interface route not yet set, set route ip r add default via y.y.y.y dev eth2 metric 4
    Thu Aug 24 18:01:47 2023 user.notice post-tracking-post-tracking: New public ip detected for wan2 (eth2): x.x.x.x
    Thu Aug 24 18:01:47 2023 user.notice post-tracking-post-tracking: Reload MPTCP for eth2
  6. Start a new upload. Traffic now only uses wan1; the router is no longer multipath: https://ibb.co/94QZJLq

  7. Router never recovers from this state. Start another upload 1 hour later and it still only uses wan1.

  8. ip r shows there are routes and default routes for both wan1 and wan2, but MPTCP refuses to use wan2 after OMR-tracker restarts it.
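One way to confirm step 6 without the bandwidth page is to sample the kernel's per-interface byte counters during an upload. The following is a minimal sketch, not part of OMR: the function names are made up for illustration, and eth1 as the device behind wan1 is an assumption (the logs above only show that wan2 is eth2).

```shell
#!/bin/sh
# Sketch: report how many bytes each interface transmitted in a time window.
# SYSFS_NET is overridable only so the functions can be tried off-router.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

tx_bytes() {
    # $1 = interface name; prints the kernel's cumulative TX byte counter
    cat "$SYSFS_NET/$1/statistics/tx_bytes"
}

tx_delta() {
    # $1 = seconds to sample, remaining arguments = interface names
    secs="$1"; shift
    before=""
    # First pass: remember the starting counter of every interface
    for ifname in "$@"; do
        before="$before$(tx_bytes "$ifname") "
    done
    sleep "$secs"
    # Second pass: print "<interface> <bytes sent during the window>"
    for ifname in "$@"; do
        start="${before%% *}"     # first saved counter
        before="${before#* }"     # drop it from the list
        now="$(tx_bytes "$ifname")"
        echo "$ifname $((now - start))"
    done
}
```

On the router, something like `tx_delta 10 eth1 eth2` during an upload would print one line per interface. A delta near zero on eth2 would match the single-path behaviour described above.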

Possible Solution 1

I tried two steps to fix the problem:

  1. First I tried resetting using this command: /etc/init.d/openmptcprouter-vps restart
  2. When I observe logread -f on OMR and journalctl -f on the VPS, I do not see any log events after this command! The command executes and exits, but it doesn't do anything observable from the logs.

Possible Solution 2:

  1. Click Save and Apply from the wizard page.
  2. This ultimately fixes the problem and traffic starts using wan1 and wan2 again. However, this is not a good solution as it majorly disrupts the network and requires manual intervention.
  3. I can see from the log above: Thu Aug 24 18:01:47 2023 user.notice post-tracking-post-tracking: Reload MPTCP for eth2. I guess this is where the bug is: Reload MPTCP is not properly restoring the MPTCP bond. Is there a way to get more information on what MPTCP is doing here? Is there any debug mode?
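Short of a debug mode, the tracker's own log lines can at least be reduced to a timeline for one interface. This is a rough sketch based only on the message formats quoted in this issue; `tracker_events` is a hypothetical helper name, and other OMR versions may word these messages differently.

```shell
#!/bin/sh
# Sketch: summarise omr-tracker switch events from syslog text on stdin.
tracker_events() {
    # $1 = interface label as it appears in the log, e.g. wan2
    awk -v ifc="$1" '
        # Only look at lines written by the post-tracking script
        index($0, "post-tracking") == 0 { next }
        # Field 4 is the HH:MM:SS part of the timestamp
        index($0, ifc " (") && /switched off/ { print $4, "down"; next }
        index($0, ifc " (") && /switched up/  { print $4, "up"; next }
        # The reload message names the device (e.g. eth2), not the label
        /Reload MPTCP for/ { print $4, "reload-mptcp", $NF }
    '
}
```

It could be fed from the log on the router, for example `logread | tracker_events wan2`. For the lines quoted above this gives a down event at 17:54:56, an up event at 18:01:45 and the MPTCP reload at 18:01:47.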

I have been tracking this problem for a long time where I see the performance of the system degrade over time. I didn't have the knowledge until recently to isolate the bug and report it. I am available to test any solutions.

Context (Environment)

The issue is bad because it effectively breaks OMR. If only a single path is used after OMR-tracker restarts an interface, then there is no purpose in running OMR at all. Also, there is currently no automatic way to recover from this problem.

Specifications

ioogithub commented 1 year ago

I saw this for the first time ever in the log today:

Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info: service stopped (unbound 1.15.0).
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info: server stats for thread 0: 1177 queries, 0 answers from cache, 1177 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info: server stats for thread 0: requestlist max 4 avg 0.475786 exceeded 0 jostled 0
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info: average recursion processing time 0.000397 sec
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info: histogram of recursion processing times
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info: [25%]=0.00031344 median[50%]=0.000380463 [75%]=0.000447487
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info: lower(secs) upper(secs) recursions
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info:    0.000008    0.000016 4
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info:    0.000016    0.000032 4
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info:    0.000032    0.000064 10
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info:    0.000064    0.000128 3
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info:    0.000128    0.000256 21
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info:    0.000256    0.000512 1122
Wed Sep 27 20:06:26 2023 daemon.info unbound: [15963:0] info:    0.000512    0.001024 11

I don't think this has been running properly before.

ioogithub commented 1 year ago

I believe I was able to get the command to persist by editing /etc/config/omr-tracker and adding the following:

config interface 'wan2'
        option script_alert_up '/etc/init.d/glorytun restart'

Will this cause any issues with other scripts that read this file?

I am seeing a new problem. Sometimes restarting glorytun does not work now. A few times I had to restart v2ray and then it restored aggregate uploads. Now I don't know which one to restart: v2ray or glorytun. I don't know what to look for to know which service needs to be restarted.

Also, the order is not right now. During the last restart, glorytun was restarted before this log line:

Wed Sep 27 23:10:58 2023 user.notice post-tracking-post-tracking: New public ip detected for wan2 (eth2): x.x.x.x

but it needs to be restarted after this time in order to restore aggregate. Is this possible?
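Whether omr-tracker can run the alert script later is not answered in this thread. One hypothetical workaround is to have the script itself wait for that log line before restarting glorytun. The sketch below only illustrates the idea: `wait_for_line` is a made-up helper, it has no timeout, and whether omr-tracker tolerates a blocking alert script is untested.

```shell
#!/bin/sh
# Sketch: block until a log line containing a fixed string is seen on stdin.
wait_for_line() {
    # $1 = fixed string to look for
    # Returns 0 on the first matching line, 1 if the input ends without one.
    while IFS= read -r line; do
        case "$line" in
            *"$1"*) return 0 ;;
        esac
    done
    return 1
}

# Intended use on the router (commented out so the sketch has no side effects):
#   logread -f | wait_for_line "New public ip detected for wan2" \
#       && /etc/init.d/glorytun restart
```

Note that in the pipeline form, logread -f keeps running until it next tries to write, so the restart may fire slightly later than the matching line.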

Ysurac commented 1 year ago

To keep uci settings after reboot, you need to do a uci commit
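Going by the config snippet earlier in this thread, the same option could be set through uci and then committed. A minimal sketch, assuming the omr-tracker section is named wan2 as in that snippet; `persist_alert_script` is a hypothetical helper name, not an OMR command.

```shell
#!/bin/sh
# Sketch: set script_alert_up for one omr-tracker section and make it persistent.
persist_alert_script() {
    # $1 = section (interface) name in /etc/config/omr-tracker
    # $2 = command to run when the interface comes back up
    uci set "omr-tracker.$1.script_alert_up=$2" &&
    uci commit omr-tracker    # write the staged change to /etc/config
}

# Example call on the router (commented out so nothing is changed by accident):
#   persist_alert_script wan2 '/etc/init.d/glorytun restart'
```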

ioogithub commented 12 months ago

Ysurac can you explain:

I was looking at why tun0 was disconnecting and I realized this:

When using v2ray with port forwarding, OMR sets these rules on the vps in /etc/shorewall/rules:

    DNAT net vpn:$OMR_ADDR tcp 12345 # OMR openmptcprouter redirect router 12345 port tcp

But this doesn't make sense: v2ray is the proxy, and this rule is forwarding the traffic from vps to router using the vpn. The vpn by default is glorytun.

I verified it by stopping tun0 and it does prevent any uploads. So this means that uploads use the glorytun tun0??

This also explains why running '/etc/init.d/glorytun restart' returns the bandwidth graph to aggregate mode again.

How does this work? Do all uploads use the glorytun vpn? Why does v2ray open ports for glorytun?

But I want uploads to use v2ray the same as downloads. How do I configure this?

Is my whole upload aggregation problem in this thread actually a glorytun problem, not a v2ray problem?

Ysurac commented 12 months ago

V2Ray is used only for port forwarding when the checkbox V2Ray is checked in the port forwarding configuration.

ioogithub commented 11 months ago

V2Ray is used only for port forwarding when the checkbox V2Ray is checked in the port forwarding configuration.

Yes this is what I want but I don't think it is working. I think uploads are using the tun0 (glorytun) not the v2ray according to status page and restarting services.

To test:

  1. Set up port forwarding on port 1234 and check v2ray. Confirm the rule on the VPS:

    DNAT            net             vpn:$OMR_ADDR   tcp     1234    # OMR openmptcprouter redirect router 1234 port tcp
  2. From a server on the internet, download a file from the port 1234: curl https://domain:1234/file.bin -o /dev/null

  3. Observe status page: https://host/cgi-bin/luci/admin/system/openmptcprouter/status -> Proxy traffic (v2ray) is not increasing, VPN traffic (tun0) is increasing!

  4. On the router, restart v2ray while file is transferring: /etc/init.d/v2ray restart. Observe bandwidth page (https://host/cgi-bin/luci/admin/network/mptcp/bandwidth): -> Transfer continues

  5. On the router, restart glorytun while file is transferring: /etc/init.d/glorytun restart. Observe bandwidth page (https://host/cgi-bin/luci/admin/network/mptcp/bandwidth): -> Transfer is stopped

So uploads using port forwarding seem not to use the proxy (v2ray) but actually use tun0, the glorytun vpn.

ioogithub commented 11 months ago

I double checked, v2ray was not checked!!! On this new install I was using glorytun all along for uploads.

Okay, this makes sense now; behavior is as expected.
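For anyone repeating this check, the destination zone of a forwarded port can be read straight from the Shorewall rules on the VPS. This is a small sketch based on the rule format quoted above; `dnat_dest_for_port` is a hypothetical name, and it only handles rules with a single port in the fifth column.

```shell
#!/bin/sh
# Sketch: print the DEST column of DNAT rules that match a given port.
dnat_dest_for_port() {
    # $1 = port number; reads Shorewall rules on stdin
    # Columns assumed from the rule quoted in this thread:
    #   ACTION SOURCE DEST PROTO DPORT
    awk -v port="$1" '$1 == "DNAT" && $5 == port { print $3 }'
}

# Intended use on the VPS (commented out so the sketch has no side effects):
#   dnat_dest_for_port 1234 < /etc/shorewall/rules
```

A result of vpn:$OMR_ADDR means the port is forwarded into the vpn zone, which in this thread turned out to be the case when the V2Ray checkbox was not checked.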

xzjq commented 10 months ago

I seem to experience this issue as well. I have three WANs and they do not remain aggregated for bandwidth for established connections over time (status page indicates everything is fine, but the MPTCP bandwidth page only shows one or two of the interfaces being used for traffic). As @ioogithub noted, the disaggregation seems to occur after omr-tracker slays an interface, e.g. "post-tracking-001-post-tracking: wan1 (eth0.300) switched off because check error and ping" even though the interface is subsequently switched up within a minute or two.

New connections will continue to use aggregated bandwidth, but established connections (e.g. a TCP-based VPN on a client device that is tunneled via the router) will only use 1 or 2 of the WANs. Aggregation for those existing connections can be restored by going to the Settings Wizard and clicking "save and apply" with nothing changed, and of course this flushes the MPTCP route table, etc., as can be seen in logread. Similarly, kicking the (non-OMR) VPN tunnel established on the client device will also restore aggregation for that connection.

This is on [v0.60beta1-5.4 r0+16862-170d9e447d]

xzjq commented 10 months ago

The scenario seems to be that MPTCP has aggregated connections, but when an interface drops it loses the subflows for that connection (this makes sense). If the interface comes back, MPTCP does not "heal" and add replacement subflows using that interface for existing connections.

I do not know whether MPTCP even has the ability to add subflows for currently existing/ongoing connections. However, when the interface is up/status page is all green, it seems to create new connections that use multipath/aggregation.

E.g. when I have a client device TCP VPN running for hours, it will eventually disaggregate and use only 1-2 WANs (one is always the master), as evidenced by looking at the bandwidth graphs where one or more WANs have 0 traffic.

Restarting the client device TCP VPN can restore aggregated performance, where all WANs show 40+ Mbps traffic during speedtests.
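A rough way to look for missing subflows is to count established TCP sockets per local address on the router, since each WAN has its own local address. This sketch assumes that ss is installed and that MPTCP subflows show up as ordinary TCP sockets in its output, which has not been verified on this kernel; `estab_per_local_addr` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: count established TCP sockets per local address from `ss -tn` output.
estab_per_local_addr() {
    # Reads `ss -tn` output on stdin; prints "<count> <local address>" lines.
    awk '$1 == "ESTAB" {
             addr = $4
             sub(/:[0-9]+$/, "", addr)   # strip the trailing :port
             n[addr]++
         }
         END { for (a in n) print n[a], a }' | sort -k2
}

# Intended use on the router (commented out so the sketch has no side effects):
#   ss -tn | estab_per_local_addr
```

If a WAN's local address disappears from the list after omr-tracker switches it off and never comes back for long-lived connections, that would be consistent with the behaviour described above.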

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days