Ysurac / openmptcprouter

OpenMPTCProuter is an open source solution to aggregate multiple internet connections using Multipath TCP (MPTCP) on OpenWrt
https://www.openmptcprouter.com/
GNU General Public License v3.0
1.73k stars 252 forks source link

Multipath not restored after interface is restarted, OMR becomes single path TCP router. #2936

Closed ioogithub closed 3 months ago

ioogithub commented 10 months ago

Expected Behavior

Multipath should be restored to the same state after omr-tracker stops and starts an interface with all wan connections used.

Current Behavior

Multipath is broken after omr-tracker restarts an interface, traffic now only uses one wan, OpenMPTCProuter becomes OpenSPTCProuter (single path TCP router).

Steps to Reproduce the Problem

  1. Port forwarding is using v2ray and is working as expected, traffic uses both wan1 and wan2 interfaces.

  2. Start a file upload.

  3. Observe o the bandwidth page that OMR is correctly using all wan connections (wan1 and wan2): https://ibb.co/2YZDzBB

  4. OMR-tracker switches wan2 interface off (due to ping or error):

    Thu Aug 24 17:54:56 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched off because check error and ping from 10.0.0.202 error (9.9.9.9,1.0.0.1,114.114.115.115)
    Thu Aug 24 17:54:56 2023 user.notice post-tracking-post-tracking: Delete default route to x.x.x.x via y.y.y.y dev eth2
  5. OMR-tracker switches wan2 interface back on:

    Thu Aug 24 18:01:45 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched up
    Thu Aug 24 18:01:47 2023 user.notice post-tracking-post-tracking: Interface route not yet set, set route ip r add default via y.y.y.y dev eth2 metric 4
    Thu Aug 24 18:01:47 2023 user.notice post-tracking-post-tracking: New public ip detected for wan2 (eth2): x.x.x.x
    Thu Aug 24 18:01:47 2023 user.notice post-tracking-post-tracking: Reload MPTCP for eth2
  6. Start a new upload, traffic now only uses wan1, router is not Multi-path any longer: https://ibb.co/94QZJLq

  7. Router never recovers from this state. Start another upload 1 hour later and it still only uses wan1.

  8. ip r shows there are routes and default routes for both wan1 and wan2 but MPTCP refuses to use wan2 after OM-tracker restarts it.

Possible Solution 1

I tried two steps to fix the problem:

  1. First I tried resetting using this command: /etc/init.d/openmptcprouter-vps restart
  2. When I observelogread -fon OMR andjournalctl -f on VPS I do not see any log events after this comamnd! This command executed and exits but it didn't do anything observable from the logs

Possible Solution 2:

  1. Click Save and Apply from the wizard page.
  2. This ultimately fixes the problem and traffic starts using wan1 and wan2 again however this is not a good solution as it majorly disrupts the network and is manual intervention.
  3. I can see from the log above: Thu Aug 24 18:01:47 2023 user.notice post-tracking-post-tracking: Reload MPTCP for eth2 I guess there is where the bug is, Reload MPTCP is not properly restoring the MPTCP bond. Is there a way to get more information on what MPTCP is doing here, is there any debug mode?

I have been tracking this problem for a long time where I see the performance of the system degrade over time. I didn't have the knowledge until recently to isolate the bug and report it. I am available to test any solutions.

Context (Environment)

The issue is bad because it effectively breaks OMR. If only a single path is used after OMR-tracker then there is no purpose to run OMR at all. Also there is no way currently to recover from this problem.

Specifications

ioogithub commented 10 months ago

I have more information. I can reproduce the bug, not with omr-tracker but by stopping the interface (wan2) and starting again. After I start it, OMR never uses it for MPTCP so it becomes single path router. Here is the log wan2 started again:

Thu Aug 24 19:54:50 2023 kern.info kernel: [23103.714924] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Thu Aug 24 19:54:56 2023 daemon.notice netifd: wan2 (30320): udhcpc: sending select for x.x.x.x
Thu Aug 24 19:54:56 2023 daemon.notice netifd: wan2 (30320): udhcpc: lease of x.x.x.x obtained, lease time 172800
Thu Aug 24 19:54:56 2023 daemon.notice netifd: Interface 'wan2' is now up
Thu Aug 24 19:54:56 2023 daemon.notice netifd: Network device 'eth2' link is down
Thu Aug 24 19:54:56 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Thu Aug 24 19:54:57 2023 user.notice MPTCP: Flush route cache
Thu Aug 24 19:54:57 2023 user.notice firewall: Reloading firewall due to ifup of wan2 (eth2)
Thu Aug 24 19:54:57 2023 user.notice firewall.omr-server: Firewall reload, set server part firewall reloading
Thu Aug 24 19:54:57 2023 user.notice mptcp: Reloading mptcp config due to ifup of wan2 (eth2)
Thu Aug 24 19:54:59 2023 user.notice post-tracking-post-tracking: Set firewall on server vps
Thu Aug 24 19:55:00 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Aug 24 19:55:00 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Aug 24 19:55:00 2023 kern.info kernel: [23113.674687] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Thu Aug 24 19:55:02 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched up
Thu Aug 24 19:55:46 2023 user.notice post-tracking-post-tracking: Check API configuration...
Thu Aug 24 19:55:52 2023 user.notice post-tracking-post-tracking: Check API configuration... Done

So I guess Thu Aug 24 19:54:57 2023 user.notice mptcp: Reloading mptcp config due to ifup of wan2 (eth2) is the problem.

I checked ip r before restarting the interface and after bringing it back up. All routes the the same but how can I tell MPTCP to use both wan1 and wan2 after wan2 comes back online?

fareign commented 10 months ago

try to reload v2ray? only some VPN use MPTCP.

ioogithub commented 10 months ago

try to reload v2ray? only some VPN use MPTCP. I tried to reload with /etc/init.d/openmptcprouter-vps restart but it executes and exits and doesn't restart anything that I can see in the logs.

If I reload with /etc/init.d/v2ray reload it doesn't seem to fix the problem, traffic still comes in over only 1 wan and I see these errors:

[info] DNS disabled: main_dns
[info] Setting transparent proxy on port: 1234
[info] Transparent proxy mode: default
Flush terminated
ss-rules6: unknown option def
iptables-restore: line 2 failed
iptables-restore: line 2 failed
iptables-restore: line 2 failed
iptables-restore: line 2 failed
iptables-restore: line 2 failed
iptables-restore: line 2 failed
iptables-restore: line 2 failed
iptables-restore: line 2 failed
Warning: Section 'zone_vpn' cannot resolve device of network 'omr6in4'
Warning: Option @redirect[3].v2ray is unknown
Warning: Option @redirect[4].v2ray is unknown
Warning: Section @redirect[4] (19224 on v2ray) does not specify a protocol, assuming TCP+UDP

If I restart with /etc/init.d/v2ray restart then it does work but it creates a new problem, all clients on the network get disconnected so this is too disruptive as omr-tracker can restart an interface at any time.

MPTCP is working perfectly until an interface is restarted or omr-tracker tracker bring an interface down and up. After that this interface cannot be used for multi path.

There has to be a way to tell v2ray that the multi path is available to be used again. When I bring the interface down, OMR knows how to fail over to one interface successfully, it should be able to recover when the interface comes back up.

Ysurac commented 10 months ago

Do you have same issue using Shadowsocks ? And do you have same issue using 6.1 snapshot image ?

ioogithub commented 10 months ago

Do you have same issue using Shadowsocks ? I just did 3 tests with v2ray and 3 tests with shadowsocks. The problem affects both v2ray and shadowsocks proxies.

Test:

  1. reboot OMR, start file upload transfer, observe normal multi-path
  2. restart wan2 interface
  3. transfer a file (upload), observe traffic on only wan1 interface.

I can repeat the test by restating wan1 and same result. After interface restarts the proxy does not use it for multi-path.

Here is the result:

Normal multi-path: before restart

After wan2 interface is reset, multi-path is broken: after restart

I am looking at the logs after restarting an interface trying to find out what is not running. I see one log entry: Fri Aug 25 13:10:49 2023 daemon.notice netifd: wan2 (23088): Command failed: Permission denied but it doesn't say what command failed.

Please help me to troubleshoot:

I think this is a serious problem because omr-tracker can restart interface at any time and that disables multipath on OMR router, user will not realize that OMR is running in degraded performance.

If OMR can successfully degrade the router to single path it should be able to upgrade it back to multi-path as well right?

Ysurac commented 10 months ago

It's not a know problem and I don't have this issue on my 0.59.1 (and still not with latest snapshot). What is the result of multipath, ip a and ip r commands ?

ioogithub commented 10 months ago

Did you test it with an upload so it would use the port forward and the proxy, either shadowsocks or v2ray same result for me?

I think I saw download is working once but I am not sure because testing uploads now.

I am looking at ip a and ip r and ip rule and multipath. I will compare results of working fresh reboot of OMR and results after interface is restarted and incoming transfers not using multi-oath.

ioogithub commented 10 months ago

I did lots of tests:

  1. restart OMR, capture ip r, ip a, ip rule, multipath settings
  2. restart wan2
  3. capture all settings
  4. compare settings from step 1 and 3.

The routes, rules, interfaces and multipath settings are all identical when its working and when it is broken after restarting an interface.

It's not a know problem and I don't have this issue on my 0.59.1 (and still not with latest snapshot). What is the result of multipath, ip a and ip r commands ?

Here are the outputs, everything looks normal there is nothing to me:

ip r

default via 10.255.255.1 dev tun0 
default metric 1 
        nexthop via 10.6.0.1 dev eth1 weight 2 
        nexthop via 10.0.0.1 dev eth2 weight 1 
default via 10.6.0.1 dev eth1 metric 3 
default via 10.0.0.1 dev eth2 metric 4 
default via 10.255.255.1 dev tun0 metric 1200 
10.0.0.0/24 via 10.0.0.1 dev eth2 
10.0.0.0/24 dev eth2 scope link metric 4 
10.255.255.1 dev tun0 proto kernel scope link src 10.255.255.2 
10.255.255.2 dev tun0 scope link metric 1200 
6.1.2.2 dev eth1 proto static scope link src 10.6.8.1 metric 3 
10.6.0.0/10 dev eth1 scope link metric 3 
127.0.0.0/8 dev lo proto static scope link metric 6 
x.x.x.x metric 1 
        nexthop via 10.6.0.1 dev eth1 weight 10 
        nexthop via 10.0.0.1 dev eth2 weight 1 

multipath:

bond0 is deactivated
erspan0 is deactivated
eth0 is deactivated
eth1 is in default mode
eth2 is in default mode
eth3 is deactivated
gre0 is deactivated
gretap0 is deactivated
ifb4tun0 is deactivated
ip6gre0 is deactivated
ip6tnl0 is deactivated
lo is deactivated
sit0 is deactivated
teql0 is deactivated
tun0 is deactivated

ip a:

1: lo: <LOOPBACK,UP,LOWER_UP,80000> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ip6tnl0@NONE: <NOARP,80000> mtu 1452 qdisc noop state DOWN group default qlen 1000
    link/tunnel6 :: brd ::
3: sit0@NONE: <NOARP,80000> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
4: gre0@NONE: <NOARP,80000> mtu 1476 qdisc noop state DOWN group default qlen 1000
    link/gre 0.0.0.0 brd 0.0.0.0
5: gretap0@NONE: <BROADCAST,MULTICAST,80000> mtu 1462 qdisc noop state DOWN group default qlen 1000
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
6: erspan0@NONE: <BROADCAST,MULTICAST,80000> mtu 1450 qdisc noop state DOWN group default qlen 1000
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
7: ip6gre0@NONE: <NOARP,80000> mtu 1448 qdisc noop state DOWN group default qlen 1000
    link/gre6 :: brd ::
8: bond0: <BROADCAST,MULTICAST,MASTER,80000> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b1:b1:b1:b1:b1:b1 brd ff:ff:ff:ff:ff:ff
9: teql0: <NOARP,80000> mtu 1500 qdisc noop state DOWN group default qlen 100
    link/void 
10: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP,80000> mtu 1500 qdisc mq state UP group default qlen 100
    link/ether a:aa:aa:aa:aa:aa brd ff:ff:ff:ff:ff:ff
    inet 192.168.x.x/24 brd 192.168.200.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 bbbb:bbbb:bbbb::1/60 scope global noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 cccc:cccc:cccc:cccc:cccc/64 scope link 
       valid_lft forever preferred_lft forever
11: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 100
    link/ether dd:dd:dd:dd:dd:dd brd ff:ff:ff:ff:ff:ff
    inet 10.6.8.1/10 brd 100.127.255.255 scope global eth1
       valid_lft forever preferred_lft forever
12: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP group default qlen 100
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.2/24 brd 10.0.0.255 scope global eth2
       valid_lft forever preferred_lft forever
13: eth3: <NO-CARRIER,BROADCAST,MULTICAST,UP,80000> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether a1:a1:a1:a1:a1:a1 brd ff:ff:ff:ff:ff:ff
    inet 192.168.a.b/24 brd 192.168.200.255 scope global eth3
       valid_lft forever preferred_lft forever
21: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP,80000> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 100
    link/none 
    inet 10.255.255.2 peer 10.255.255.1/32 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 a2a2:a2a2:a2a2:a2a2:a2a2/64 scope link stable-privacy 
       valid_lft forever preferred_lft forever
24: ifb4tun0: <BROADCAST,NOARP,80000> mtu 1500 qdisc noop state DOWN group default qlen 32
    link/ether a3:a3:a3:a3:a3:a3 brd ff:ff:ff:ff:ff:ff

Do you have any ideas of what else I can look at, its like the proxy get stuck on a single path or it doesn't get notified that multi-path is available again.

Questions:

If ip r, ip rule etc. are the same with it is working and not working it must be something else, what else should I look for?

Ysurac commented 10 months ago

Only mptcp init script is used, to set route table and multipath command to set multipath status of an interface. I don't understand exactly the issue, it's only with port forwarding ? In this case with shadowsocks as Proxy only VPN is used and with v2ray it's used if you have the checkbox enabled in firewall configuration. It's aggregated in download and not on port forwarding upload in both case ?

What is the MPTCP scheduler used ?

ioogithub commented 10 months ago

mptcp init script is used, to set route table

this script sets route table: /etc/init.d/mptcp

multipath command to set multipath status of an interface

Is this command inside the mptcp init script?

But if the route table and multipath status is the same before and after the interface is restarted then the problem is somewhere else.

Does this mean that only the routes and multipath status of interface determines if the proxy will use multipath? So the proxy pushes data out and the kernel uses the routing table to route over multiple interfaces? Does this mean the proxy doesn't know about MPTCP? There are no other settings that notify the proxy?

I don't understand exactly the issue, it's only with port forwarding ?

I think it is only uploads, when a client from outside the network requests data from a computer inside the network after an interface has been restarted, it does not use that interface. It sends data only on the other interface. In this case port forwarding is used. I don't know if port forwarding is involved but downloads do not use that port forwarding.

In this case with shadowsocks as Proxy only VPN is used

For this I made used used different port forwarding rules which swtching to shadowsocks.

with v2ray it's used if you have the checkbox enabled in firewall configuration.

Yes I am using the v2ray checkbox port forwarding rule.

All of this works normally before an interface is stopped and started or restarted. After an interface is restarted there is no way to get OMR to use that interface for uploads.

I see the problem with both v2ray and shadowsocks:

It seems like the proxy gets stuck, it doesn't get notified that both wan1 and wan2 are available.

I will start a fresh test:

  1. reboot omr (known working state)
  2. restart wan1
  3. start an upload and look at bandwidth for aggregate.
  4. start a download and look at bandwidth for aggregate.

What is the MPTCP scheduler used ?

scheduler: BLEST congestion control: reno path manager: full mesh TCP sync: 2 MPTCP version: 0 MPTCP checksum: disabled

There is an MPTCP debug if I set it will we get more data in the logs?

If you have any more ideas please let me know.

ioogithub commented 10 months ago

I have done a lot of testing today, here is the result:

v2ray multi-path

uploads

downloads

shadowsocks multi-path

uploads

downloads

Conclusion

I initially thought the problem was uploads for shadowsocks and v2ray but I was wrong. When shadowsocks, sometimes it starts working only after a delay of up to 2 minutes. Sometimes it starts right away. In my previous tests I did not wait long enough. I did see it working today.

So the problem is isolated to a specific use case: v2ray uploads (traffic initiated from outside the network using the v2ray port forwarding) only. I have never seen this work.

I have also tried the following:

Nothing I can do will restore aggregate multi-path for uploads with v2ray once an interface is restarted. Traffic always flows though the one interface that was not restarted.

The only way to recover from this is to click Save and Apply on the Wizard screen or reboot OMR.

Question: how is the MPTCP actually working on OMR? The v2ray config.json doesn't have MPTCP set in the stream settings, it is just TCP so I guess v2ray doesn't even know about MPTCP so how does it even work?

It's not a know problem and I don't have this issue on my 0.59.1 (and still not with latest snapshot). What is the result of multipath, ip a and ip r commands ?

You tested tested uploads with port forwarding using v2ray as the proxy with 0.59.1 and it is working for you correct?

  1. You are able to start an upload
  2. see traffic flowing over multiple interfaces in the bandwidth graph
  3. click on restart for one interface
  4. observe the interface is back online on the status page and logread
  5. observe that traffic stops flowing over the restarted interface.
  6. wait 10 minutes
  7. start another upload, observe traffic still only flowing on the unstarted interface.

If this works for you and it does not work for me that suggests a configuration problem. Tomorrow I will setup a VPS with the 6.1 upstream kernel and boot my router with openmptcprouter-v0.59.2alpha-6.1-r0+23789-ce6ad123e7-x86-64-generic-ext4-combined-efi, configure v2ray and port forwarding and repeat the tests above. I will report the results back here.

ioogithub commented 10 months ago

Here is the result of testing with openmptcprouter v0.59.2alpha-6.1 r0+23789-ce6ad123e7.

  1. create new VPS: wget -O - http://www.openmptcprouter.com/server-test/debian-x86_64.sh | UPSTREAM6="yes" sh
  2. book router with image: openmptcprouter-v0.59.2alpha-6.1-r0+23789-ce6ad123e7-x86-64-generic-ext4-combined-efi.img.gz

Configure openwrt:

  1. Set LAN IP, set wan1 and wan2 to default and set force link
  2. Set server IP and server key.
  3. Create one port forwarding rule
  4. Everything else defaults.

shadowsocks upload: multipath working shadowsocks upload after interface reset: multipath working

Configure V2ray:

  1. Set v2ray vless
  2. Create port forwarding, no v2ray check.
  3. Edit port forward and add v2ray check.
  4. Check /etc/shorewall/rules, correct.
  5. Confirm proxy is receiving traffic on OMR status page. Known issue: No proxy displayed here.
  6. Confirm traffic on port 65228 (tcpdump -i eth port 65228: yes

Test v2ray upload after interface restart.

  1. Start an upload using v2ray: multi-path working v2rayeforerestart

  2. Reset wan1 interface: multi-path not working v2rayafterrestartwan1

Same results as the latest stable release. Uploads do not use aggregate multi-path with v2ray as the proxy after an interface is restarted.

There is currently no way to recover from this.

With stable, at lease Save and Apply or reboot OMR will fix the problem. In 6.1 this does not work. So the this issue at this time it is better to stay with stable until a work around is found.

ioogithub commented 10 months ago

This is an example of an event that kills aggregate uploads:

Aug 28 03:33:19 OpenMPTCProuter user.notice post-tracking-post-tracking: wan2 (eth2) switched off because check error and ping from x.x.x.x error (1.0.0.1,114.114.115.115,1.2.4.8)
Aug 28 03:33:19 OpenMPTCProuter user.notice post-tracking-post-tracking: Delete default route to y.y.y.y via x.x.x.x dev eth2
Aug 28 03:41:15 OpenMPTCProuter daemon.notice netifd: Network device 'eth2' link is down
Aug 28 03:41:15 OpenMPTCProuter daemon.notice netifd: Interface 'wan2' has link connectivity loss
Aug 28 03:41:22 OpenMPTCProuter daemon.notice netifd: Network device 'eth2' link is up
Aug 28 03:41:22 OpenMPTCProuter daemon.notice netifd: Interface 'wan2' has link connectivity 
Aug 28 03:41:25 OpenMPTCProuter user.notice post-tracking-post-tracking: wan2 (eth2) switched up
Aug 28 03:41:25 OpenMPTCProuter user.notice post-tracking-post-tracking: Interface route not yet set, set route ip r add default via x.x.x.x dev eth2 metric 4
Aug 28 03:41:26 OpenMPTCProuter user.notice post-tracking-post-tracking: New public ip detected for wan2 (eth2): a.a.a.a
Aug 28 03:41:26 OpenMPTCProuter user.notice post-tracking-post-tracking: Reload MPTCP for eth2

The OMR-tracker brings the interface down and up again. After this, OMR is in single mode and will not use wan2 again until the server is restarted so this issue has a big impact on OMR stability and reliability.

There has to be a way to have uploads recover without manual user intervention after these interface reset events.

Ysurac commented 10 months ago

On latest snapshot, I'm not able to reproduce the issue for now using "iperf3" to test upload speed with V2Ray VLESS. How do you test upload speed ? By downloading externally using port forward ?

ioogithub commented 10 months ago

Yes, I am doing a real world test like this:

external client -> VPS (public IP) -> port forward OMR via v2ray vless -> host on LAN.

What exact test are you running? omr-iperfon OMR? If I do that right now this is what I get but I only see one wan connection use so maybe this test is not correct test: https://ibb.co/L5W55Tg

For my test, after OMR is restarted, If I do a download from an external client (upload) in a good state, I get this: https://ibb.co/LvLDFn8

If I do a download from an external client (upload) after an interface was restarted (by omr-tracker) or manually, I get this: https://ibb.co/tDfRGD6

It never recovers from this. omr-tracker reset wan2 8 hours ago: Aug 28 07:43:41 OpenMPTCProuter user.notice post-tracking-post-tracking: wan2 (eth2) switched off because check error and ping from x.x.x.x error (1.0.0.1,114.114.115.115,1.2.4.8), so OMR has been sitting in single path mode for uploads since this time.

If I do this: /etc/init/d/v2ray restart or wizard validate on the stable version, it will return to normal aggregate however this operation is disruptive, it kills all client connections and is manual and it only lasts until omr-tracker resets one of the interfaces again.

ioogithub commented 10 months ago

here is a new upload test as discussed using iperf and a new port foward setup for the test.

iperf test:

result:

result, bandwidth never return to wan2: https://ibb.co/x8tJqQf start a new file transfer: https://ibb.co/PY1vNsz you can see, on the first screenshot, when wan2 goes down, omr switches to wan1 but after wan2 comes back up it never recovers.

The problem is with stable /etc/init.d/v2ray restart would fix the problem but disconnect all clients on the network. With 6.1 testing, after running /etc/init.d/v2ray restart packets do not leave the vps and get to omr. The only way to recover and start a new test is to restart both OMR and VPS. Validating wizard and v2ray restart, packets do not get past vps with tcpdump -i eth0 port 19223

Ysurac commented 10 months ago

I've tested, using V2RAY VLESS protocol redirect port to internal http server and downloading file from external and no problem, when I remove a connection and put it back it's used again. I have also no problem with v2ray restart.

Can you give me full log when you test disconnect and then reconnect a connection ? I will reset my install to a fresh config for more tests.

ioogithub commented 10 months ago

Can you give me full log when you test disconnect and then reconnect a connection ?

Yes, please let me know exactly which log you are looking for ? logread from OMR, journalctl -f from vps or something else? Is there any debug mode I can enable to get more info?

Ysurac commented 10 months ago

I would need system log from the router, so the result of logread.

I tested with a new VPS and router installation and I have no issue. What is the result of uci show network via SSH on router ?

ioogithub commented 10 months ago

I have been testing with the 6.1 kernel and switched back to 0.51.9 recently as it is easier for me to recover.

What is the result of uci show network

network.loopback=interface
network.loopback.proto='static'
network.loopback.ipaddr='127.0.0.1'
network.loopback.netmask='255.0.0.0'
network.loopback.multipath='off'
network.loopback.device='lo'
network.loopback.metric='6'
network.globals=globals
network.globals.mptcp_subflows='3'
network.globals.mptcp_add_addr_accepted='1'
network.globals.mptcp_add_addr_timeout='120'
network.globals.mptcp_fullmesh_num_subflows='1'
network.globals.mptcp_fullmesh_create_on_err='1'
network.globals.mptcp_ndiffports_num_subflows='1'
network.globals.ula_prefix='abac:4c32:5cbb::/48'
network.globals.mptcp_version='0'
network.globals.mptcp_rr_cwnd_limited='Y'
network.globals.mptcp_rr_num_segments='1'
network.globals.multipath='enable'
network.globals.mptcp_syn_retries='2'
network.globals.mptcp_scheduler='blest'
network.globals.congestion='reno'
network.globals.mptcp_path_manager='fullmesh'
network.globals.mptcp_checksum='0'
network.globals.mptcp_debug='0'
network.lan=interface
network.lan.proto='static'
network.lan.ipaddr='192.168.x.x'
network.lan.netmask='255.255.255.0'
network.lan.device='eth0'
network.lan.ifname='eth0'
network.lan.delegate='0'
network.lan.addlatency='0'
network.lan.multipath='off'
network.lan.ip4table='lan'
network.lan.metric='7'
network.lan.label='toMesh'
network.lan.defaultroute='0'
network.lan.peerdns='0'
network.lan.ip6assign='60'
network.lan.ipv6='0'
network.lan.force_link='1'
network.lan_rule=rule
network.lan_rule.lookup='lan'
network.lan_rule.priority='100'
network.wan1=interface
network.wan1.device='eth1'
network.wan1.ip4table='wan'
network.wan1.defaultroute='0'
network.wan1.addlatency='0'
network.wan1.ipv6='0'
network.wan1.peerdns='0'
network.wan1.proto='dhcp'
network.wan1.label='starlink'
network.wan1.force_link='1'
network.wan1.metric='3'
network.wan1.multipath='master'
network.wan1_dev=device
network.wan1_dev.name='eth1'
network.wan1_dev.txqueuelen='20'
network.wan2=interface
network.wan2.device='eth2'
network.wan2.ip4table='wan'
network.wan2.defaultroute='0'
network.wan2.addlatency='0'
network.wan2.ipv6='0'
network.wan2.peerdns='0'
network.wan2.proto='dhcp'
network.wan2.force_link='1'
network.wan2.metric='4'
network.wan2.label='lte'
network.wan2.multipath='on'
network.wan2_dev=device
network.wan2_dev.name='eth2'
network.wan2_dev.txqueuelen='20'
network.omrvpn=interface
network.omrvpn.device='tun0'
network.omrvpn.ip4table='vpn'
network.omrvpn.multipath='off'
network.omrvpn.leasetime='12h'
network.omrvpn.type='tunnel'
network.omrvpn.txqueuelen='100'
network.omrvpn.metric='1200'
network.omrvpn.proto='none'
network.omr6in4=interface
network.omr6in4.proto='6in4'
network.omr6in4.ip4table='vpn'
network.omr6in4.multipath='off'
network.omr6in4.ipaddr='10.255.255.2'
network.omr6in4.peeraddr='10.255.255.1'
network.omr6in4.ip6addr='fe80::a00:2/126'
network.omr6in4.gateway='fe80::a00:1/126'
network.omr6in4.force_link='1'
network.omr6in4.metric='1201'
network.lan_dev=device
network.lan_dev.name='eth0'
network.wan3_dev=device
network.wan3_dev.type='macvlan'
network.wan3_dev.mode='vepa'
network.wan3_dev.ifname='eth0'
network.wan3_dev.name='wan3'
network.wan3_dev.txqueuelen='20'
network.localNIC=interface
network.localNIC.proto='static'
network.localNIC.device='eth3'
network.localNIC.netmask='255.255.255.0'
network.localNIC.multipath='off'
network.localNIC.addlatency='0'
network.localNIC.metric='11'
network.localNIC.ipaddr='192.168.y.y'
network.localNIC.ipv6='0'
network.localNIC.defaultroute='0'
network.localNIC.peerdns='0'
network.localNIC_dev=device
network.localNIC_dev.name='eth3'
network.@route[2]=route
network.@route[2].interface='lan'
network.@route[2].target='192.168.a.a'
network.@route[2].netmask='255.255.255.0'
network.@route[2].gateway='192.168.x.xyy'

I would need system log from the router, so the result of logread.

Here is the log from the moment I clicked reset on the wan2 interface. After this I just see expected lease renewal messages, nothing else related to connections and the log is quiet.

Thu Aug 31 14:13:42 2023 daemon.notice netifd: Interface 'wan2' is now down
Thu Aug 31 14:13:42 2023 kern.info kernel: [ 3351.120795] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Thu Aug 31 14:13:42 2023 daemon.notice netifd: Interface 'wan2' is disabled
Thu Aug 31 14:13:42 2023 daemon.notice netifd: Interface 'wan2' is enabled
Thu Aug 31 14:13:42 2023 daemon.notice netifd: Interface 'wan2' is setting up now
Thu Aug 31 14:13:42 2023 daemon.notice netifd: Network device 'eth2' link is down
Thu Aug 31 14:13:42 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Thu Aug 31 14:13:42 2023 daemon.notice netifd: wan2 (29374): udhcpc: started, v1.33.2
Thu Aug 31 14:13:42 2023 daemon.notice netifd: wan2 (29374): udhcpc: sending discover
Thu Aug 31 14:13:42 2023 kern.err kernel: [ 3351.487765] __mptcp_init4_subsockets: token 0xa463fb6a bind() to y.y.y.y index 12 failed, error -99
Thu Aug 31 14:13:43 2023 user.notice MPTCP: Flush route cache
Thu Aug 31 14:13:44 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched off because wan2 may have ip issues, interface have no IPv4, interface have no IPv4 gateway
Thu Aug 31 14:13:44 2023 user.notice post-tracking-post-tracking: Delete default route to x.x.x.x via  dev eth2
Thu Aug 31 14:13:45 2023 daemon.notice netifd: wan2 (29374): udhcpc: sending discover
Thu Aug 31 14:13:46 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Aug 31 14:13:46 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Aug 31 14:13:46 2023 kern.info kernel: [ 3354.700767] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Thu Aug 31 14:13:49 2023 daemon.notice netifd: wan2 (29374): udhcpc: sending discover
Thu Aug 31 14:13:52 2023 daemon.notice netifd: wan2 (29374): udhcpc: sending select for y.y.y.y
Thu Aug 31 14:13:52 2023 daemon.notice netifd: wan2 (29374): udhcpc: lease of y.y.y.y obtained, lease time 172800
Thu Aug 31 14:13:52 2023 daemon.notice netifd: Interface 'wan2' is now up
Thu Aug 31 14:13:52 2023 daemon.notice netifd: Network device 'eth2' link is down
Thu Aug 31 14:13:52 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Thu Aug 31 14:13:52 2023 user.notice firewall: Reloading firewall due to ifup of wan2 (eth2)
Thu Aug 31 14:13:52 2023 user.notice firewall.omr-server: Firewall reload, set server part firewall reloading
Thu Aug 31 14:13:53 2023 user.notice post-tracking-post-tracking: Set firewall on server vps
Thu Aug 31 14:13:54 2023 daemon.err glorytun[14226]: read: Operation timed out
Thu Aug 31 14:13:54 2023 daemon.info glorytun[14226]: STOPPED tun0
Thu Aug 31 14:13:55 2023 daemon.err glorytun[14226]: x.x.x.x.65001: connected
Thu Aug 31 14:13:55 2023 daemon.info glorytun[14226]: STARTED tun0
Thu Aug 31 14:13:55 2023 user.notice MPTCP: Add route z.z.z.z/24 via z.z.z.b dev eth2
Thu Aug 31 14:13:55 2023 user.notice MPTCP: Add route a.a.a.a via a.a.a.b dev eth0
Thu Aug 31 14:13:55 2023 user.notice mptcp: Reloading mptcp config due to ifup of wan2 (eth2)
Thu Aug 31 14:13:55 2023 user.notice MPTCP: Add route z.z.z.z/24 via z.z.z.b dev eth2
Thu Aug 31 14:13:56 2023 user.notice MPTCP: Add route a.a.a.a via a.a.a.b dev eth0
Thu Aug 31 14:13:56 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Aug 31 14:13:56 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Aug 31 14:13:56 2023 kern.info kernel: [ 3364.671214] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Thu Aug 31 14:13:56 2023 user.notice MPTCP: Flush route cache
Thu Aug 31 14:13:56 2023 user.notice MPTCP: Add route z.z.z.z/24 via z.z.z.b dev eth2
Thu Aug 31 14:13:56 2023 user.notice MPTCP: Add route a.a.a.a via a.a.a.b dev eth0
Thu Aug 31 14:13:56 2023 user.notice MPTCP: Flush route cache
Thu Aug 31 14:13:58 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched up
Thu Aug 31 14:14:01 2023 user.notice MPTCP: Add route z.z.z.z/24 via z.z.z.b dev eth2
Thu Aug 31 14:14:01 2023 user.notice MPTCP: Add route a.a.a.a via a.a.a.b dev eth0
Thu Aug 31 14:14:45 2023 user.notice post-tracking-post-tracking: Check API configuration...
Thu Aug 31 14:14:53 2023 user.notice post-tracking-post-tracking: Check API configuration... Done

After this point wan2 does not recover: Screenshot_20230831_141709

Ysurac commented 10 months ago

Can you try to do multipath eth2 off && multipath eth2 on to check if this change something ?

ioogithub commented 10 months ago

multipath eth2 off && multipath eth2 on

  1. start a transfer after interface reset when OMR is in single wan state
  2. multipath eth2 off && multipath eth2 on
  3. No interruption or change to the transfer
  4. Also no notice in logread -f from this command.

Result: I cannot see any effect from this command.

There was one error after i reset wan2: Thu Aug 31 14:13:42 2023 kern.err kernel: [ 3351.487765] __mptcp_init4_subsockets: token 0xa463fb6a bind() to y.y.y.y index 12 failed, error -99

Could this be the problem it happens when wan2 is brought down.

I am going to reboot OMR to a known good state and watch the logs closely again.

ioogithub commented 10 months ago

Here is a log with two question I have.

  1. OMR rebooted at 15:58. At 16:14 this message:

Thu Aug 31 16:14:21 2023 kern.info kernel: [ 1079.764399] igc 0000:04:00.0: changing MTU from 1500 to 1460 and this change causes eth2 to go down and up:

Thu Aug 31 16:14:21 2023 daemon.notice netifd: Network device 'eth2' link is down
Thu Aug 31 16:14:21 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Thu Aug 31 16:14:25 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Aug 31 16:14:25 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Aug 31 16:14:25 2023 kern.info kernel: [ 1083.381921] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

Is there an OMR process that is monitoring something and trying to optimize MTU and changes it while the interface is up and running? The consequence is that it restarts the interface after and then the interface isn't used for uploads because of this issue.

  1. I also see these messages repeated:
Thu Aug 31 16:02:06 2023 user.notice MPTCP: Flush route cache
Thu Aug 31 16:02:06 2023 user.notice MPTCP: Add route 10.x.x.x/24 via 10.x.x.y dev eth2
Thu Aug 31 16:02:06 2023 user.notice omr-bypass: Starting OMR-ByPass...
Thu Aug 31 16:02:07 2023 user.notice omr-bypass: OMR-ByPass is running
Thu Aug 31 16:02:07 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Thu Aug 31 16:02:07 2023 user.notice omr-tracker: Launching...
Thu Aug 31 16:02:14 2023 user.notice omr-tracker: Launched
Thu Aug 31 16:02:15 2023 daemon.info omr-tracker-v2ray: V2Ray is up (can contact via http 1.0.0.1)

Sometimes I see these message every few minutes and sometimes I do not see them for 1 hour in the log. What are these for and what determines how often they will run? Is this normal?

Also, is glorytun actually retarting everytime this is logged: Thu Aug 31 16:02:07 2023 daemon.info glorytun: starting glorytun vpn instance vpn

ioogithub commented 10 months ago

I rebooted OMR at 17:00, same issue, the kernel or some OMR process changed the MTU again at 17:13:

Thu Aug 31 17:14:35 2023 kern.info kernel: [  966.328211] igc 0000:04:00.0: changing MTU from 1500 to 1460
Thu Aug 31 17:14:36 2023 daemon.notice netifd: Network device 'eth2' link is down
Thu Aug 31 17:14:36 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Thu Aug 31 17:14:39 2023 user.notice post-tracking-post-tracking: Check API configuration...
Thu Aug 31 17:14:39 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Aug 31 17:14:39 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Aug 31 17:14:39 2023 kern.info kernel: [  969.825546] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

and then the log file is quite, no omr-tracker or glorytun log messages for 2 hours.

What is the purpose of resetting the MTU and why don't I see the omr-tracking, glorytun, and mtcp flush cache messages any longer?

ioogithub commented 10 months ago

As soon as I initiate 1 file upload, these messages appear again, every few minutes. Are they related to connections or file transfers? Are they related to this issue?

Thu Aug 31 19:29:27 2023 user.notice omr-tracker: Launching...
Thu Aug 31 19:29:34 2023 user.notice omr-tracker: Launched
Thu Aug 31 19:29:34 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Thu Aug 31 19:29:34 2023 user.notice MPTCP: Flush route cache
Thu Aug 31 19:29:34 2023 user.notice MPTCP: Add route 10.a.a.a/24 via 10.0.0.1 dev eth2
Thu Aug 31 19:29:34 2023 user.notice omr-bypass: Starting OMR-ByPass...
Thu Aug 31 19:29:35 2023 daemon.info omr-tracker-v2ray: V2Ray is up (can contact via http 1.0.0.1)
Thu Aug 31 19:29:35 2023 user.notice omr-bypass: OMR-ByPass is running
Thu Aug 31 19:29:35 2023 user.notice omr-tracker: Launching...
Thu Aug 31 19:29:42 2023 user.notice omr-tracker: Launched
Thu Aug 31 19:29:42 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Thu Aug 31 19:29:42 2023 user.notice MPTCP: Flush route cache
Thu Aug 31 19:29:43 2023 user.notice MPTCP: Add route 10.a.a.a/24 via 10.0.0.1 dev eth2
Thu Aug 31 19:29:43 2023 user.notice omr-bypass: Starting OMR-ByPass...
Thu Aug 31 19:29:43 2023 daemon.info omr-tracker-v2ray: V2Ray is up (can contact via http 1.0.0.1)
Thu Aug 31 19:29:43 2023 user.notice omr-bypass: OMR-ByPass is running
Thu Aug 31 19:29:51 2023 user.notice omr-tracker: Launching...
Thu Aug 31 19:29:58 2023 user.notice omr-tracker: Launched
Thu Aug 31 19:29:58 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Thu Aug 31 19:29:58 2023 user.notice MPTCP: Flush route cache
Thu Aug 31 19:29:58 2023 user.notice MPTCP: Add route 10.a.a.a/24 via 10.0.0.1 dev eth2
 Aug 31 19:29:58 2023 user.notice omr-bypass: Starting OMR-ByPass...
Thu Aug 31 19:29:59 2023 user.notice omr-bypass: OMR-ByPass is running
Thu Aug 31 19:29:59 2023 daemon.info omr-tracker-v2ray: V2Ray is up (can contact via http 1.0.0.1)
Ysurac commented 10 months ago

You can fix MTU in Network->Interfaces for each interfaces, then it will not try to calculate MTU (but changing MTU should not put the interface down).

ioogithub commented 10 months ago

You can fix MTU in Network->Interfaces for each interfaces, then it will not try to calculate MTU (but changing MTU should not put the interface down).

Are you sure? I always see a reset after an MTU change:

I get this:

Fri Sep 1 01:08:08 2023 kern.info kernel: [ 3654.871161] igc 0000:04:00.0: changing MTU from 1500 to 1460

Fri Sep  1 01:08:08 2023 daemon.notice netifd: Network device 'eth2' link is down
Fri Sep  1 01:08:08 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Fri Sep  1 01:08:11 2023 daemon.notice netifd: Network device 'eth2' link is up
Fri Sep  1 01:08:11 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Fri Sep  1 01:08:11 2023 kern.info kernel: [ 3658.468358] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

I can also reproduce it with this:

ip link set dev eth0 mtu 1460

it will always reset the interface. Fri Sep 1 01:08:08 2023 daemon.notice netifd: Network device 'eth0' link is down

but this MTU just contributes to the problem, it doesn't fix the root cause of aggregate broken after interface restarted.

What about this error error after i reset wan2: Thu Aug 31 14:13:42 2023 kern.err kernel: [ 3351.487765] __mptcp_init4_subsockets: token 0xa463fb6a bind() to y.y.y.y index 12 failed, error -99 could this be causing the issue or is this just from the interface going down?

Ysurac commented 10 months ago

Seems that changing MTU put interface down in some case, I will change the code. The mptcp error is due to interface down yes. I will try to reproduce the issue by using DHCP on interfaces.

ioogithub commented 10 months ago

I am looking closely at the logs. When I reboot OMR I see these errors related to v2ray, could any of these cause the issue?

Fri Sep  1 12:40:06 2023 daemon.notice procd: /etc/rc.d/S99v2ray: ss-rules6: unknown option def
Fri Sep  1 12:40:06 2023 user.notice v2ray: Reload omr-bypass rules
Fri Sep  1 12:40:06 2023 daemon.notice procd: /etc/rc.d/S99v2ray: iptables-restore: line 2 failed
Fri Sep  1 12:40:06 2023 daemon.notice procd: /etc/rc.d/S99v2ray: iptables-restore: line 2 failed
Fri Sep  1 12:40:06 2023 daemon.notice procd: /etc/rc.d/S99v2ray: iptables-restore: line 2 failed
Fri Sep  1 12:40:06 2023 daemon.notice procd: /etc/rc.d/S99v2ray: iptables-restore: line 2 failed
Fri Sep  1 12:40:06 2023 daemon.notice procd: /etc/rc.d/S99v2ray: iptables-restore: line 2 failed
Fri Sep  1 12:40:07 2023 daemon.notice procd: /etc/rc.d/S99v2ray: Warning: Section 'zone_vpn' cannot resolve device of network 'omr6in4'
Fri Sep  1 12:40:07 2023 daemon.notice procd: /etc/rc.d/S99v2ray: Warning: Option @redirect[3].v2ray is unknown
Fri Sep  1 12:40:07 2023 daemon.notice procd: /etc/rc.d/S99v2ray: Warning: Option @redirect[4].v2ray is unknown
Fri Sep  1 12:40:07 2023 daemon.notice procd: /etc/rc.d/S99v2ray: Warning: Section @redirect[4] (19223 on v2ray) does not specify a protocol, assuming TCP+UDP
Fri Sep  1 12:40:07 2023 daemon.info v2ray[26149]: V2Ray 4.43.0 (V2Fly, a community-driven edition of V2Ray.) OpenWrt R1 (go1.17.3 linux/amd64)
Fri Sep  1 12:40:07 2023 daemon.info v2ray[26149]: A unified platform for anti-censorship.
Fri Sep  1 12:40:07 2023 daemon.info v2ray[26149]: 2023/09/01 16:40:07 [Info] main/jsonem: Reading config: /var/etc/v2ray/v2ray.main.json

The redirect 3 and 4 are the port forwards. At the end of the boot however aggregate does work so maybe this is normal?

From all my testing is seems aggregate uploads stop working always after an interface reset so booting OMR or reloading v2ray fixes it but there are several automated processes like MTU or omr-tracker which reset the interface.

I manually set MTU so that will prevent DHCP from restarting the interfaces. Is there anything I can do to prevent omr-tracker or restart, I don't want to disable it because it is necessary right?

Is there anything else I can adjust or configure to prevent automated interface restarts?

Ysurac commented 10 months ago

The errors doesn't cause issues. For omr-tracker, you can change settings in Services->OMR-Tracker. OMR-Tracker is necessary to set routes, check services, get some connection info,... The problem is more why connection are no more in V2ray in your case.

ioogithub commented 10 months ago

Strange event happened tonight.

  1. from VPS I see this event:
Sep 05 20:34:53 vps glorytun-tcp-run[7114]: read: Connection timed out
Sep 05 20:34:54 vps glorytun-tcp-run[7114]: ::ffff:35.x.x.x.59324: connected
Sep 05 20:34:54 vps glorytun-tcp-run[7114]: bad packet [E210xxxxxxxxxxxxxxxxx] !
Sep 05 20:34:54 vps glorytun-tcp-run[7114]: ::ffff:35.x.x.x.59324: key exchange failed
Sep 05 20:35:04 vps glorytun-tcp-run[7114]: ::ffff:a.a.a.a.24919: connected

This ip starting with 35.x.x.x, its a Google cloud ip form the UK. I am not close to this location, is this a hack attempt? did someone try to connect to my glorytun from the vps from this 35.x.x.x ip address?

After this, eth2 goes down:

Tue Sep  5 20:35:03 2023 daemon.err glorytun[14750]: read: Operation timed out
Tue Sep  5 20:35:03 2023 daemon.info glorytun[14750]: STOPPED tun0
Tue Sep  5 20:35:04 2023 daemon.err glorytun[14750]: vpsIP.65001: connected
Tue Sep  5 20:35:04 2023 daemon.info glorytun[14750]: STARTED tun0
Tue Sep  5 20:35:21 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched off because check error and ping from 10.x.x.x error (4.2.2.1,8.8.8.8,80.67.169.12)
Tue Sep  5 20:35:21 2023 user.notice post-tracking-post-tracking: Delete default route to vpsip via 10.x.x.x dev eth2
Tue Sep  5 20:42:43 2023 user.notice omr-bypass: Starting OMR-ByPass...
Tue Sep  5 20:42:44 2023 user.notice omr-bypass: OMR-ByPass is running
Tue Sep  5 20:42:44 2023 user.notice MPTCP: Flush route cache
Tue Sep  5 20:42:44 2023 user.notice MPTCP: Add route 10.0.0.0/24 via 10.x.x.x dev eth2
Tue Sep  5 20:42:44 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Tue Sep  5 20:42:44 2023 user.notice omr-tracker: Launching...
Tue Sep  5 20:42:51 2023 user.notice omr-tracker: Launched
Tue Sep  5 20:42:52 2023 daemon.info omr-tracker-v2ray: V2Ray is up (can contact via http 1.0.0.1)
Tue Sep  5 20:43:06 2023 daemon.notice netifd: Network device 'eth2' link is down
Tue Sep  5 20:43:06 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Tue Sep  5 20:43:06 2023 kern.info kernel: [176347.654817] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Down
Tue Sep  5 20:43:13 2023 daemon.notice netifd: Network device 'eth2' link is up
Tue Sep  5 20:43:13 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Tue Sep  5 20:43:13 2023 kern.info kernel: [176354.494963] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Tue Sep  5 20:43:17 2023 user.notice post-tracking-post-tracking: Interface route not yet set, set route ip r add default via 10.x.x.x dev eth2 metric 4
Tue Sep  5 20:43:18 2023 user.notice post-tracking-post-tracking: New public ip detected for wan2 (eth2): b.b.b.b

OMR does not recover and bring eth2 back online until I manually restart the interface on Network-> Interfaces page. Should OMR be able to recover from this type of even without user intervention? I waited 8 minutes before I restarted the interface.

After this, I see the typical bug, traffic only one the other wan. I restart v2ray on the vps: systemctl restart v2ray.service and it is back to normal.

Questions:

  1. Does this look like a hack attempt? An unknown IP send a bad packet to glorytun, caused key exchange to fail and this brought down an interface.
  2. Should OMR be able to recover from this type of event without manual intervention? If so how and what config do I need to set.
  3. restarting v2ray restored aggregate bandwidth. @Ysurac I want to issue this command: systemctl restart v2ray.service when an interface goes down and is restored. Its not perfect solution but its better than leaving OMR is single mode or rebooting OMR manually. What is the best way to do this? Which script should I add this code too and where in the script should I add the line?
ioogithub commented 10 months ago

I see two more strange log entries on the vps:

167.x.x.x IP address trying to connect to v2ray?

Sep 04 14:16:11 vps v2ray[823534]: 2023/09/04 14:16:11 167.2x.xx.xx:55732 rejected  proxy/vless/encoding: failed to read request version > read tcp 172.16.0.7:65228->167.2x.xx.xx:55732: read: connection reset by peer
Sep 04 14:16:12 vps v2ray[823534]: 2023/09/04 14:16:12 167.2x.xx.xx:51324 rejected  proxy/vless/encoding: invalid request version
Sep 04 14:16:12 vps v2ray[823534]: 2023/09/04 14:16:12 167.2x.xx.xx:42920 rejected  proxy/vless/encoding: failed to read request version > read tcp 172.16.0.7:65228->167.2x.xx.xx:42920: read: connection reset by peer
Sep 04 14:16:13 vps v2ray[823534]: 2023/09/04 14:16:13 167.2x.xx.xx:48674 rejected  proxy/vless/encoding: failed to read request version > tls: first record does not look like a TLS handshake
Sep 04 14:16:14 vps v2ray[823534]: 2023/09/04 14:16:14 167.2x.xx.xx:59622 rejected  proxy/vless/encoding: failed to read request version > read tcp 172.16.0.7:65228->167.2x.xx.xx:59622: read: connection reset by peer

Another 167 IP (different) trying to connec to shadowsocks:

Sep 05 11:50:46 vps ss-server[809]: server recv: Connection reset by peer
Sep 05 11:50:46 vps ss-server[809]: failed to handshake with 167.9x.xx.xx: authentication error
Sep 05 11:50:49 vps ss-server[809]: server recv: Connection reset by peer
Sep 05 11:50:53 vps ss-server[809]: server recv: Connection reset by peer
Sep 05 11:50:53 vps ss-server[809]: server recv: Connection reset by peer

What do these logs mean, can outside IP addresses connect to shadowsocks, v2ray and glorytun servers running on the VPS or is this just normal log behavior?

Ysurac commented 10 months ago

Yes these IP are some attack, often a scan done on all ports. Packet is send but without the key and even without the good protocol they are rejected. Nothing to worry about.

OMR should be able to recover an interface but in your case it seems the interface doesn't answer correctly before an up and down. There is an option in Services->OMR-Tracker to "Restart if down" an interface that do exactly the same as "Restart" button in Network->Interfaces.

When an interface is down on the router you restart v2ray on the VPS ? Possible to restart v2ray on router but not on VPS side when an interface is down.

ioogithub commented 10 months ago

Yes these IP are some attack, often a scan done on all ports. Packet is send but without the key and even without the good protocol they are rejected. Nothing to worry about.

Yes I understand this without the correct auth they are rejected but can this type of attack cause a problem? Eth2 went down after the glorytun-tcp-run bad packet. Bad packet: 20:34:54 tun0 restarted: 20:35:04 eth2 down: 20:35:21

Did this attack cause eth2 to go down? If so then even if its not effective access it still causes a problem.

OMR should be able to recover an interface but in your case it seems the interface doesn't answer correctly before an up and down.

Is there something I should investigate to get it working correctly or should I just add the check?

There is an option in Services->OMR-Tracker to "Restart if down" an interface that do exactly the same as "Restart" button in Network->Interfaces.

I set this, it looks like it happened again at 3am. No attack this time just this: Wed Sep 6 03:34:21 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched off because check error and ping from 10.x.x.x error (80.67.169.40,114.114.114.114,1.1.1.1)

What causes this? Sometimes I see 1 week no down event then 1 day it happens 1 or 2 times. Do I need to adjust ome-tracker, is the ping timeout too low? I have these settings under default: timetout: 5 tries: 3 retry: 2 wait: 0

Which setting should I adjust if I see frequent eth drops?

When an interface is down on the router you restart v2ray on the VPS ? Possible to restart v2ray on router but not on VPS side when an interface is down.

Yes I wanted to ask you about this. Is it better to restart on vps or OMR? If I restart on vps it seems not too disruptive. I can have a file transfer running and I see the bandwidth dips but doesn't actually terminate the transfer. It does terminate other connections like irc. What is the difference? Is one better than the other? If I restart on OMR, what command do I use? /etc/init.d/v2ray restart? I can test on OMR if it is better to restart here.

Which script should I modify so that it automatically restarts v2ray after an interface restarts? Is it Wed Sep 6 03:34:21 2023 user.notice post-tracking-post-tracking or something else?

If so which function should I add the restart?

Ysurac commented 10 months ago

The attack is on the VPS, not related with eth2 on the router.

You can increase timeout, tries and retry but it's a connection issue.

Strange that v2ray restart on VPS change something... You can set to restart v2ray on interface up: uci set openmptcprouter.wan2.script_alert_up='/etc/init.d/v2ray restart'

ioogithub commented 10 months ago

I found something interesting while looking for a solution. I noticed that v2ray spans a lot of unconnected UDP sockets that never have any data in their sent or received queues on the vps. Here is an example from this command: ss -tuln -p:

udp            UNCONN          0               0                                     *:19912                              *:*             users:(("v2ray",pid=1339875,fd=109))
udp            UNCONN          0               0                                     *:37331                              *:*             users:(("v2ray",pid=1339875,fd=90))
udp            UNCONN          0               0                                     *:11731                              *:*             users:(("v2ray",pid=1339875,fd=410))
udp            UNCONN          0               0                                     *:43477                              *:*             users:(("v2ray",pid=1339875,fd=320))
udp            UNCONN          0               0                                     *:27097                              *:*             users:(("v2ray",pid=1339875,fd=456))
udp            UNCONN          0               0                                     *:18911                              *:*             users:(("v2ray",pid=1339875,fd=536))
udp            UNCONN          0               0                                     *:44513                              *:*             users:(("v2ray",pid=1339875,fd=429))
udp            UNCONN          0               0                                     *:56803                              *:*             users:(("v2ray",pid=1339875,fd=347))
udp            UNCONN          0               0                                     *:43491                              *:*             users:(("v2ray",pid=1339875,fd=539))
udp            UNCONN          0               0                                     *:41446                              *:*             users:(("v2ray",pid=1339875,fd=427))
udp            UNCONN          0               0                                     *:26092                              *:*             users:(("v2ray",pid=1339875,fd=274))
udp            UNCONN          0               0                                     *:28143                              *:*             users:(("v2ray",pid=1339875,fd=466))
udp            UNCONN          0               0                                     *:19952                              *:*             users:(("v2ray",pid=1339875,fd=498))
udp            UNCONN          0               0                                     *:63992                              *:*             users:(("v2ray",pid=1339875,fd=541))

v2ray was restarted 1h20m ago and it currently has 276 of these open,unused sockets which seems high. Is this normal?

I also see a flood of connect message in the journalctl from the vps like this:

Sep 07 00:06:56 vps v2ray[1339875]: 2023/09/07 00:06:56 a.a.a.a:2557 accepted udp:x.x.x.x:28777 [direct] email: openmptcprouter
Sep 07 00:06:56 vps v2ray[1339875]: 2023/09/07 00:06:56 a.a.a.a:17109 accepted udp:x.x.x.x:51415 [direct] email: openmptcprouter
Sep 07 00:06:56 vps v2ray[1339875]: 2023/09/07 00:06:56 a.a.a.a:42446 accepted udp:x.x.x.x:58168 [direct] email: openmptcprouter
Sep 07 00:06:56 vps v2ray[1339875]: 2023/09/07 00:06:56 a.a.a.a:60231 accepted udp:x.x.x.x:49001 [direct] email: openmptcprouter
Sep 07 00:06:56 vps v2ray[1339875]: 2023/09/07 00:06:56 a.a.a.a:3317 accepted udp:x.x.x.x:51035 [direct] email: openmptcprouter

For example for this second (00:06:56) there were 30 entries. Usually its closer to 1-5 every second. Are these journal logs related to the open sockets? I recognize a lot of these IP addresses on the omt-tracker page. Does omr-tracker open up new UDP socket for every connection attempt? Another strange thing about these connection logs is (a.a.a.a) is always wan1, I never see any for wan2. Is this also normal?

My current fix for this single-path aggregate problem is to restart v2ray and incoming transfers on my port use aggregate again. I noticed roughly that I would have to restart every 1 or 2 hours to restore multipath aggregate. Is v2ray becoming overwhelmed by this large amount of open sockets or is this a normal amount? why are they opened and not closed? Why always UDP?

Ysurac commented 10 months ago

You shouldn't have so many UDP sockets if you don't use UDP too much. What are the destination IPs ? OMR-Tracker only do http request for the proxy part and ping request for wan part if you didn't set it to DNS check. In the log you see the initial wan connection, it's normal. It's a large amount with some random destination port, this seems really strange...

ioogithub commented 10 months ago

You shouldn't have so many UDP sockets if you don't use UDP too much.

I don't have many this behavior doesn't look right to me.

What are the destination IPs ?

The destination IPs are always the IP address of WAN1. I did this journalctl | grep wan2IP and I don't see any results.

OMR-Tracker only do http request for the proxy part and ping request for wan part if you didn't set it to DNS check.

Proxy: Enabled

Server tracker: Enabled

Default settings: Enabled Mail Alert Restart if down

OMR VPN: Enabled Server http test Mail alert

There are so many messages in the journalctl that I always have to do something like this journalctl -f -n 50000 | grep -v -e accepted to view it for general troubleshooting.

Also these show up as warnings, why are connection events warnings: from config.json "loglevel": "warning", what is v2ray warning about if these are just connection events?

Why do these requests only go to wan1 and why are there so many of them? Maybe this has something to do with the problem, after a while (around 1-2 hours) v2ray aggregate stops working, it always works after a restart. Also after 1-2 hours there are hundreds of unconnected UDP sockets open.

I restarted v2ray 30 minutes ago and I am already at 242 sockets open: sudo ss -tuln -p | grep v2ray | wc -l 253

There are all like this on different port numbers: udp UNCONN 0 0 *:13591 *:* users:(("v2ray",pid=3146957,fd=264))

Here is an example of a legitimate port that I am forwarding: tcp LISTEN 0 4096 *:19223 *:* users:(("v2ray",pid=3146957,fd=9)) but what is opening all these UDP unconnected sockets?

In the log you see the initial wan connection, it's normal.

I will double check, are you asking for jounrlactl log for wan connection when vps is booted or something else?

Ysurac commented 10 months ago

Destination IP is WAN IP ? I think it's source IP. Destination IP is x.x.x.x in your journalctl log. OMR-Tracker use http so port 80 for proxy check. For OMR-Tracker connection status, DNS check is set if you change this (default is ping): image V2Ray is verbose in warning mode, I changed this in develop script to error mode. As I said, it's normal that you see only one IP, OpenMPTCProuter doesn't do some round robin. You need to find what is opening so much UDP port, maybe something running on a client computer on LAN side.

ioogithub commented 10 months ago

I just had a chance to test this change we made earlier: uci set openmptcprouter.wan2.script_alert_up='/etc/init.d/v2ray restart' it adds this config to /etc/config/openmptcprputer

config interface 'wan2'
        option metric '4'
        option multipathvpn '0'
        option asn 'ROGERS-COM-DPS'
        option mptcp_status 'MPTCP enabled'
        option multipath 'on'
        option mtu '1460'
        option manufacturer 'huawei'
        option script_alert_up '/etc/init.d/v2ray restart'
        option publicip '97.109.191.12'
        option state 'up'
        option lc '1694116043'
        option latency_previous '41'
        option latency '30'

but it did not work. eth2 was just restated by omr-tracker, it came back online but v2ray did not restart. Here is the log:

Thu Sep  7 15:46:17 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched off because check error and ping from 10.x.x.x error (9.9.9.9,1.0.0.1,114.114.115.115)
Thu Sep  7 15:46:17 2023 user.notice post-tracking-post-tracking: Delete default route to 1vpsip via 10.0.0.1 dev eth2
Thu Sep  7 15:46:19 2023 user.notice post-tracking-post-tracking: Restart wan2
Thu Sep  7 15:46:19 2023 daemon.notice netifd: wan2 (28088): udhcpc: received SIGTERM
Thu Sep  7 15:46:19 2023 daemon.notice netifd: wan2 (28088): udhcpc: unicasting a release of 10.x.x.x to 10.0.0.1
Thu Sep  7 15:46:19 2023 daemon.notice netifd: wan2 (28088): udhcpc: sending release
Thu Sep  7 15:46:19 2023 daemon.notice netifd: wan2 (28088): udhcpc: entering released state
Thu Sep  7 15:46:19 2023 daemon.notice netifd: wan2 (28088): Command failed: Permission denied
Thu Sep  7 15:46:19 2023 daemon.notice netifd: Interface 'wan2' is now down
Thu Sep  7 15:46:19 2023 kern.info kernel: [88622.022865] igc 0000:04:00.0: changing MTU from 1460 to 1500
Thu Sep  7 15:46:19 2023 daemon.notice netifd: Interface 'wan2' is disabled
Thu Sep  7 15:46:19 2023 daemon.notice netifd: Network device 'eth2' link is down
Thu Sep  7 15:46:19 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Thu Sep  7 15:46:20 2023 user.notice MPTCP: Flush route cache
Thu Sep  7 15:46:24 2023 kern.info kernel: [88626.981973] igc 0000:04:00.0: changing MTU from 1500 to 1460
Thu Sep  7 15:46:24 2023 daemon.notice netifd: Interface 'wan2' is enabled
Thu Sep  7 15:46:24 2023 daemon.notice netifd: Interface 'wan2' is setting up now
Thu Sep  7 15:46:24 2023 daemon.notice netifd: wan2 (11044): udhcpc: started, v1.33.2
Thu Sep  7 15:46:24 2023 daemon.notice netifd: wan2 (11044): udhcpc: sending discover
Thu Sep  7 15:46:27 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Sep  7 15:46:27 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Sep  7 15:46:27 2023 kern.info kernel: [88629.977387] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Thu Sep  7 15:46:27 2023 daemon.notice netifd: wan2 (11044): udhcpc: sending discover
Thu Sep  7 15:46:33 2023 daemon.notice netifd: Interface 'wan2' is now up
Thu Sep  7 15:46:33 2023 daemon.notice netifd: Network device 'eth2' link is down
Thu Sep  7 15:46:33 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Thu Sep  7 15:46:33 2023 user.notice firewall: Reloading firewall due to ifup of wan2 (eth2)
Thu Sep  7 15:46:33 2023 user.notice firewall.omr-server: Firewall reload, set server part firewall reloading
Thu Sep  7 15:46:33 2023 user.notice mptcp: Reloading mptcp config due to ifup of wan2 (eth2)
Thu Sep  7 15:46:34 2023 user.notice MPTCP: Flush route cache
Thu Sep  7 15:46:35 2023 user.notice post-tracking-post-tracking: Set firewall on server vps
Thu Sep  7 15:46:37 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Sep  7 15:46:37 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Sep  7 15:46:37 2023 kern.info kernel: [88640.106992] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Thu Sep  7 15:46:38 2023 user.notice MPTCP: Flush route cache
Thu Sep  7 15:46:39 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched up
Thu Sep  7 15:48:05 2023 user.notice MPTCP: Flush route cache
Thu Sep  7 15:48:06 2023 user.notice omr-bypass: Starting OMR-ByPass...
Thu Sep  7 15:48:06 2023 user.notice omr-bypass: OMR-ByPass is running
Thu Sep  7 15:48:06 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Thu Sep  7 15:48:06 2023 user.notice omr-tracker: Launching...
Thu Sep  7 15:48:13 2023 user.notice omr-tracker: Launched
Thu Sep  7 15:48:14 2023 daemon.info omr-tracker-v2ray: V2Ray is up (can contact via http 1.0.0.1)
Thu Sep  7 15:48:37 2023 user.notice omr-tracker: Launching...
Thu Sep  7 15:48:44 2023 user.notice omr-tracker: Launched
Thu Sep  7 15:48:44 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Thu Sep  7 15:48:44 2023 user.notice omr-bypass: Starting OMR-ByPass...
Thu Sep  7 15:48:45 2023 user.notice omr-bypass: OMR-ByPass is running
Thu Sep  7 15:48:45 2023 user.notice MPTCP: Flush route cache
Thu Sep  7 15:48:49 2023 user.notice omr-bypass: Starting OMR-ByPass...

Here is the code that detected wan2 was back up:

Thu Sep  7 15:46:37 2023 daemon.notice netifd: Network device 'eth2' link is up
Thu Sep  7 15:46:37 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Thu Sep  7 15:46:39 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched up

post-tracking-post-tracking can see wan2 is switched up but /etc/init.d/v2ray restart never ran. Is this the right place or is there somewhere else I can put this?

Ysurac commented 10 months ago

Yes, I made a mistake, it's uci set omr-tracker.wan2=interface && uci set omr-tracker.wan2.script_alert_up='/etc/init.d/v2ray restart'

ioogithub commented 10 months ago

Yes, I made a mistake, it's uci set omr-tracker.wan2=interface && uci set omr-tracker.wan2.script_alert_up='/etc/init.d/v2ray restart'

Thanks, I will try this new command. Is there a reverse command to delete the old one?

Also, I am seeing some error messages on the vps like this:


Sep 06 00:59:57 vps v2ray[2580939]: 2023/09/06 00:59:57 x.x.x.x:17668 rejected  proxy/vless/encoding: failed to read request version > read tcp b.b.b.b:65228->x.x.x.x:17668: read: connection reset by peer
Sep 08 16:30:16 vps v2ray[3697154]: 2023/09/08 16:30:16 x.x.x.x:53988 rejected  proxy/vless/encoding: failed to read request version > tls: first record does not look like a TLS handshake
Sep 08 16:30:41 vps v2ray[3697154]: 2023/09/08 16:30:41 x.x.x.x:60627 rejected  proxy/vless/encoding: failed to read request version > tls: first record does not look like a TLS handshake

x.x.x.x is the public ip of my wan1 router interface. Are these error typical, what causes then, could they be a problem?

Ysurac commented 10 months ago

uci del openmptcprouter.wan2.script_alert_up No idea for the V2Ray errors.

ioogithub commented 10 months ago

No idea for the V2Ray errors.

Have you never seen this error before? I have another one on the vps, a.a.a.a is wan1. and x.x.x.x the vps IP:

Sep 09 11:21:30 vps v2ray[3697154]: 2023/09/09 11:21:30 a.a.a.a:9035 rejected proxy/vless/encoding: failed to read request version > read tcp x.x.x.x:65228->a.a.a.a:9035: read: connection reset by peer

There is no corresponding error on the router at this time.

What does the message mean, is the v2ray connection between the vps and the router being terminated, is there anyway to troubleshoot this further? There is what I found "This could be caused by various factors such as incorrect configuration files, inconsistent system time or id between the server and client, or blocked v2ray servers"

ioogithub commented 10 months ago

It seems to happen once per day on average:

Sep 11 15:55:59 vps v2ray[818]: 2023/09/11 15:55:59 y.y.y.y:12679 rejected proxy/vless/encoding: failed to read request version > read tcp x.x.x.x:65228->y.y.y.y:12679: read: connection reset by peer

Can you think of anyway to troubleshoot this? Why is v2ray server rejecting v2ray client on OMR? maybe this is contributing the the problem.

ioogithub commented 9 months ago

Yes, I made a mistake, it's uci set omr-tracker.wan2=interface && uci set omr-tracker.wan2.script_alert_up='/etc/init.d/v2ray restart'

I am modifying my new install to restart glorytun and not v2ray when an interface restarts because this fixes aggreegate. It is not working. Here is the new command:

uci set omr-tracker.wan2=interface && uci set omr-tracker.wan2.script_alert_up='/etc/init.d/glorytun restart'

uci show:

omr-tracker.wan2=interface
omr-tracker.wan2.script_alert_up='/etc/init.d/glorytun restart'

when i restart eth2, run logread | grep glorytun <no results>

This command does not run. It did work when I had the v2ray, why won't this command run to restart glorytun? Is this command correct?

I confirm, aggregate is not automatically restored. I do need to find a way to automatically restart glorytun on an interface restart.

ioogithub commented 9 months ago

okay it is actually working, my test was wrong, it was restarting the interface, when i did a proper stop and start it did trigger:


Wed Sep 27 16:49:35 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched up
Wed Sep 27 16:49:35 2023 daemon.info glorytun[22860]: STOPPED tun0
Wed Sep 27 16:49:35 2023 daemon.notice netifd: Network device 'tun0' link is down
Wed Sep 27 16:49:35 2023 daemon.notice netifd: Interface 'omrvpn' has link connectivity loss
Wed Sep 27 16:49:35 2023 daemon.notice netifd: Interface 'omrvpn' is now down
Wed Sep 27 16:49:35 2023 daemon.info glorytun: starting glorytun vpn instance vpn
Wed Sep 27 16:49:35 2023 daemon.notice netifd: Interface 'omrvpn' is disabled
Wed Sep 27 16:49:36 2023 daemon.info glorytun[27727]: INITIALIZED tun0
Wed Sep 27 16:49:36 2023 daemon.notice netifd: Interface 'omrvpn' is enabled
Wed Sep 27 16:49:36 2023 daemon.notice netifd: Network device 'tun0' link is up
Wed Sep 27 16:49:36 2023 daemon.notice netifd: Interface 'omrvpn' has link connectivity
Wed Sep 27 16:49:36 2023 daemon.notice netifd: Interface 'omrvpn' is setting up now
Wed Sep 27 16:49:36 2023 daemon.notice netifd: Interface 'omrvpn' is now up

I did not see this before. Aggregate is restored but this is not a great solution it is too disruptive, is there any other way to fix this issue, I am using v2ray for TCP and UDP so why does restarting the glorytun VPN fix aggregate issues with v2ray?

ip route show table 3 and ip route show table 4

is the same when there is no aggregate and when it is working again.

ioogithub commented 9 months ago

After restarting glorytun I got a kernel panic, first time I have seen this on omr:

Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.153871] ------------[ cut here ]------------
Wed Sep 27 16:51:16 2023 kern.crit kernel: [ 5122.154394] Kernel BUG at dql_completed+0x140/0x150 [verbose debug info unavailable]
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.155374] invalid opcode: 0000 [#1] SMP NOPTI
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.155837] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G         C        5.4.194 #0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.156816] Hardware name: redacted
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.157748] RIP: 0010:dql_completed+0x140/0x150
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.158212] Code: f8 ea 00 48 89 77 58 e9 4d ff ff ff 45 85 e4 41 0f 95 c4 41 39 db 0f 95 c0 41 84 c4 74 05 45 85 ed 78 09 89 c8 e9 2f ff ff ff <0f> 0b 01 f6 89 c8 29 f0 41 0f 48 c2 eb 8d 90 90 55 48 89 e5 41 56
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.159706] RSP: 0018:ffffc90000003e38 EFLAGS: 00010287
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.160196] RAX: 0000000000000002 RBX: ffffc90004a05d20 RCX: 0000000000000056
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.160660] RDX: 000000000000004a RSI: 00000000000000a0 RDI: ffff888173817340
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.161114] RBP: ffffc90000003e50 R08: 0000000000000000 R09: 0000000000017389
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.161557] R10: 0000000000000000 R11: 000000000001733f R12: ffff8881752daa80
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.161991] R13: ffff888173817280 R14: ffff888173135460 R15: 00000000ffffff46
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.162424] FS:  0000000000000000(0000) GS:ffff88817c000000(0000) knlGS:0000000000000000
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.163363] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.163790] CR2: 0000560d17df9000 CR3: 0000000003208000 CR4: 0000000000340ef0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.164216] Call Trace:
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.164628]  <IRQ>
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.165084]  igc_del_mac_steering_filter+0x4e1/0x2470 [igc]
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.165512]  __napi_poll+0x2c/0xb0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.165908]  net_rx_action+0xca/0x1f0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.166304]  __do_softirq+0xc9/0x20e
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.166694]  irq_exit+0x6f/0x80
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.167072]  do_IRQ+0x57/0xe0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.167444]  common_interrupt+0xf/0xf
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.167807]  </IRQ>
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.168159] RIP: 0010:cpuidle_enter_state+0x137/0x2c0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.168535] Code: 89 c6 e8 7c 0f 9f ff 45 84 ff 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 67 01 00 00 31 ff e8 40 fa a2 ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 c5 00 00 00 49 63 cd 4c 2b 75 d0 48 8d 04 49 48 c1
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.169694] RSP: 0018:ffffffff82203e08 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd7
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.170427] RAX: ffff88817c0246c0 RBX: ffffffff822550e0 RCX: 000000000000001f
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.170881] RDX: 000004a898279135 RSI: 00000000401a41a4 RDI: 0000000000000000
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.171286] RBP: ffffffff82203e38 R08: 0000000000000000 R09: 0000000000023f40
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.171678] R10: 000009513d583da4 R11: ffffc90005627cf8 R12: ffff88817a8ac400
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.172078] R13: 0000000000000003 R14: 000004a898279135 R15: 0000000000000000
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.172440]  ? cpuidle_enter_state+0x114/0x2c0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.172800]  cpuidle_enter+0x2f/0x40
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.173157]  call_cpuidle+0x1e/0x40
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.173516]  do_idle+0xf4/0x1c0
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.173861]  cpu_startup_entry+0x18/0x20
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.174201]  rest_init+0xa4/0xa6
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.174562]  arch_call_rest_init+0x9/0xc
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.174885]  start_kernel+0x57a/0x588
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.175200]  x86_64_start_reservations+0x29/0x2b
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.175511]  x86_64_start_kernel+0x71/0x74
Wed Sep 27 16:51:16 2023 kern.warn kernel: [ 5122.175824]  secondary_startup_64+0xa4/0xb0

not sure what the implications of this are.

ioogithub commented 9 months ago

Here is a real world example. eth2 went down:

Wed Sep 27 18:32:56 2023 daemon.notice netifd: Network device 'eth2' link is down
Wed Sep 27 18:32:56 2023 daemon.notice netifd: Interface 'wan2' has link connectivity loss
Wed Sep 27 18:32:56 2023 kern.info kernel: [ 6087.417383] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Down
Wed Sep 27 18:33:03 2023 daemon.notice netifd: Network device 'eth2' link is up
Wed Sep 27 18:33:03 2023 daemon.notice netifd: Interface 'wan2' has link connectivity
Wed Sep 27 18:33:03 2023 kern.info kernel: [ 6094.447380] igc 0000:04:00.0 eth2: igc: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Wed Sep 27 18:33:04 2023 user.notice post-tracking-post-tracking: wan2 (eth2) switched up
Wed Sep 27 18:33:04 2023 user.notice post-tracking-post-tracking: Interface route not yet set, set route ip r add default via x.x.0.1 dev eth2 metric 4
Wed Sep 27 18:33:06 2023 user.notice post-tracking-post-tracking: New public ip detected for wan2 (eth2): x.x.x.x
Wed Sep 27 18:33:06 2023 user.notice post-tracking-post-tracking: Reload MPTCP for eth2

but the this did not run:

omr-tracker.wan2=interface
omr-tracker.wan2.script_alert_up='/etc/init.d/glorytun restart'

I had to run this manually, then it returned to aggreate: /etc/init.d/glorytun restart.

Is there another way to get glorytun to restart when an interface goes down and up again automatically?

ioogithub commented 9 months ago

The problem is that this command does not persist across reboots:

uci set omr-tracker.wan2=interface && uci set omr-tracker.wan2.script_alert_up='/etc/init.d/glorytun restart'

the value is not written to /etc/config/omr-tracker

this setting does work, how can i get it to persist?