Ysurac / openmptcprouter

OpenMPTCProuter is an open source solution to aggregate multiple internet connections using Multipath TCP (MPTCP) on OpenWrt
https://www.openmptcprouter.com/
GNU General Public License v3.0

OMR causing LAN port boot loop when obtaining DHCP info with x86 hardware and dedicated LAN ports. #2584

Closed ioogithub closed 1 year ago

ioogithub commented 1 year ago

Expected Behavior

When wan1 and wan2 are connected to OMR router and configured with DHCP, IP will be obtained and link will remain stable and active.

Current Behavior

When wan1 (starlink) and wan2 (4g) are connected to OMR router with direct LAN ports, DHCP info (IP/gateway etc) for each device is assigned to the interface, internet connection comes up for a few seconds and status page is green, then the link is terminated. Link and activity lights on the RJ45 socket are turned off. After a few seconds lights are on again, and this loop continues.

Possible Solution

No idea but I do not see this behavior on other Linux systems with same hardware.

Steps to Reproduce the Problem

  1. Install OMR on device.
  2. Boot, assign DHCP to wan1 and wan2 in the System->OpenMPTCPRouter->Wizard
  3. Observe both interfaces will go green and get IP addresses, internet will be available.
  4. After 10-15s observe that both interfaces will go red and OMR will turn off the LAN port (no physical lights)
  5. Observe that both interfaces will go green again.
  6. Repeat ...
  7. Tested with v0.59.1-5.4 r0+16594-ce92d and 5.15 kernel v0.59.1-5.15-r0+20029-3c06a344e9-x86-64-generic-ext4-combined-efi
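
For anyone reproducing this, here is a minimal sketch of one way to quantify the loop from the router's shell: sample the kernel's carrier flag for a WAN port and count how often it changes. The interface name `eth1` and the helper names `sample_carrier` and `count_flaps` are assumptions for illustration only; `/sys/class/net/<iface>/carrier` is the standard Linux sysfs attribute.

```shell
#!/bin/sh
# Sketch only: helper names and the eth1 example are illustrative, not part of OMR.

# Print the carrier flag (1 = link up, 0 = link down) once per second.
# $1 = interface, $2 = number of samples, $3 = sysfs root (optional, for testing).
sample_carrier() {
    iface="$1"
    samples="$2"
    root="${3:-/sys/class/net}"
    i=0
    while [ "$i" -lt "$samples" ]; do
        # Reading carrier fails while the interface is down or missing;
        # treat that as "no link".
        cat "$root/$iface/carrier" 2>/dev/null || echo 0
        sleep 1
        i=$((i + 1))
    done
}

# Read one carrier value per line on stdin and print how many times it changed.
count_flaps() {
    awk 'NR > 1 && $1 != prev { n++ } { prev = $1 } END { print n + 0 }'
}
```

On the router, `sample_carrier eth1 60 | count_flaps` would report the number of link transitions seen in roughly one minute. A stable link should print 0.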

Context (Environment)

Other testing:

Specifications

Tested on:

  1. OpenMPTCProuter version: openmptcprouter v0.59.1-5.4 r0+16594-ce92d
  2. OpenMPTCProuter version: openmptcprouter-v0.59.1-5.15-r0+20029-3c06a344e9-x86-64-generic-ext4-combined-efi
    • OpenMPTCProuter VPS version: wget -O - https://www.openmptcprouter.com/server/debian-x86_64.sh | sh
    • OpenMPTCProuter VPS provider: linode and digital ocean
    • OpenMPTCProuter platform: x86_64 and x86_64 with 5.15 Kernel
    • Settings: all defaults, only LAN IP and DHCP for wan1, wan2 set.

Logs

I can't currently use OMR with this issue so I am available to test, please let me know what additional data is needed to troubleshoot the issue and I will attach it to the issue.

Network-Traditions commented 1 year ago

We experienced this with v0.58.5 r0+16336-b36068d35d on x86 per #2359:

Subsequently, I will continue testing v0.59 as it evolves and provide feedback where it will be helpful to the maturation of the project. In closing, while ModemManager significantly established a stable and functioning wan interface, the Intel I225-V revision 3 2.5GbE NICs with the kmod-igc driver are less stable. The observed instability to date has been an up/down loop under certain conditions, the specifics of which are unknown at this time. When reconnecting my StarLink v2 configured as a bridge mode DHCP eth1 connection member of MPTCP, the DHCP IP address assignment connects then disconnects every few seconds if the interface is brought up by anything other than the OpenMPTCProuter settings wizard. Using the settings wizard, the interface is brought up, obtains its DHCP IP address and begins to function nominally.

and our recent experience with v0.59.1-5.4 r0+16594-ce92de8c8c on x86 per #2550:

For our Starlink v2 with the Ethernet adapter, we are using "bypass mode" so Starlink supplies a bridged DHCP IP address to the connected OMR Ethernet interface. We've had a similar experience with the wwan ModemManager interface for our USB 3.0 5G modem as well. At times, we have found these configurations benefit from the "Force link" checked in Network-Interfaces-Advanced Settings tab of the aforementioned OMR interfaces. Unchecked, 5G and/or Starlink will sometimes end up in a connect/disconnect loop for reasons unknown at this time.

Kalimeiro commented 1 year ago
  • When I installed OMR and configure the wan1 and wan2 interfaces for DHCP I get the interface boot loop described above.

  • If I connect the USB-> LAN adapters, everything works as expected.

When you use LAN adapters, you physically separate each network connection so that DHCP requests are not sent or received between wan1/wan2 and the local network. I think this is correct and probably works perfectly.

When you do not use LAN adapters, do you use VLANs to virtually separate each network connection?

For SQM, in my experience, I totally disable it, including on the default tun0.

ioogithub commented 1 year ago

"Force link" checked in Network-Interfaces-Advanced Settings tab

This is already checked by default so I am already using it.

Using the settings wizard, the interface is brought up, obtains its DHCP ip address and begins to function nominally.

So how were you able to fix the issue? I am not using ModemManager and my Starlink is in bypass mode. However, I am seeing the disconnect on both the Starlink and the 4g routers, so I do not think it is related to the Starlink. What did you ultimately do to get it working? I have a similar box to the one you purchased (same brand) but I cannot get a connection. I am already using the 0.59 version and I tested with the 5.15 kernel as well.

Were you ultimately able to get it working?

ioogithub commented 1 year ago

When you use LAN adapters, you physically separate each network connection so that DHCP requests are not sent or received between wan1/wan2 and the local network. I think this is correct and probably works perfectly.

When you do not use LAN adapters, do you use VLANs to virtually separate each network connection?

The LAN adapters act as separate physical interfaces. The new x86 box has 4 built-in LAN ports, so it is the exact same configuration in terms of networking. They aren't acting as a switch or anything, so no VLANs are involved. It is basically a computer with 4 NIC ports.

The only difference between using the USB adapters and the box with 4 NICs is that the box has a big advantage: each physical LAN port on the new box has its own network chip on the system board, which eliminates the possibility of a bandwidth bottleneck on the USB bus. With the USB->LAN adapters all the data is processed on the same USB bus. The x86 box should be a much better setup, however it doesn't work at all.

Ysurac commented 1 year ago

I think you removed some info from the log, so what are wan_ip1 and wan_ip3 here (and why such a short lease time?):

Wed Sep 28 21:49:08 2022 daemon.notice netifd: wan2 (17160): udhcpc: broadcasting select for wan_ip1, server wan_ip3
Wed Sep 28 21:49:08 2022 daemon.notice netifd: wan2 (17160): udhcpc: lease of wan_ip1 obtained from wan_ip3, lease time 300

Also did you try with only one interface enabled ?
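
A minimal sketch for pulling the lease times out of log lines like the two quoted above, so short leases stand out at a glance. The helper name `lease_times` is an assumption for illustration, and the pattern assumes the netifd/udhcpc log format exactly as it appears in this thread.

```shell
#!/bin/sh
# Sketch only: lease_times is an illustrative helper, not an OMR or OpenWrt tool.

# Print "<interface> <lease seconds>" for every udhcpc "lease ... obtained"
# line read from stdin; all other lines are ignored.
lease_times() {
    sed -n 's/.*netifd: \([^ ]*\) ([0-9]*): udhcpc: lease of .* obtained from .*, lease time \([0-9]*\).*/\1 \2/p'
}
```

On the router this could be fed from the system log, for example `logread | lease_times`. For the second log line above it prints `wan2 300`, that is, a five-minute lease handed out by the upstream device.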

ioogithub commented 1 year ago

I think you removed some info from the log, so what are wan_ip1 and wan_ip3 here (and why such a short lease time?):

Sorry, I tried to redact the IP info. So wan_ip1 and wan_ip2 are the IP, gateway etc. from the router/modem device. In this case it was the 4g. It actually sets the info, and I can briefly see a green checkmark on the status page. I am pinging a server: I go online, get 10 pings, and then the interface is restarted (lights on the NIC go out), I lose the ping, and you can see it tries to connect over and over.

Yes, in this log I was only trying the 4g device, but I see the same with the Starlink. Both of these devices are assigning the DHCP leases, IP etc. I do not control either of them or the short lease times.

Okay, I think I made some progress. I tried checking "Force link" in the Network-Interfaces-Advanced Settings tab as @Network-Traditions suggested and at first it looks like it actually worked! I got a DHCP address the first time and maintained it for 2 or 3 minutes, but then I see a crash:

Thu Sep 29 17:10:22 2022 kern.info kernel: [65191.027938] logd[4188]: segfault at 7fc9eed89902 ip 00007fc9eedd5da9 sp 00007ffdc512d6e0 error 4 in libubox.so.20220515[7fc9eedd3000+5000]
Thu Sep 29 17:10:22 2022 kern.info kernel: [65191.027956] Code: 10 48 89 ee 4c 89 ff 4c 89 f5 e8 25 fc ff ff eb 9f 48 8b 83 b0 00 00 00 48 85 c0 74 08 44 89 e6 48 89 df ff d0 45 85 e4 74 1b <80> bb e2 00 00 00 00 74 12 83 7b 58 00 75 0c 48 8d 7b 70 31 f6 67
Thu Sep 29 17:11:25 2022 daemon.notice netifd: wan2 (19306): udhcpc: sending renew to server gateway
Thu Sep 29 17:11:25 2022 daemon.notice netifd: wan2 (19306): udhcpc: lease of ip obtained from gateway, lease time 300
Thu Sep 29 17:13:56 2022 daemon.notice netifd: wan2 (19306): udhcpc: sending renew to server gateway
Thu Sep 29 17:13:56 2022 daemon.notice netifd: wan2 (19306): udhcpc: lease of ip obtained from gateway, lease time 300

This is with the 5.15 kernel.

So with this force network setting at least I am online for a few minutes rather than a few seconds but crash is not good. Any suggestions on what to try next?

ioogithub commented 1 year ago

What does "Force link" do? It looks like it may have actually worked for a moment.

The description is: Set interface properties regardless of the link carrier (If set, carrier sense events do not invoke hotplug handlers).

Are there any other settings I can try?
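
For reference, the checkbox appears to correspond to the `force_link` option of the interface section in /etc/config/network. That mapping is an assumption based on the matching description in the OpenWrt network documentation. A minimal sketch of setting it from the shell follows; the section name `wan1` is assumed from the wizard naming used in this thread, and `enable_force_link` is an illustrative helper name.

```shell
#!/bin/sh
# Sketch only: "wan1" is the section name assumed from the wizard naming in this thread.

# Enable force_link on one interface section of /etc/config/network and re-apply it.
# $1 = interface section name, e.g. wan1
enable_force_link() {
    iface="$1"
    uci set "network.${iface}.force_link=1" &&
        uci commit network &&
        ifup "$iface"
}
```

Calling `enable_force_link wan1` should have the same effect as ticking the box in the Advanced Settings tab and applying the change. `uci get network.wan1.force_link` can be used afterwards to confirm the stored value.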

ioogithub commented 1 year ago

I just discovered another issue that looks to be similar to mine here: https://github.com/Ysurac/openmptcprouter/issues/2548

@ahayes, @kb1isz look to be affected as well. Did you guys find a solution?

It is a bit different in that they are reporting they never get a connection, whereas I do get a connection briefly but then I get disconnected. But one thing in common is that we are all using the I225 NICs.

Were you guys able to test DHCP with another linux distro on your box to see if you get a DHCP address normally?

Perhaps you can try the "Force link" setting (Network->Interface->Edit->Advanced Settings) that @Network-Traditions suggested. It seemed to work for me until I get the segfault but this may be because I am using the 5.15 kernel? Are you guys on the regular kernel, perhaps you can try it.

Ysurac commented 1 year ago

The logread crash is not a big problem, it's not a kernel crash, it's only log service crash.

Ysurac commented 1 year ago

In all cases it seems that I25X cards are a problem with DHCP. Need to find a patch for the igc driver, or check whether an OpenWrt patch is the problem...

ahayes commented 1 year ago

@ioogithub Thanks for tagging me. I have actually seen both behaviours. The first was mentioned in my original (now closed) issue https://github.com/Ysurac/openmptcprouter/issues/2544#issuecomment-1250092159. But I didn't spend a lot of time on it because I was just testing to see if the adapter was recognized with kernel 5.15 before moving on to trying to get it to grab an IP via my bridging cable modem in issue #2548 that you mentioned. I did see your looping issue again though when I eventually decided to stick an Edgerouter back in between to grab the upstream DHCP so I could configure my OMR interface for static IP. That is still my setup although when I have a chance I will probably try out the proxmox workaround that @kb1isz is using.

The interfaces were able to come up and function using DHCP when I was booting a live Ubuntu 22.04.1 ISO.

ioogithub commented 1 year ago

In all cases it seems that I25X cards are a problem with DHCP. Need to find a patch for the igc driver, or check whether an OpenWrt patch is the problem...

I did test on ubuntu and it seemed to work so perhaps it is the igc driver? I guess a good test would be to load an openwrt image and see if the problem exists there.

The logread crash is not a big problem, it's not a kernel crash, it's only log service crash.

Okay I think I am making some progress, since the log crash I have had a stable connection with the Force link settings. Is it safe to keep this setting on? What does it do?

Now I may have another issue. The Starlink connection is reporting "Multipath seemed to be blocked on this connection" on the status page.

I know mptcp works with Starlink because I have been using it for 2 weeks.

  1. When I do a MPTCP support check on the starlink wan interface I get a Senders key.

  2. MPTCP Full Mesh shows me this:

IP1 id 4 subflow fullmesh dev if41 
IP2 id 10 subflow fullmesh dev eth1 
IP3 id 11 subflow fullmesh dev eth2 

IP3 is starlink.

  3. If I run `omr-test-speed` I can see traffic only going out through the 4g connection, not starlink. Any ideas?

ahayes commented 1 year ago

Ubuntu 22.04.1 looks like it might use Linux kernel 5.17. Might be worth looking at igc commits between 5.15 and 5.17 for any changes impacting i225 and i226 adapters.

Ysurac commented 1 year ago

force link is the default setting, so no problem. @ioogithub You are using 5.15 kernel ? Do you have also 5.15 kernel on the VPS ? (the multipath support on 5.15 kernel doesn't always work)

ioogithub commented 1 year ago

I eventually decided to stick an Edgerouter back in between to grab the upstream DHCP so I could configure my OMR interface for static IP.

Okay, so you fixed the issue by adding hardware between OMR and the modem, so it is a workaround.

ioogithub commented 1 year ago

force link is the default setting, so no problem. @ioogithub You are using 5.15 kernel ? Do you have also 5.15 kernel on the VPS ? (the multipath support on 5.15 kernel doesn't always work)

Actually I think it is the default for static IP but I had to activate it manually for DHCP, I was mistaken before when I mentioned it was the default. If it was the default I would have never even known there was a problem, could you make it the default?

No! I am not using the 5.15 kernel on the VPS!

Linux vps 5.4.207-mptcp #1 SMP Sun Jul 24 14:39:44 UTC 2022 x86_64 GNU/Linux

What is the quickest way to make the change and upgrade the kernel? Should I upgrade the VPS to 5.15 or should I downgrade OMR to the previous version? Stability is my main goal. How beta is the 5.15 build?

Ysurac commented 1 year ago

5.15 is a test release, so it may or may not work. Sometimes it works with a 5.4 VPS, sometimes not (test via SSH on the VPS with sysctl net.mptcp.mptcp_version; this should return 1). You can upgrade the VPS by running the command in the doc.
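
A minimal sketch of the check described above, wrapped so it can be reused in a script on the VPS. The helper name `check_mptcp_version` is an assumption for illustration; the sysctl key and the expected value of 1 are the ones given in the comment above.

```shell
#!/bin/sh
# Sketch only: check_mptcp_version is an illustrative helper name.

# Succeeds when the sysctl mentioned above reports 1, fails otherwise
# (including when the key does not exist on the running kernel).
check_mptcp_version() {
    value=$(sysctl -n net.mptcp.mptcp_version 2>/dev/null) || value=""
    if [ "$value" = "1" ]; then
        echo "net.mptcp.mptcp_version = 1"
    else
        echo "net.mptcp.mptcp_version is '${value}', expected 1" >&2
        return 1
    fi
}
```

Run over SSH on the VPS as `check_mptcp_version && echo "VPS side looks fine"`. A non-zero exit status only says the value differs from 1; it does not by itself explain why one WAN is reported as blocked.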

ioogithub commented 1 year ago

Sometimes it works with a 5.4 VPS, sometimes not (test via SSH on the VPS with sysctl net.mptcp.mptcp_version; this should return 1).

net.mptcp.mptcp_version = 1

mptcp is reported as working on the 4g interface but not the starlink.

You can upgrade the VPS by running command in the doc.

I'll try upgrading the VPS first to see if it fixes the issue. If not I will revert OMR and VPS back to 5.4.

Is the DHCP code from OpenWrt or is it modified by you for this project? If the issue with the I225 is an OpenWrt driver issue, there should be users in the OpenWrt forum with the same issue. I have seen other reports of users having issues with early versions of the I225, but since the B3 stepping revision I haven't seen anything, and nothing specific to DHCP.

Has anyone else seen DHCP problems with the I225 reported by OpenWrt users?

ahayes commented 1 year ago

I eventually decided to stick an Edgerouter back in between to grab the upstream DHCP so I could configure my OMR interface for static IP.

Okay, so you fixed the issue by adding hardware between OMR and the modem, so it is a workaround.

Yeah. It was a workaround which is why I didn't close the issue.

My VPS is Ubuntu 20.04. I didn't check what kernel it had when it was fresh but now after installing OMR via this documentation it is running kernel 5.4.207-mptcp and has been working well enough.

P.S. The adapters in question are Intel Ethernet Controller I226-V and Intel Ethernet Controller I225-V. The I22x moniker seems to be getting a bit mixed up by folks as this issue goes along. Worth correcting for the benefit of search engines and future people.

Network-Traditions commented 1 year ago

We did ultimately install a recent version of OpenWrt when first dealing with this issue to evaluate the I225 version B3 issue. It worked flawlessly with OpenWrt. When we were using v0.58.5, we deleted all references to the StarLink Ethernet NIC in "Network-Interfaces" and reset it in "Network-Devices". Next we added the Starlink Ethernet interface at the bottom of "System-OpenMPTCProuter-Settings Wizard", completed all the available entries for the interface and clicked "Save & Apply". Then we adjusted entries not exposed by the wizard in "Network-Interfaces" and "Network-Devices". It didn't work all the time, but we repeated the process until the Starlink DHCP worked and remained connected. Since Starlink IP addresses change often, static assignment is not really an option.

When we moved on to v0.59.beta6, we had to manually download and install the kmod-igc package. In this configuration the Starlink DHCP was more stable, but occasionally entered the disconnect loop. We power cycled the Starlink router, restarted the interface, rebooted and power cycled OMR until the connection remained. Once it did, it seemed to stay in place until we implemented some configuration updates or testing that interrupted the connection.

With the current version v0.59.1, this has almost been a non-existent issue. Only occasionally do we find the "Force Link" option helpful, and we have been experimenting with leaving it unchecked in order to evaluate the pros and cons of OMR having better feedback regarding the Starlink and T-Mobile connections.

We did try v0.59.beta6 with the 5.15 kernel and once again had to manually download and install the kmod-igc package. This was highly unstable and a number of missing dependencies were reported. It was at this time that we tested the OpenWrt project with success, so we realized the kmod-igc package would eventually be updated to a more reliable version.

Today, our Starlink i225 interface has been connected for 1 day and 18 hours and has received 71.49 GB and transmitted 6.17 GB during the cycle. We've been in an aggressive testing mode and are regularly rebooting and reconfiguring the system.

ioogithub commented 1 year ago

With the current version v0.59.1, this has almost been a non-existent issue. Only occasionally do we find the "Force Link" option helpful, and we have been experimenting with leaving it unchecked in order to evaluate the pros and cons of OMR having better feedback regarding the Starlink and T-Mobile connections.

So you are saying that you don't need the Force link connection with v0.59.1? I have done a ton of testing over the past two days and for me, if this setting is unchecked I get a NIC boot loop every few seconds, as soon as I turned it on, it seems to work.

@Ysurac the check is not default for DHCP, as soon as you change from static to DHCP it is unchecked, I verified several times. Perhaps you can add it to the wizard? Right now it is sort of buried in the interface advanced setting and unless you know exactly what you are looking for and search these issues it will be easy to miss. If it is moved to the wizard and even set by default then the wizard will work immediately.

Network-Traditions commented 1 year ago

Because there are times where it seems to be necessary, I'm leaving it on permanently. The 5G USB 3.0 ModemManager wwan0 interface behaves quite similarly and can benefit from the force link setting. Currently, we have it unchecked. Approximately 45 minutes ago, it was in a connect/disconnect loop. In this instance, a "Restart" of the interface did not resolve the issue as it sometimes does so we disconnected, paused for 10 seconds and reconnected the USB cable connection. The wwan0 interface has been online since then. While we have a T-Mobile static public IP address for our modem, I believe from ModemManager's point of view it still sees it as a DHCP connection.

ioogithub commented 1 year ago

Approximately 45 minutes ago, it was in a connect/disconnect loop.

Are you currently using the HUNSN device with the 4 NICs that you reported earlier or the PC ?

ioogithub commented 1 year ago

With the setting on I have had a stable link for a few hours now. I will start to test the bonding bandwidth again.

If I have a stable connection after 24 hours I would say this issue is solved and I will close the issue. Thanks for the tip @Network-Traditions. We really need to figure out a way to better organize this information.

@Ysurac is it possible for users to add info to the wiki? There are so many little pieces of info that are in the issues but it would be nice if we have a troubleshooting wiki entry or a device wiki entry, for this Intel i225 chip for example. There are a few tutorials in the issues that are almost complete such as the wireguard tutorial as well.

Network-Traditions commented 1 year ago

Yes, I've updated our production HUNSN device because of the good results we observed on our Lenovo test system. That hardware, however, does not have the Intel I225 NICs, so I was curious how the HUNSN was going to behave. Fortunately, as I stated previously, it is far more stable than our first experiences. @Ysurac your dedication and progress on this project is greatly appreciated!

Network-Traditions commented 1 year ago

@Ysurac @ioogithub as a result of realizing StarLink has an unusual DHCP configuration I stumbled across this issue with OpenWRT users that may be a more significant factor contributing to the DHCP boot loop than the interface (i.e. i225V B3 NICs or USB 3.0 ModemManager): https://nelsonslog.wordpress.com/2021/04/07/openwrt-vs-starlink-dhcp-leases/

I could see cellular services doing similar things that create the same kind of problems as StarLink's DHCP connection procedure.

hle5128 commented 1 year ago

I'm able to get this going by changing to static IP instead of DHCP; no crash yet, but as soon as I switch the WAN to DHCP, the crash loop is back.

Network-Traditions commented 1 year ago

@hle5128 do you have the "Force link" option checked in the Network-Interfaces-Advanced Settings tab for the interface you have configured as DHCP?

hle5128 commented 1 year ago

Yes I do. I have had it turned on since the first thing at setup.

On Nov 10, 2022, at 12:07 PM, Network Traditions LLC wrote: Do you have the "Force link" option checked in the Network-Interfaces-Advanced Settings tab for the interface you have configured as DHCP?


Network-Traditions commented 1 year ago

Interesting, that setting usually resolves the DHCP loop. When we started working with OMR v0.58.5, this setting didn't always work. At that time, the Intel I225 drivers were not stable. Regarding the Intel I225 NICs, if you don't have the "B3" version, this issue may not be resolvable. To clarify the issue, we installed a current release of OpenWrt on our hardware, which confirmed everything worked perfectly, lending credibility to the idea that something in the OMR software stack was the source of our problem at that time. To resolve the issue we attempted a number of fixes and succeeded in obtaining a reliable DHCP assignment and functional WAN connection. We do not know for sure which, if any, of the following procedures resolved our issue, but these are the actions we took:

  1. Deleted all WAN interfaces from Network-Interfaces and reset their respective configurations in the Network-Devices tab. Then we used the System-OpenMPTCProuter-Wizard to "Add an interface" at the bottom for the respective WAN interfaces. We then returned to the Network-Interfaces and Network-Devices tabs and completed the configuration for settings not exposed in the System-OpenMPTCProuter-Wizard.
  2. Tried the "Force link" option in the Network-Interfaces-Advanced Settings both checked and unchecked.
  3. Tried the "Use broadcast flag" option in the Network-Interfaces-Advanced Settings both checked and unchecked.
  4. Set the "Multipath setting" to disabled in the Network-Interfaces-Advanced Settings and brought the interface online before adding it to the OMR WAN pool by switching this setting to enabled.

Hope this helps.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days

khdegraaf commented 11 months ago

I believe the ultimate cause of this issue is outlined in https://github.com/Ysurac/openmptcprouter/issues/3005. I found a fix/work-around for anyone interested in this.

hle5128 commented 11 months ago

Can you post your fix?

On Oct 20, 2023, at 5:37 PM, Kevin DeGraaf wrote: I believe the ultimate cause of this issue is outlined in #3005. I found a fix/work-around for anyone interested in this.


khdegraaf commented 11 months ago

I tried two different things: just commenting out the line in 00-nego, or changing the || to a &&. The second fix is what Ysurac committed to develop here: https://github.com/Ysurac/openmptcprouter-feeds/commit/666b8fbbdddd278ddb5732227c81ee0468936f27.
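
The thread does not quote the line in 00-nego, so the following is only a generic illustration of why swapping || for && narrows the cases in which a guarded action runs. The function names and the yes/no arguments are hypothetical and do not reflect the real script; see the linked commit for the actual change.

```shell
#!/bin/sh
# Hypothetical guards, NOT the contents of 00-nego.

# Runs the action when EITHER condition holds.
guard_or() {
    if [ "$1" = "yes" ] || [ "$2" = "yes" ]; then echo "act"; else echo "skip"; fi
}

# Runs the action only when BOTH conditions hold.
guard_and() {
    if [ "$1" = "yes" ] && [ "$2" = "yes" ]; then echo "act"; else echo "skip"; fi
}

guard_or  yes no   # prints "act"
guard_and yes no   # prints "skip"
```

With ||, a hook fires as soon as one of its conditions is true, so it can act on interfaces it was never meant to touch. With &&, it stays quiet unless every condition matches.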