Ysurac / openmptcprouter

OpenMPTCProuter is an open source solution to aggregate multiple internet connections using Multipath TCP (MPTCP) on OpenWrt
https://www.openmptcprouter.com/
GNU General Public License v3.0

Realtime Traffic not updating every 3 seconds #946

Closed darthclide closed 4 years ago

darthclide commented 4 years ago

Expected Behavior

Realtime Traffic updates every 3 seconds like it says it should.

Actual Behavior

It sometimes goes 30+ seconds without updating. And sometimes it just completely freezes until I restart the router.

Steps to Reproduce the Problem

  1. Restart router
  2. Go to the Realtime Traffic page
  3. Observe the graph

Specifications

OpenMPTCProuter version: 0.54.5
OpenMPTCProuter VPS version: 0.1009 4.19.80-mptcp
OpenMPTCProuter platform: RPI4

Hardware:

  1. Anker 4-Port Ultra Slim USB 3.0 Hub A7518 (this allows injection of power into the hub from an AC outlet)
  2. ZTE Velocity 2
  3. Verizon Jetpack 8800L
  4. Motorola Moto E6
  5. DSL plugged in using this USB adapter: https://www.amazon.com/Linksys-Ethernet-Chromebook-Ultrabook-USB3GIG/dp/B00LIW8TBG/ref=sr_1_3?crid=XE74SNCH5OTK&dchild=1&keywords=linksys+usb+3.0+ethernet+adapter&qid=1585263809&sprefix=linksys+usb+3.0%2Caps%2C164&sr=8-3

Jetpack 8800L and Motorola Moto E6 are plugged into the USB hub. ZTE Velocity 2 and DSL (USB ethernet adapter) are plugged directly into the RPI4.

What is really odd is that I know it has worked correctly other times. So why is it deciding to not work correctly now? I noticed my System Log is 6 hours ahead of my current time zone. But I haven't touched any time settings this whole time. Should I ensure that Linux reports the same time on both the VPS and my router?

Could this error in the log be related? image

Frankly if I could get it to stop freezing, this would be better than nothing. However, I am going to guess that whatever issue is causing the 10+ second reporting, is also causing it to freeze sometimes.

darthclide commented 4 years ago

Extra info: image

Could these errors in the kernel log be related? Even if it isn't related to the Realtime Traffic page, could you give me some ideas on figuring out what is causing it?

darthclide commented 4 years ago

Okay this is odd... It is fixed I guess? I updated the router, but I am guessing it was simply some setting that got screwed up and simply reflashing the same version would have fixed it.

I am not going to close this though until I hear back on the attached logs. Perhaps these are what led to the problem in the first place.

Also, is there any way to limit the following actions: image

For whatever reason, it keeps asking for settings on wan3 every minute (literally). Even if this isn't related to the Realtime Traffic page issue, I really don't think it is helping the stability of my setup if it can't stop spazzing out about wan3.

Ysurac commented 4 years ago

The "Get status and settings for..." should happen every hour. And for the log, the interface eth1 had a problem and doesn't answer to ping. Previous MPTCP error was related to an interface not usable with the IPs in log.

darthclide commented 4 years ago

So if they "should" happen every hour, may I ask A. Why it keeps asking every minute and B. How to limit it? I just looked, and every time it says "done". It isn't like it is failing and trying again.

"the interface eth1 had a problem and doesn't answer to ping."

Can you go into detail? What would cause this problem? Is this a ping to the hotspot? Or is it a ping to the APN (tower)? Why would it fail, if it just successfully completed getting the settings a moment ago?

"Previous MPTCP error was related to an interface not usable with the IPs in log."

Again, can you explain what do you mean "not usable with the IPs". What would cause this?

Ysurac commented 4 years ago

I don't see a check every minute in your logs. You can't limit it. Doesn't the status page warn about a read-only filesystem?

You can see the detail in the log. It's not possible to ping via eth1 the ips 1.0.0.1 and 80.67.169.12. The settings retrieval is not related to connectivity.

The interface that has the IPs shown in your log is not usable. This can be caused by an interface that is plugged/unplugged, a hardware error, a driver error, a power error, a wrong configuration,...

darthclide commented 4 years ago

"I don't see a check every minute in your logs. You can't limit it. Doesn't the status page warn about a read-only filesystem?"

This is because I didn't want to make the post super long. I will now paste a snippet showing it constantly saying "Done" as if there are no errors. And no, it isn't warning about filesystem read only. I did encounter this issue a couple weeks ago after I rebooted too many times back to back. I just flashed the SD card back to stock though which obviously fixed the read only issue. image

Why is it constantly checking for settings if it is just coming back with "done"?

"You can see the detail in the log. It's not possible to ping via eth1 the ips 1.0.0.1 and 80.67.169.12. The settings retrieval is not related to connectivity."

Do you mean it is impossible for any interface to ping 1.0.0.1 and 80.67.169.12 because they are not real IPs? And won't it cause instability if it is constantly failing with these pings?

"The interface that has the IPs shown in your log is not usable. This can be caused by an interface that is plugged/unplugged, a hardware error, a driver error, a power error, a wrong configuration,..."

Okay, then this must be when I turn on tethering on my T-mobile phone.

Ysurac commented 4 years ago

There is no error in the log; it's because your interfaces keep going up/down/up/... After an interface has been down, a new "get status and settings" is launched.

1.0.0.1 etc... are real IPs. You should be able to ping them via each interface.

darthclide commented 4 years ago

When you say "interface was down", do you just mean the ping fails? Or do you mean it is actually failing due to some kind of power issue?

Can you tell me how to customize the router so it is less sensitive with my Verizon? (that is wan3)

Ysurac commented 4 years ago

I mean omr-tracker detects it as down because the ping test fails. By default, omr-tracker pings the gateway of the interface, and if the gateway doesn't answer, it then pings the IPs in the omr-tracker IP list.
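For readers following along, the check order described above can be sketched roughly as follows. This is a minimal illustration based only on the description in this comment, not omr-tracker's actual code; the function name `interface_is_up` and the injected `ping` callable are hypothetical.

```python
from typing import Callable, Iterable


def interface_is_up(
    gateway: str,
    fallback_ips: Iterable[str],
    ping: Callable[[str], bool],
) -> bool:
    """Return True if the gateway or any fallback IP answers a ping.

    `ping` is supplied by the caller (hypothetical helper) and returns
    True when the given host answered.
    """
    # Step 1: try the interface's own gateway first.
    if ping(gateway):
        return True
    # Step 2: the gateway did not answer, so try each IP in the tracker list.
    return any(ping(ip) for ip in fallback_ips)


if __name__ == "__main__":
    # Simulated ping: the gateway stays silent, but one public IP answers.
    reachable = {"1.0.0.1"}
    print(interface_is_up("192.168.124.1", ["80.67.169.12", "1.0.0.1"],
                          lambda host: host in reachable))
```

Under this reading, an interface is only reported as down when neither the gateway nor any listed IP answers.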

You can configure omr-tracker in Services->OMR-Tracker.

darthclide commented 4 years ago

So you are telling me that it is failing to ping my hotspot... This is really not good. I don't think I need to look at your tracker list if I can't even ping the default gateway of the device. I have a bit of thinking to do now. Unless you can give any suggestions to gain more stability with a hotspot over USB tether.

Ysurac commented 4 years ago

Some routers, dongles, etc. don't answer pings. You can run ping -B -I <yourinterface> <yourgateway> to test whether the gateway answers sometimes or not at all. Do the same to test an external IP. Also check that ICMP/ping is not filtered on this connection; if it is, you can use another test mode in omr-tracker, such as httping.
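If you want to repeat this test and compare packet loss between runs, a small wrapper along these lines might help. It is only a sketch: it assumes a `ping` binary that accepts `-B`, `-I` and `-c` (as iputils does) and prints the usual "N packets transmitted, M received" summary line. The function names are made up for the example.

```python
import re
import subprocess
from typing import List, Tuple

# Matches both "782 received" (iputils) and "782 packets received" (BusyBox).
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def build_ping_command(interface: str, target: str, count: int = 20) -> List[str]:
    """Build the ping command from the comment above, limited to `count` probes."""
    return ["ping", "-B", "-I", interface, "-c", str(count), target]


def parse_ping_summary(output: str) -> Tuple[int, int, float]:
    """Return (sent, received, loss_percent) from ping's summary line."""
    match = SUMMARY_RE.search(output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    loss = 100.0 * (sent - received) / sent if sent else 0.0
    return sent, received, loss


def ping_via_interface(interface: str, target: str, count: int = 20) -> Tuple[int, int, float]:
    """Run ping through one interface and return its parsed summary."""
    result = subprocess.run(
        build_ping_command(interface, target, count),
        capture_output=True, text=True, check=False,
    )
    return parse_ping_summary(result.stdout)
```

For example, `ping_via_interface("eth1", "192.168.124.1")` would test the gateway, and the same call with an external IP would test beyond it.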

darthclide commented 4 years ago

Yes, before I read your comment I just decided to ping 192.168.124.1 -t from my PC, and it is always 1ms, even at the same time as your router saying "eth1 switched off because ping from 192.168.124.16 error (80.67.169.12,80.67.169.40)". I have now done a ping from your software to the gateway and it is correctly 1ms or less.

But this is really strange. I refresh the log to see a statement like this: "eth1 switched off because ping from 192.168.124.16 error (114.114.115.115,4.2.2.1)" And if I immediately start a ping with my eth1 interface to 4.2.2.1, it works perfectly fine. No spikes in latency after 20 pings either.

Any ideas on why doing a manual ping would work, but your software thinks there is a failure?

Ysurac commented 4 years ago

It only does a ping using the command I gave you; try again.

darthclide commented 4 years ago

Hmmm, caught a hiccup. It paused on 80.67.169.12 and I quickly swapped to 4.2.2.1 and it was also stuck for about 5 seconds before it started pinging. So it seems pinging externally gives out from time to time. Is there any way to increase the amount of ping attempts before it thinks the interface is down? I have solid 3 bars (sometimes up to 4) of signal. So there should never be any ping issues... I am at a loss as to why this is happening.

Ysurac commented 4 years ago

All settings are in Services->OMR-Tracker

darthclide commented 4 years ago

image Which setting will do what I want? retry interval? or tries?

Also, I just did a very long ping test to 80.67.169.12 via eth1 up to icmp_seq=782 and I had 0% packet loss... It is almost as if sometimes the first ping gets stuck for a moment and then continues? Like the internet went to sleep on it for a moment? But that can't be. There is always some traffic going through the device. I can see that under Realtime Traffic. I am going to go ahead and try a continuous ping test for 30 minutes and report back any packet loss. If I get 0% packet loss, then we will know it is something weird about it starting up a ping for the first time in a few minutes.

Ysurac commented 4 years ago

You can try to increase timeout and tries.
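To see why a higher "tries" value makes the tracker more tolerant of short hiccups, here is a generic consecutive-failure counter. It is an assumption about how such a setting typically works, not a copy of omr-tracker's logic; `is_down_after` is a hypothetical name.

```python
from typing import Iterable


def is_down_after(results: Iterable[bool], tries: int) -> bool:
    """Return True once `tries` consecutive probes have failed.

    `results` is the sequence of probe outcomes (True = answered).
    A single successful probe resets the failure counter.
    """
    failures = 0
    for ok in results:
        failures = 0 if ok else failures + 1
        if failures >= tries:
            return True
    return False


if __name__ == "__main__":
    hiccup = [True, False, False, True, True]  # two lost probes in a row
    print(is_down_after(hiccup, tries=2))  # the hiccup trips a low threshold
    print(is_down_after(hiccup, tries=3))  # a higher threshold rides it out
```

With a longer timeout each probe also waits longer before counting as failed, so both settings push in the same direction.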

darthclide commented 4 years ago

Okay, I went 45 minutes pinging 80.67.169.12 and I dropped a packet only 1 time. I will go ahead and try increasing the timeout and tries though.

darthclide commented 4 years ago

Okay so far so good. It isn't spamming "get status" messages anymore. However, what do you make of these messages? image

Is there a way to increase the maximum? Are there any downsides to doing this?

And what do you make of the kern.err message?

Ysurac commented 4 years ago

Remove tun0 from vnstat in System->vnstat. tun0/omrvpn should not be in the vnstat configuration (otherwise it produces these messages).

darthclide commented 4 years ago

I do not see a vnstat option? image

Ysurac commented 4 years ago

In status menu

darthclide commented 4 years ago

image tun0 is not here?

darthclide commented 4 years ago

I am okay with editing a conf file through the terminal, but... I would rather fix this the "right" way since it might indicate a bigger underlying problem.

Ysurac commented 4 years ago

A new VnStat release with a new configuration, init, etc. will be available in the next release, and this will fix the issue.

darthclide commented 4 years ago

Hmmm, may I ask if this limit is slowing my internet down? Because if this is only affecting 5% of my speed then I can wait until your next release. But if you think it is having a much bigger impact, I hope you can tell me how to fix this in the meantime.

Ysurac commented 4 years ago

Not at all. It only clutters the log.

darthclide commented 4 years ago

Okay then. I look forward to that release then. I will keep this thread open a little longer to A. Ensure the original bug of Realtime Traffic getting stuck doesn't come back and B. That the issue of spamming "get settings" in the log does not return.

darthclide commented 4 years ago

Since I don't want to clutter your page with new issues, I was hoping you could help me understand why the log spams out this message when I enable handover mode for my T-mobile: image

It also says this on the status page: image

Basically I am just trying to get T-mobile to be used in case AT&T gives out completely (which is already very odd to me: I can understand congestion making it drop down to 500 kbps, but to drop to nothing? I am waiting for the billing cycle to end to see if this issue is still there). But in the meantime, no matter what combination of settings I use in Multi-Path, I can't seem to get the following for my Twitch stream:

  1. Use only AT&T and DSL for the stream. Do not dare touch my Verizon since it is very shaky and is often fooled into thinking it has plenty of upload, when it actually only has 500 kbps. (This is what caused my problems in the first place. For whatever reason, your software just sits there trying to push 2000 kbps through Verizon when it clearly can't handle it, instead of using "handover" or "backup" on my T-mobile.)
  2. Fall over to T-mobile if AT&T just gives up completely on upload.
  3. Immediately switch back to AT&T once it sees it is functional again.

Ysurac commented 4 years ago

Don't use handover mode, it's not tested at all. Interfaces set in backup mode are used if the other interfaces are down.

darthclide commented 4 years ago

But how can I define "interfaces down"? Because according to this: image

My stream won't be using Verizon (wan3) because it is disabled. And it should be switching over to T-mobile (wan2) if AT&T (wan1) gives out. But after 2 tests, AT&T drops to 20 kbps upload, and your software tries to push all the upload through my DSL alone. It also locks up my OBS streaming software and I have to end the process.

Ysurac commented 4 years ago

In your config, wan2 should be used if wan1 and wan4 are down.
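The backup rule as stated in this thread can be sketched in a few lines. This only illustrates the rule described here and is not OpenMPTCProuter code; the function name and the mode strings ("master", "on", "backup", "off") are made up for the example.

```python
from typing import Dict, List


def usable_interfaces(modes: Dict[str, str], up: Dict[str, bool]) -> List[str]:
    """Pick the interfaces that carry traffic under the backup rule.

    modes: interface name -> "master", "on", "backup" or "off" (hypothetical labels).
    up:    interface name -> True if the tracker currently sees it as up.
    """
    # Non-backup interfaces that are up always take priority.
    primary = [name for name, mode in modes.items()
               if mode in ("master", "on") and up.get(name, False)]
    if primary:
        return primary
    # Backup interfaces are only used once every primary interface is down.
    return [name for name, mode in modes.items()
            if mode == "backup" and up.get(name, False)]


if __name__ == "__main__":
    modes = {"wan1": "master", "wan2": "backup", "wan3": "off", "wan4": "on"}
    # wan1 is slow but still "up", so the backup wan2 is never selected.
    print(usable_interfaces(modes, {"wan1": True, "wan2": True, "wan4": True}))
```

Note that the rule only looks at up/down state: an interface that is up but crawling still counts as up, so the backup stays idle.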

darthclide commented 4 years ago

I guess this is the problem. These interfaces are not going "down", they are just simply dropping in speed to super low levels. How can I tell your software to use another interface based off speed and not if it is "down"?

Ysurac commented 4 years ago

Not possible. Maybe with a custom MPTCP scheduler...

darthclide commented 4 years ago

So ironically even though my connections are connecting perfectly to the VPS, it doesn't account for speed dropping to near zero... That is unfortunate. I am assuming since this is bleeding edge tech, you wouldn't know where I could find a person who designs such a custom scheduler?

Ysurac commented 4 years ago

You can check if there is a thesis about that. But it's difficult to detect that an interface's speed is dropping.

darthclide commented 4 years ago

Hmmm, a thesis? I wouldn't know where to begin or what search terms Google would understand. :/

But what about this: Unfortunately my T-mobile is right around the same ping as AT&T. This means that even when my wan1 (AT&T) is set as master, it rarely uses my AT&T for the stream. Since my stream is almost always using only 2 interfaces (5500kbps), is there a way I could schedule certain times of day to turn off/on interfaces?

darthclide commented 4 years ago

Or if you have a better idea to level out which devices get used (that doesn't involve bandwidth limiting, because I have determined that my mobile connections never play nice with it)

darthclide commented 4 years ago

I have managed to get some stability out of Verizon (2500 kbps upload), but because its ping is 100-120ms, it is never used for my Twitch stream unless I put an upload limit on the other interfaces. But this isn't going to work because I need my parents and guests to be able to use the full potential of T-mobile and AT&T if they are sending photos or doing facetime. Setting it as master won't help, because I have tried setting AT&T as master, and it just uses up my T-mobile bandwidth by default. Any ideas on what to do?

darthclide commented 4 years ago

Still hoping for some ideas on this situation.

Another thing: My Verizon sometimes cuts out completely, and it never comes back. But if I just disable/enable Verizon in Multipath it starts working again just fine. A. How can I stop it from failing completely? I can't increase your "tries" any higher in the tracker. And honestly I would prefer to have independent tracker settings for each wan. B. Is there any way I can script it to disable/enable Verizon if it detects 0 traffic over it after 5 seconds?

Ysurac commented 4 years ago

A: You can have independent tracker settings for each wan: in the interface section at the end of the omr-tracker page, you can add an interface and set custom settings. B: If you increase tries and still want it after 5 seconds, that's strange... Otherwise, that's what omr-tracker does.

Interfaces are used based on latency; the lowest latency is used first. Without using a custom MPTCP scheduler there is no way to change that for now.
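A simplified model of "lowest latency first" shows why a high-latency link can end up carrying nothing. This is a rough sketch only: a real MPTCP scheduler decides per packet based on each subflow's state, whereas this example just fills links in latency order. The function name is hypothetical, and all figures except Verizon's (about 2500 kbps, 100-120 ms, taken from this thread) are invented for illustration.

```python
from typing import Dict, Tuple


def allocate(demand_kbps: int, links: Dict[str, Tuple[float, int]]) -> Dict[str, int]:
    """Fill the lowest-latency links first.

    links: name -> (rtt_ms, capacity_kbps).
    Returns name -> kbps assigned to that link.
    """
    allocation: Dict[str, int] = {}
    remaining = demand_kbps
    # Visit links from lowest to highest round-trip time.
    for name, (_rtt, capacity) in sorted(links.items(), key=lambda item: item[1][0]):
        share = min(remaining, capacity)
        allocation[name] = share
        remaining -= share
    return allocation


if __name__ == "__main__":
    links = {
        "tmobile": (40.0, 3000),   # invented figures
        "att": (45.0, 3000),       # invented figures
        "verizon": (110.0, 2500),  # roughly as reported in this thread
    }
    # A 5500 kbps stream fits on the two fastest links; Verizon gets nothing.
    print(allocate(5500, links))
```

In this model Verizon is only touched once the lower-latency links are saturated, which matches the behavior reported above.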

darthclide commented 4 years ago

A. Do I just add the same IPs used in the default tracker? Also, underneath the wan3 I added it does not say: image like it does under "Default Settings". How do I tie it into scripts? Or is it just automatic?

B. But that isn't my issue. The problem is that it stays permanently off until I disable/enable it in multipath. I am assuming it is something to do with how the Twitch stream is sent, so I need a way to disable/enable my Verizon automatically. Or if you know a better way to get Verizon to re-activate?
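As a rough idea of what such an automatic disable/enable could look like, here is a small watchdog sketch. It assumes a Linux system exposing byte counters under /sys/class/net/<interface>/statistics/, and it leaves the actual "disable then enable multipath" step to a caller-supplied `toggle` function, since how to do that on OpenMPTCProuter is not covered in this thread. All names are hypothetical.

```python
import time
from pathlib import Path
from typing import Callable, Optional


def read_tx_bytes(interface: str) -> int:
    """Read the kernel's transmitted-bytes counter for one interface."""
    return int(Path(f"/sys/class/net/{interface}/statistics/tx_bytes").read_text())


def watch_for_stall(
    interface: str,
    toggle: Callable[[str], None],
    stall_seconds: float = 5.0,
    poll_seconds: float = 1.0,
    read_counter: Callable[[str], int] = read_tx_bytes,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,
) -> int:
    """Call `toggle(interface)` whenever no bytes were sent for `stall_seconds`.

    Returns the number of times `toggle` was called. `max_polls` bounds the
    loop (None means run forever); it exists mainly to make testing possible.
    """
    last = read_counter(interface)
    idle = 0.0
    polls = 0
    toggles = 0
    while max_polls is None or polls < max_polls:
        sleep(poll_seconds)
        polls += 1
        current = read_counter(interface)
        # Any change in the counter means traffic flowed; reset the idle timer.
        idle = 0.0 if current != last else idle + poll_seconds
        last = current
        if idle >= stall_seconds:
            toggle(interface)  # hypothetical hook: disable, then re-enable
            toggles += 1
            idle = 0.0
    return toggles
```

One caveat with this approach: zero traffic is also normal when nothing is being uploaded, so a watchdog like this only makes sense while a stream is known to be running.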

"Without using a custom MPTCP scheduler"

Could you direct me to a place where people are working on new schedulers? Or if you know someone that I could pay to work on a custom one?

Ysurac commented 4 years ago

A. If no IPs are defined, the default IPs are used. And it's automatic. B. I don't understand the issue. When is it disabled?

There is ProgMP, but it doesn't seem to have been updated for the latest MPTCP release. There is also mptcpd by Intel; it should be possible to add a plugin that does what you want. No idea who could do that.

darthclide commented 4 years ago

A. Will it overrule the default settings for that wan? Or is it going to overlay on top of it? B. Basically the stream will show a solid split between all my wans, but then Verizon drops out completely. And it stays off no matter how long I wait for. But simply disabling the multipath for it, and enabling it again (takes about 5 seconds to apply changes) immediately starts sending Verizon to Twitch again.

I went ahead and sent an email to Alexander Frömmgen in the hopes he might update ProgMP, or help me get mptcpd working. I hope it isn't too complicated installing an additional scheduler in your software?

Ysurac commented 4 years ago

A. It overlays. If a setting is set, the new setting is used; otherwise the default is used. B. What do you have in Status->System log when it drops?
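The overlay behavior described in point A amounts to a simple merge, sketched below. This illustrates the rule as stated here, not the tracker's real configuration code; the function name is hypothetical and the values are invented.

```python
from typing import Any, Dict


def effective_settings(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay per-interface settings on top of the defaults.

    A key set in `overrides` wins; a missing key (or one left as None)
    falls back to the default value.
    """
    merged = dict(defaults)
    merged.update({key: value for key, value in overrides.items() if value is not None})
    return merged


if __name__ == "__main__":
    defaults = {"timeout": 1, "tries": 4, "hosts": ["1.0.0.1", "4.2.2.1"]}  # invented values
    wan3 = {"tries": 10, "hosts": None}  # only "tries" is customized for wan3
    print(effective_settings(defaults, wan3))
```

So leaving the IP list empty for wan3 means it keeps using the default list while still applying its own tries value.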

For installing a scheduler you need to compile the kernel module for the kernel and arch used.

darthclide commented 4 years ago

B. I don't see anything out of the ordinary other than a few "daemon.err /usr/bin/ss-redir[29443]: remote recv: Operation timed out" and "daemon.err /usr/bin/ss-local[29440]: getpeername: Socket not connected" messages. But they do not match up with the same time I encounter these stream drops.

From what I can tell, I can get rid of these socket and timeout errors if I just rely upon your default tracker settings. But I really don't like seeing it checking status on wan3 every 2 minutes. If you don't think that will hurt anything, I suppose I can set it back to your default tracker.

darthclide commented 4 years ago

A thought occurred to me: Is there any way to look at specific MAC addresses data usage? Like see who is using the most bandwidth at a specific moment?

darthclide commented 4 years ago

These errors popped up in my kernel log:

[64446.794144] ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST unsupported, err=-512
[64446.805459] brcmfmac: brcmf_cfg80211_dump_survey: Could not get noise (-512)

darthclide commented 4 years ago

Also out of nowhere, my stable AT&T connection just dropped out... Followed by tons of these messages: image

After waiting for 2 minutes, I just changed my master over to T-mobile, and immediately my AT&T came right back up as if there were 0 problems. Something is funky with how my wans are connecting to the VPS. It is like they just permanently give up, and never even try to reconnect? But then magically connect just fine if I swap my masters around.

Ysurac commented 4 years ago

This can happen if latency is high and/or many packets are lost.