LibreQoE / LibreQoS

A Quality of Experience and Smart Queue Management system for ISPs. Leverage CAKE to improve network responsiveness, enforce bandwidth plans, and reduce bufferbloat.
https://libreqos.io/
GNU General Public License v2.0

QOS Stops Working #286

Closed nickabocker171 closed 1 year ago

nickabocker171 commented 1 year ago

Hi, I wanted to see if I could get some assistance with this software. We got the server set up and deployed on a test site with an OSPF connection. Everything appears to be working, and then suddenly the OSPF breaks my IPv4 traffic and I have to quickly terminate the OSPF session to avoid an outage. This would happen about every 30 minutes, but today it ran from 5:30 AM to about 8:00 AM before it started to crash. I'm curious if you know why?

ericinidahofalls commented 1 year ago

Here is some info about the setup as well as a few things we've noticed:

Sites: 1
Shaped devices: 640
Traffic running through the bridge/lqos: approx. 700 Mbps

About the server:
Ubuntu Server 22.04 LTS, bare metal
Intel X710 network card
Dell PowerEdge R440
LibreQoS v1.4

CPU Info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 12
Socket(s): 1
Stepping: 4

Network Card:
65:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
65:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)

network.json:
{
  "ina2": {
    "downloadBandwidthMbps": 8000,
    "uploadBandwidthMbps": 8000
  }
}

ShapedDevices.csv example:
"94_1010","vlan1010","","","ina2","","100.69.20.0/29","2604:0000:0:be00::/64",10,2,43,11,""
"94_1011","vlan1011","","","ina2","","100.69.20.8/29","2604:0000:0:be04::/64",10,2,43,11,""

Looking at syslog from when lqosd was started until IPv4 tanked:
Mar 22 11:32:49 test-qos systemd[1]: Started lqosd.service.
Mar 22 11:32:49 test-qos systemd[1]: Started lqos_node_manager.service.
Mar 22 11:32:50 test-qos lqos_node_manager[11623]: Error: Unable to access /run/lqos/bus. Check that lqosd is running and you have appropriate permissions.
Mar 22 11:32:50 test-qos lqos_node_manager[11623]: Error: Unable to access /run/lqos/bus. Check that lqosd is running and you have appropriate permissions.
Mar 22 11:32:50 test-qos lqos_node_manager[11623]: thread 'rocket-worker-thread' panicked at 'called Result::unwrap() on an Err value: SocketNotFound', lqos_node_manager/src/network_tree.rs:37:58
Mar 22 11:32:50 test-qos lqos_node_manager[11623]: note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Mar 22 11:32:50 test-qos lqos_node_manager[11623]: >> Handler network_tree_summary panicked.
Mar 22 11:32:50 test-qos lqos_node_manager[11623]: >> A panic is treated as an internal server error.
Mar 22 11:32:50 test-qos lqos_node_manager[11623]: >> No 500 catcher registered. Using Rocket default.
Mar 22 11:32:50 test-qos kernel: [150417.138437] i40e 0000:65:00.0: entering allmulti mode.
Mar 22 11:32:50 test-qos kernel: [150417.140096] i40e 0000:65:00.1: entering allmulti mode.
Mar 22 11:32:51 test-qos lqosd[11619]: [2023-03-22T11:32:51Z WARN lqos_bus::bus::unix_socket_server] Listening on: /run/lqos/bus
Mar 22 11:32:51 test-qos kernel: [150418.130496] i40e 0000:65:00.1: entering allmulti mode.
Mar 22 13:02:24 test-qos systemd[1]: fwupd.service: Deactivated successfully.
Mar 22 13:03:10 test-qos kernel: [155836.786277] perf: interrupt took too long (3938 > 3935), lowering kernel.perf_event_max_sample_rate to 50750
Mar 22 13:57:05 test-qos snapd[865]: autorefresh.go:534: Cannot prepare auto-refresh change: Post "https://api.snapcraft.io/v2/snaps/refresh": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Mar 22 13:57:05 test-qos snapd[865]: stateengine.go:149: state ensure error: Post "https://api.snapcraft.io/v2/snaps/refresh": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

thebracket commented 1 year ago

One thing that'll help immediately: change network.json to read {} - an empty set. No child nodes. That'll give you MUCH better spread across all cores.
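For clarity, this is what the suggestion amounts to relative to the network.json posted above (just the literal file contents the advice produces, nothing new):

```
{}
```

With no child nodes defined, LibreQoS spreads the shaped circuits across all available cores/queues instead of funnelling everything through the single "ina2" branch.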

In /etc/lqos.conf are you using the XDP bridge? If so, can you make sure you don't also have a kernel bridge (br0 or similar)?
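A quick way to double-check that, sketched in Python (it just shells out to iproute2's JSON output and is not part of LibreQoS itself; run it on the shaper box):

```python
#!/usr/bin/env python3
"""Rough check for kernel bridges that could conflict with the XDP bridge."""
import json
import subprocess

# Ask iproute2 for any bridge-type devices (br0 and friends).
out = subprocess.run(
    ["ip", "-j", "link", "show", "type", "bridge"],
    capture_output=True, text=True, check=True,
).stdout
bridges = json.loads(out) if out.strip() else []

if bridges:
    names = ", ".join(b["ifname"] for b in bridges)
    print(f"Kernel bridge(s) present: {names}")
    print("If use_xdp_bridge = true in /etc/lqos.conf, remove these to avoid conflicts.")
else:
    print("No kernel bridges found; only the XDP bridge is in play.")
```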

dtaht commented 1 year ago

It is so wonderful to see more new folks leaping in. The fastest way to resolve complicated issues is to join us in the Matrix channel here: https://matrix.to/#/#libreqos:matrix.org - the core devs are generally available 6 AM-7 PM PST, but our userbase is all over the world (or night owls!) and can also lean in to help.

ericinidahofalls commented 1 year ago

> One thing that'll help immediately: change network.json to read {} - an empty set. No child nodes. That'll give you MUCH better spread across all cores.
>
> In /etc/lqos.conf are you using the XDP bridge? If so, can you make sure you don't also have a kernel bridge (br0 or similar)?

I removed the site in network.json.

I am using XDP and there is no br0 interface.

I will join the Matrix channel when I can. I won't be able to turn IPv4 traffic back on until around 12 hours from now, so that I don't cause problems for the people downstream of the QoS test server again today.

Is there any information I should collect when I turn IPv4 traffic back on?

dtaht commented 1 year ago

Like I said, office hours are roughly 6 AM (call it 7 AM) to 6 PM PST. So time your test to overlap with us, and we will be able to help in real time.

Please set RUST_BACKTRACE=1. Also, why are you running fwupd?
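One way to set that for the systemd-managed services, sketched as a standard drop-in override (the unit names come from the syslog above; nothing here is LibreQoS-specific):

```ini
# Create with: sudo systemctl edit lqos_node_manager.service
# (repeat for lqosd.service if you want backtraces from the daemon too)
[Service]
Environment=RUST_BACKTRACE=1
```

Then restart the service so the panic in network_tree.rs prints a full backtrace the next time it fires.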

Do you have a mgmt interface? This, to me, looks like it was trying to bind to some address and failing...

Mar 22 13:57:05 test-qos snapd[865]: autorefresh.go:534: Cannot prepare auto-refresh change: Post "https://api.snapcraft.io/v2/snaps/refresh": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Mar 22 13:57:05 test-qos snapd[865]: stateengine.go:149: state ensure error: Post "https://api.snapcraft.io/v2/snaps/refresh": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

dtaht commented 1 year ago

Oh, you are still successfully using it and routing IPv6 traffic now? If so, could you "klingonize" (redact) a screenshot?

ericinidahofalls commented 1 year ago

That is correct. What would you like a screenshot of?

ericinidahofalls commented 1 year ago

We do have a management interface set up. As for fwupd, it was enabled when Ubuntu was installed.

dtaht commented 1 year ago

(You go to Configuration in the GUI, select Redact, then click LibreQoS...)

[screenshot illustrating the redact option]

fwupd is, I think, the firmware update daemon...

Whose OSPF are you using? Does it have logs?

Somehow catching the failure as it happens is on my mind, and I don't know how to do that yet....

dtaht commented 1 year ago

The perf interrupt thing is normal, not an issue. I am assuming that IPv4 died right around here?

Mar 22 13:57:05 test-qos snapd[865]: autorefresh.go:534: Cannot prepare auto-refresh change: Post "https://api.snapcraft.io/v2/snaps/refresh": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

So this says to me it was bound on the wrong interfaces, or didn't have a route or...

ericinidahofalls commented 1 year ago

Here is a screenshot. The snapd process is going to fail since we do not have a public-facing interface on the server for it to check for snap updates. IPv4 did die around the time the snapd update attempted to happen.

As far as OSPFv2 goes, we have a Cisco IOS IR device on one side and a Mikrotik device on the other.

[screenshot]

dtaht commented 1 year ago

Are you in monitor mode at the moment? Just passing stuff that way?

ericinidahofalls commented 1 year ago

Monitor mode is disabled. [screenshot]

dtaht commented 1 year ago

The vast majority of your traffic is showing as unshaped (the red bandwidth), and the individual entries are showing (0, 0)...

One of the issues that LibreQoS more or less solves is reserving enough bandwidth for OSPF to transit under load... but I still think the problem is deeper...

ericinidahofalls commented 1 year ago

It does not appear that any of the IPv6 traffic is being shaped or trying to shape. We see IPv6 addresses reported in the node manager that do not seem to get mapped to a device.

When testing things earlier, I was able to push 5+ Gbps of IPv6 traffic through the XDP bridge without any problem.

We use entries similar to the ones below in our ShapedDevices.csv. Am I formatting the IPv6 prefix incorrectly, by any chance?

"94_1010","vlan1010","","","","","100.69.20.0/29","2604:0000:0:be00::/64",10,2,43,11,"" "94_1011","vlan1011","","","","","100.69.20.8/29","2604:0000:0:be04::/64",10,2,43,11,""

ericinidahofalls commented 1 year ago

I should note that the traffic is untagged as it passes through the LibreQoS bridge. It is only tagged between the core router and the customer edge.

dtaht commented 1 year ago

I have certainly done myself in by sticking things on the wrong VLAN in either ispConfig.py or the .toml file... What do those lines look like in lqos.conf? We (unfortunately, perhaps) do most of our testing on VLAN-tagged data... @thebracket??

ericinidahofalls commented 1 year ago

I've shot myself in the foot with VLANs before too, just not in this situation (or have I? hehe)

This is what I have in ispConfig.py:
OnAStick = False
StickVlanA = 0
StickVlanB = 0

Here is what is in lqos.conf:

[tuning]
# IRQ balance breaks XDP_Redirect, which we use. Recommended to leave as true.
stop_irq_balance = true
netdev_budget_usecs = 8000
netdev_budget_packets = 300
rx_usecs = 8
tx_usecs = 8
disable_rxvlan = true
disable_txvlan = true
# What offload types should be disabled on the NIC. The defaults are recommended here.
disable_offload = [ "gso", "tso", "lro", "sg", "gro" ]

[bridge]
use_xdp_bridge = true
interface_mapping = [
  { name = "enp101s0f0", redirect_to = "enp101s0f1", scan_vlans = false },
  { name = "enp101s0f1", redirect_to = "enp101s0f0", scan_vlans = false }
]
vlan_mapping = []
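If you want to confirm that those offloads really ended up disabled on the shaping NICs, here is a rough check (just a sketch that shells out to ethtool; the interface names are the ones from the [bridge] section above):

```python
#!/usr/bin/env python3
"""Show offload feature flags on the shaping NICs via `ethtool -k`."""
import subprocess

INTERFACES = ["enp101s0f0", "enp101s0f1"]  # from interface_mapping above
# Long ethtool names for the gso/tso/lro/sg/gro entries in disable_offload.
FEATURES = (
    "generic-segmentation-offload",
    "tcp-segmentation-offload",
    "large-receive-offload",
    "scatter-gather",
    "generic-receive-offload",
)

for iface in INTERFACES:
    print(f"--- {iface} ---")
    out = subprocess.run(
        ["ethtool", "-k", iface], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        if line.strip().startswith(FEATURES):
            print(line.strip())  # each of these should report 'off'
```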

dtaht commented 1 year ago

The output of the LibreQos.sh script might be enlightening.

Also /etc/network/interfaces

ericinidahofalls commented 1 year ago

Here is the output of ./LibreQoS.py (with the lqosd service already running):

Running Python Version 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]
lqosd is running
refreshShapers starting at 22/03/2023 23:48:48
Not first time run since system boot.
Validating input files 'ShapedDevices.csv' and 'network.json'
Rust validated ShapedDevices.csv
network.json passed validation
ShapedDevices.csv passed validation
Backed up good config as lastGoodConfig.csv and lastGoodConfig.json
NIC queues: 12
CPU cores: 12
queuesAvailable set to: 12
Generating parent nodes
Generated parent nodes created
Executing linux TC class/qdisc commands
Executed 2706 linux TC class/qdisc commands
Executed 2706 linux TC class/qdisc commands
Executing XDP-CPUMAP-TC IP filter commands
Executed 1292 XDP-CPUMAP-TC IP filter commands
Queue and IP filter reload completed in 0.7 seconds
TC commands: 0.3 seconds
XDP setup: 0 seconds
XDP filters: 0.0092 seconds
refreshShapers completed on 22/03/2023 23:48:49

This server is using netplan; here is the netplan config:

network:
  ethernets:
    eno1:
      dhcp4: false
      addresses: [MGMTADDRESS/29]
      routes:

Let me know if you want me to stop the lqosd service and run LibreQos.py

dtaht commented 1 year ago

ROS7 on the other side? I have heard BFD was problematic there.

dtaht commented 1 year ago

A comment from a person in our chat room: "I'd be curious if he remembered to update his hello/dead timers for both IPv4 and IPv6 within OSPF. We moved away from BFD because of random drops in our OSPF; we just shortened our timers. Yes, it's more overhead, but BFD was too unreliable."
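Purely as an illustration of what "shortening the timers" can look like on the Cisco side (IOS interface-level commands with placeholder interface name and values; the Mikrotik end must be set to matching timers or the adjacency will not stay up):

```
interface GigabitEthernet0/0
 ip ospf hello-interval 1
 ip ospf dead-interval 4
 ! OSPFv3/IPv6 equivalents, if a v3 process runs on the link:
 ipv6 ospf hello-interval 1
 ipv6 ospf dead-interval 4
```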

syadnom commented 1 year ago

I would start by disabling BFD. The implementation in ROS6 is mediocre, and in ROS7 it is absent. It may not change things, but it should help with troubleshooting the rest.

dtaht commented 1 year ago

Well, the IPv6 address is formatted correctly, but it is not being picked up either...

ericinidahofalls commented 1 year ago

I can disable BFD on the link when I test again tonight.

ericinidahofalls commented 1 year ago

Stable version of ROS6 on the Mikrotik side.

dtaht commented 1 year ago

My last suggestion is to tag stuff on the way in and out (revising ispConfig.py and /etc/lqos.conf to match). This is not the right thing, as I know untagged works for at least some of our deployments, but tagged is all we have tested recently.

dtaht commented 1 year ago

And please join us in the chat when we wake up? (about 7ish PST): https://matrix.to/#/#libreqos:matrix.org

https://github.com/LibreQoE/LibreQoS/issues/285 looks similar.

ericinidahofalls commented 1 year ago

I want to give you an update on something strange that I found that would explain why IPv4 was dropping and IPv6 wasn't. For some reason, ARP requests for the Mikrotik's IP address were making it back to the Cisco with the Cisco's MAC address. This incorrect MAC shows up for only 5 seconds or so. I am going to set static MACs and see if connectivity stays up.
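For anyone who wants to watch for that kind of ARP flap from the shaping box itself, here is a rough sketch using scapy (not from the thread; the IP and interface names are placeholders, and sniffing requires root):

```python
#!/usr/bin/env python3
"""Watch ARP replies for one IP and flag when the advertised MAC changes."""
from datetime import datetime

from scapy.all import ARP, sniff  # pip install scapy

WATCH_IP = "192.0.2.1"   # placeholder: the Mikrotik's interface IP
IFACE = "enp101s0f0"     # placeholder: interface facing the OSPF link
last_mac = None

def check(pkt):
    global last_mac
    if ARP not in pkt:
        return
    arp = pkt[ARP]
    if arp.op == 2 and arp.psrc == WATCH_IP:  # op 2 = "is-at" (ARP reply)
        if last_mac and arp.hwsrc != last_mac:
            print(f"{datetime.now()} MAC for {WATCH_IP} flapped: {last_mac} -> {arp.hwsrc}")
        last_mac = arp.hwsrc

sniff(iface=IFACE, filter="arp", prn=check, store=False)
```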

dtaht commented 1 year ago

I am curious as to your status?

ericinidahofalls commented 1 year ago

@dtaht Sorry for the lack of updates so far; things got away from me this past weekend. After finding the ARP issue, IPv4 traffic has been running through the bridge just fine. I still haven't made any headway on why IPv6 traffic doesn't seem to be shaping. Do you have any thoughts on what would cause IPv6 to not be shaped?

rchac commented 1 year ago

@ericinidahofalls Any MPLS, PPPoE, VPLS, etc? Just 1500 MTU across LibreQoS?

ericinidahofalls commented 1 year ago

@rchac We currently have S-tags on EtherType 0x88a8 that get untagged and retagged on the switch the server is connected to, but the traffic is untagged as it goes through the server running LibreQoS and has an MTU of 1500 at that point.

dtaht commented 1 year ago

Are we good on this yet? Was this with v1.4-rcX or head?

carlosjs23 commented 1 year ago

Is it required to run OSPF? I'm doing a test install and I'm figuring this out for the first time, so I wonder if I can test with static IPs between the Mikrotik edge and the core.

thebracket commented 1 year ago

OSPF is not required. We suggest it (or any other routing setup) for live networks just to provide a bypass if you have problems.


dtaht commented 1 year ago

@ericinidahofalls @nickabocker171 is this resolved? If not, please join us on #libreqos:matrix.org
