Closed nickabocker171 closed 1 year ago
Here is some info about the setup as well as a few things we've noticed:
Sites: 1
Shaped devices: 640
Traffic running through the bridge/lqos: approx. 700 Mbps
About the server: Ubuntu Server 22.04 LTS (bare metal), Intel XL710 network card, Dell PowerEdge R440, LibreQoS v1.4
CPU Info:
```
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Address sizes:         46 bits physical, 48 bits virtual
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
Vendor ID:             GenuineIntel
Model name:            Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
CPU family:            6
Model:                 85
Thread(s) per core:    1
Core(s) per socket:    12
Socket(s):             1
Stepping:              4
```
Network card:
```
65:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
65:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
```
network.json:
```json
{
    "ina2": {
        "downloadBandwidthMbps": 8000,
        "uploadBandwidthMbps": 8000
    }
}
```
ShapedDevices.csv example:
```
"94_1010","vlan1010","","","ina2","","100.69.20.0/29","2604:0000:0:be00::/64",10,2,43,11,""
"94_1011","vlan1011","","","ina2","","100.69.20.8/29","2604:0000:0:be04::/64",10,2,43,11,""
```
Looking at syslog between when lqosd was started until IPv4 tanked:
Mar 22 11:32:49 test-qos systemd[1]: Started lqosd.service.
Mar 22 11:32:49 test-qos systemd[1]: Started lqos_node_manager.service.
Mar 22 11:32:50 test-qos lqos_node_manager[11623]: Error: Unable to access /run/lqos/bus. Check that lqosd is running and you have appropriate permissions.
Mar 22 11:32:50 test-qos lqos_node_manager[11623]: Error: Unable to access /run/lqos/bus. Check that lqosd is running and you have appropriate permissions.
Mar 22 11:32:50 test-qos lqos_node_manager[11623]: thread 'rocket-worker-thread' panicked at 'called Result::unwrap() on an Err value: SocketNotFound', lqos_node_manager/src/network_tree.rs:37:58
Mar 22 11:32:50 test-qos lqos_node_manager[11623]: note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Mar 22 11:32:50 test-qos lqos_node_manager[11623]: >> Handler network_tree_summary panicked.
Mar 22 11:32:50 test-qos lqos_node_manager[11623]: >> A panic is treated as an internal server error.
Mar 22 11:32:50 test-qos lqos_node_manager[11623]: >> No 500 catcher registered. Using Rocket default.
Mar 22 11:32:50 test-qos kernel: [150417.138437] i40e 0000:65:00.0: entering allmulti mode.
Mar 22 11:32:50 test-qos kernel: [150417.140096] i40e 0000:65:00.1: entering allmulti mode.
Mar 22 11:32:51 test-qos lqosd[11619]: [2023-03-22T11:32:51Z WARN lqos_bus::bus::unix_socket_server] Listening on: /run/lqos/bus
Mar 22 11:32:51 test-qos kernel: [150418.130496] i40e 0000:65:00.1: entering allmulti mode.
Mar 22 13:02:24 test-qos systemd[1]: fwupd.service: Deactivated successfully.
Mar 22 13:03:10 test-qos kernel: [155836.786277] perf: interrupt took too long (3938 > 3935), lowering kernel.perf_event_max_sample_rate to 50750
Mar 22 13:57:05 test-qos snapd[865]: autorefresh.go:534: Cannot prepare auto-refresh change: Post "https://api.snapcraft.io/v2/snaps/refresh": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Mar 22 13:57:05 test-qos snapd[865]: stateengine.go:149: state ensure error: Post "https://api.snapcraft.io/v2/snaps/refresh": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
One thing that'll help immediately: change network.json to read `{}` - an empty set, with no child nodes. That'll give you MUCH better spread across all cores.
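To illustrate the suggestion, here is a sketch of the two shapes of network.json being compared, reusing the `ina2` entry from this thread. With the empty object, every circuit becomes a top-level node and can be spread across all CPU cores instead of being pinned under one site:

Before (one site, everything funneled through it):
```json
{
    "ina2": {
        "downloadBandwidthMbps": 8000,
        "uploadBandwidthMbps": 8000
    }
}
```

After (flat, no child nodes):
```json
{}
```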
In /etc/lqos.conf, are you using the XDP bridge? If so, can you make sure you don't also have a kernel bridge (br0 or similar)?
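As a quick way to check for a leftover kernel bridge, a small sketch (my own, not part of LibreQoS): on Linux, a kernel bridge interface exposes a `bridge/` subdirectory under /sys/class/net, so listing those is enough to spot a stray br0.

```python
import os

def kernel_bridges(sys_net="/sys/class/net"):
    """List interfaces that are Linux kernel bridges (br0 etc.).

    A bridge exposes a 'bridge' subdirectory in sysfs, so its
    presence identifies one.
    """
    if not os.path.isdir(sys_net):
        return []  # non-Linux host or sysfs not mounted
    return [iface for iface in sorted(os.listdir(sys_net))
            if os.path.isdir(os.path.join(sys_net, iface, "bridge"))]

if __name__ == "__main__":
    found = kernel_bridges()
    print("kernel bridges:", found or "none")
```

From the shell, `ip link show type bridge` gives the same answer.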
It is so wonderful to see more new folk leaping in. The fastest way to resolve complicated issues is to join us in the Matrix channel here: https://matrix.to/#/#libreqos:matrix.org - the core devs are generally available 6AM-7PM PST, but our userbase is all over the world (or night owls!) and can also lean in to help.
I removed the site in network.json.
I am using XDP and there is no br0 interface.
I will try to join the Matrix channel when I can. I won't be able to turn IPv4 traffic back on until around 12 hours from now, so that I don't cause problems for the people downstream of the QoS test server again today.
Is there any information I should collect when I turn IPv4 traffic back on?
Like I said "office hours are roughly 6AM (call it 7AM)-6PM PST". So time your test to us, and we will be able to help in real time.
Please set RUST_BACKTRACE=1. Why are you running fwupd?
Do you have a mgmt interface? This, to me, looks like it was trying to bind to some address and failing...
Mar 22 13:57:05 test-qos snapd[865]: autorefresh.go:534: Cannot prepare auto-refresh change: Post "https://api.snapcraft.io/v2/snaps/refresh": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Mar 22 13:57:05 test-qos snapd[865]: stateengine.go:149: state ensure error: Post "https://api.snapcraft.io/v2/snaps/refresh": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
oh, you are still successfully using it and routing ipv6 traffic now? If so, could you "klingonize" a screenshot?
that is correct. What would you like a screenshot of?
We do have a management interface set up. As far as fwupd goes, it was enabled when Ubuntu was installed.
(you go to configuration in the gui, select redact, then click libreqos...)
fwupd is I think the firmware update daemon....
Whose OSPF are you using? Does it have logs?
Somehow catching the failure as it happens is on my mind, and I don't know how to do that yet....
the perf interrupt thing is normal, not an issue. I am assuming that ipv4 died right around here?
Mar 22 13:57:05 test-qos snapd[865]: autorefresh.go:534: Cannot prepare auto-refresh change: Post "https://api.snapcraft.io/v2/snaps/refresh": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
So this says to me it was bound on the wrong interfaces, or didn't have a route or...
Here is a screenshot. The snapd process is going to fail since we do not have a public-facing interface on the server for it to check for snap updates. IPv4 did die around the time the snapd update attempted to happen.
As far as OSPFv2 goes, we have a Cisco IOS IR device on one side and a Mikrotik device on the other.
Are you in monitor mode at the moment? Just passing stuff that way?
monitor mode is disabled
the vast majority of your traffic is showing as unshaped (the red bandwidth), and the individual ones are showing (0,0)...
One of the issues that libreqos more or less solves is reserving enough bandwidth for ospf to transit under load... but I still think the problem is deeper...
It does not appear that any of the IPv6 traffic is shaping or trying to shape. We see IPv6 addresses reported on the node manager that do not seem to get mapped to the device.
When testing things earlier, I was able to push 5+ Gbps of IPv6 traffic through the XDP bridge without any problem.
We use entries similar to the ones below in our ShapedDevices.csv. Am I formatting the IPv6 prefix incorrectly by any chance?
"94_1010","vlan1010","","","","","100.69.20.0/29","2604:0000:0:be00::/64",10,2,43,11,"" "94_1011","vlan1011","","","","","100.69.20.8/29","2604:0000:0:be04::/64",10,2,43,11,""
I should note that the traffic is untagged as it passes through the LibreQoS bridge. It is only tagged between the core router and the customer edge.
I have certainly done myself in by sticking things on the wrong VLAN in either ispConfig.py or the .toml file.... What do those lines look like in lqos.conf? We (unfortunately, perhaps) do most of our testing on VLAN-tagged data... @thebracket ??
I've shot myself in the foot with VLANs before too, just not in this situation (or have I? hehe)
This is what I have in ispConfig.py:
```python
OnAStick = False
StickVlanA = 0
StickVlanB = 0
```
Here is what is in lqos.conf:
```toml
[tuning]
stop_irq_balance = true
netdev_budget_usecs = 8000
netdev_budget_packets = 300
rx_usecs = 8
tx_usecs = 8
disable_rxvlan = true
disable_txvlan = true
disable_offload = [ "gso", "tso", "lro", "sg", "gro" ]

[bridge]
use_xdp_bridge = true
interface_mapping = [
    { name = "enp101s0f0", redirect_to = "enp101s0f1", scan_vlans = false },
    { name = "enp101s0f1", redirect_to = "enp101s0f0", scan_vlans = false }
]
vlan_mapping = []
```
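One easy thing to verify in that `[bridge]` section is that the interface_mapping is symmetric, i.e. each NIC's redirect_to points back at the other. A small sketch (not LibreQoS code) using plain dicts so it does not depend on a TOML parser:

```python
# The interface_mapping above, reduced to name -> redirect_to.
mapping = {
    "enp101s0f0": "enp101s0f1",
    "enp101s0f1": "enp101s0f0",
}

def is_symmetric(m):
    """True if every interface's redirect target redirects back to it."""
    return all(m.get(target) == name for name, target in m.items())

print("symmetric:", is_symmetric(mapping))
```

The mapping in this thread passes the check, so the XDP bridge wiring itself looks consistent.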
The output of the LibreQos.sh script might be enlightening.
Also /etc/network/interfaces
Here is the output of ./LibreQos.py (with the lqosd service already running):
```
Running Python Version 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]
lqosd is running
refreshShapers starting at 22/03/2023 23:48:48
Not first time run since system boot.
Validating input files 'ShapedDevices.csv' and 'network.json'
Rust validated ShapedDevices.csv
network.json passed validation
ShapedDevices.csv passed validation
Backed up good config as lastGoodConfig.csv and lastGoodConfig.json
NIC queues: 12
CPU cores: 12
queuesAvailable set to: 12
Generating parent nodes
Generated parent nodes created
Executing linux TC class/qdisc commands
Executed 2706 linux TC class/qdisc commands
Executed 2706 linux TC class/qdisc commands
Executing XDP-CPUMAP-TC IP filter commands
Executed 1292 XDP-CPUMAP-TC IP filter commands
Queue and IP filter reload completed in 0.7 seconds
TC commands: 0.3 seconds
XDP setup: 0 seconds
XDP filters: 0.0092 seconds
refreshShapers completed on 22/03/2023 23:48:49
```
This server is using netplan; here is the netplan config:
```yaml
network:
  ethernets:
    eno1:
      dhcp4: false
      addresses: [MGMTADDRESS/29]
      routes:
```
Let me know if you want me to stop the lqosd service and run LibreQos.py
ROS7 on the other side? I have heard BFD was problematic there.
comment from a person in our chat room: id be curious if he remembered to update his hello/dead timers for both ipv4 and ipv6 within ospf. We moved away from BFD for random drops in our OSPF, we just shortened our timers, yes more overhead but bfd was too unreliable.
I would start by disabling BFD. The implementation in ROS6 is mediocre and in ROS7 absent. May not change things but should help troubleshooting the rest.
well, the ipv6 address is formatted correctly, but not being picked up either...
I can disable bfd on the link when I test again tonight.
Stable version of ros6 on the mikrotik side
My last suggestion is to tag stuff on the way in and out (revising ispConfig.py and /etc/lqos.conf to match). This is not the right thing, as I know untagged works for at least some of our deployment, but tagged is all we have tested recently.
And please join us in the chat when we wake up? (about 7ish PST): https://matrix.to/#/#libreqos:matrix.org
https://github.com/LibreQoE/LibreQoS/issues/285 looks similar.
I want to give you an update on something strange that I found, which would explain why IPv4 was dropping and IPv6 isn't. For some reason, ARP requests for the Mikrotik's IP address were making it back to the Cisco with the Cisco's MAC address. This incorrect MAC shows up for only 5 seconds or so. I am going to set static MACs and see if connectivity stays up.
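One way to catch that kind of ARP flap from a script is to snapshot the IP-to-MAC table twice and diff it. A sketch with made-up addresses (in practice the snapshots could be parsed from /proc/net/arp or `ip neigh`):

```python
def mac_changes(before, after):
    """Return {ip: (old_mac, new_mac)} for IPs whose MAC changed
    between two ARP-table snapshots (dicts of ip -> mac)."""
    return {
        ip: (before[ip], after[ip])
        for ip in before.keys() & after.keys()
        if before[ip] != after[ip]
    }

# Hypothetical snapshots: the Mikrotik's IP briefly resolves to the
# Cisco's MAC (both IPs and MACs are placeholders, not from this thread).
t0 = {"10.0.0.2": "aa:aa:aa:00:00:01"}  # expected Mikrotik MAC
t1 = {"10.0.0.2": "bb:bb:bb:00:00:02"}  # wrong (Cisco) MAC
print(mac_changes(t0, t1))
```

Polling this in a loop would timestamp exactly when the bogus entry appears and disappears, which helps correlate it with the IPv4 drops.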
I am curious as to your status?
@dtaht Sorry for the lack of updates so far; things got away from me this past weekend. After finding the ARP issue, IPv4 traffic has been running through the bridge just fine. I still haven't made any headway on why IPv6 traffic doesn't seem to be shaping. Do you have any thoughts on what would cause IPv6 to not be shaped?
@ericinidahofalls Any MPLS, PPPoE, VPLS, etc? Just 1500 MTU across LibreQoS?
@rchac We currently have S-tags on ethertype 88a8 that get untagged and retagged on the switch it is connected to, but the traffic is untagged as it goes through the server running LibreQoS and runs with an MTU of 1500 at that point.
Are we good on this yet? Was this with v1.4-rcX or head?
Is it required to run OSPF? I'm doing a test install and figuring this out for the first time, so I wonder if I can test with static IPs between the Mikrotik edge and core.
Ospf is not required. We suggest it (or any other routing setup) for live networks just to provide a bypass if you have problems.
@ericinidahofalls @nickabocker171 is this resolved? If not, please join us on #libreqos:matrix.org
Hi, I wanted to see if I could get some assistance with this software. We got the server set up and deployed on a test site with an OSPF connection. Everything appears to be working, and then suddenly the OSPF breaks my IPv4 traffic and I have to quickly terminate the OSPF to avoid an outage. This would happen about every 30 minutes, but today I ran it from 5:30 am to about 8:00 am before it started to crash. I'm curious if you know why?