LibreQoE / LibreQoS

A Quality of Experience and Smart Queue Management system for ISPs. Leverage CAKE to improve network responsiveness, enforce bandwidth plans, and reduce bufferbloat.
https://libreqos.io/
GNU General Public License v2.0

Supporting Dynamic Network Topologies #10

Closed · mjsteckel closed 3 months ago

mjsteckel commented 2 years ago

From the docs and code, LibreQoS appears to have a model of one link between any two sites. (Correct me if I'm wrong about this!)

Because of weather, it is natural for wireless links to a) go down and come back up, or b) temporarily lose capacity.

We have a number of sites that have two parallel links between them. Some of the links are active/active, while others are active/passive.

Example 1:

Site A <=> Site B

Link 1: AF60
Link 2: AF5x-HD

The OSPF route costs are configured to pass all traffic over Link 1 as long as it is available. When rain takes the link down, all traffic is routed over Link 2.

Of course, Link 2 has less capacity than Link 1.

Example 2:

Site A <=> Site B

Link 1: AF24
Link 2: AF11

In this case, both radio links have similar capacity and we actively operate them in parallel. That is, both radio links carry traffic at all times.

Of course, severe rain storms occasionally take down the AF24 link and we lose 50% of the bandwidth capacity between the sites.


Questions:

1) What is the best way (if there is one) to manage two parallel links with LibreQoS?

2) In an ideal world, I would like to dynamically update the capacity of a site (Sites.csv) when links go down and come back up. Presuming it is possible to track link state changes, it would be easy enough to update Sites.csv and rerun LibreQoS.

Bonus scenarios:

We actually have a number of sites that have multiple (non-parallel) paths back to the data center for redundancy.

DC --- Site A ---\
            |     Site C
DC --- Site B ---/

We also have cases like this:

DC --- Site A --- Site B --- Site C

rchac commented 2 years ago

My WISP operates similarly, and the least painful/complex way I've found to handle it is to shape by site, assuming the bandwidth available to the site under normal conditions. This, however, is not ideal because, as you suggest, bandwidth on a backup p2p can be 50% or less of the primary link, and clients can trigger some packet loss if they push too much data while the primary p2p is down.

If we want to handle this more gracefully, we could have SNMP triggers tied to NMSs that would alert on the failure of a primary site link and step down the site's bandwidth cap to a predefined value in Sites.csv. An even simpler way would be for LibreQoS to ping, for each site, a designated management IP on the rain-affected end of the primary p2p backhaul. If that ping target goes offline for more than a minute, LibreQoS could start shaping the site based on a pre-defined alternate bandwidth ceiling value in Sites.csv. I'm tempted to do it that way because it's much easier to implement, but I'd like feedback there. Would that work for your network use case?
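A minimal sketch of that ping-based fallback is below. The Sites.csv column names (Site, DownloadMbps, NormalMbps, AltMbps), the 60-second threshold, and LibreQoS.py as the refresh entry point are all assumptions for illustration, not the actual schema or API.

```python
# Hypothetical sketch of the ping-based fallback described above.
import csv
import subprocess
import time

FAIL_SECONDS = 60  # how long the ping target must be down before stepping down


def ping(host: str) -> bool:
    """One ICMP echo with a 1-second timeout."""
    return subprocess.run(["ping", "-c", "1", "-W", "1", host],
                          stdout=subprocess.DEVNULL).returncode == 0


def set_site_ceiling(site: str, use_alternate: bool) -> None:
    """Rewrite Sites.csv with the normal or alternate ceiling, then refresh shapers."""
    with open("Sites.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    changed = False
    for row in rows:
        if row["Site"] == site:
            target = row["AltMbps"] if use_alternate else row["NormalMbps"]
            if row["DownloadMbps"] != target:
                row["DownloadMbps"] = target
                changed = True
    if changed:
        with open("Sites.csv", "w", newline="") as f:
            w = csv.DictWriter(f, fieldnames=rows[0].keys())
            w.writeheader()
            w.writerows(rows)
        subprocess.run(["python3", "LibreQoS.py"])  # assumed shaper-refresh entry point


def monitor(site: str, target_ip: str) -> None:
    """Poll the site's ping target and flip the ceiling after FAIL_SECONDS of loss."""
    first_failure = None
    while True:
        if ping(target_ip):
            first_failure = None
            set_site_ceiling(site, use_alternate=False)
        elif first_failure is None:
            first_failure = time.time()
        elif time.time() - first_failure > FAIL_SECONDS:
            set_site_ceiling(site, use_alternate=True)
        time.sleep(10)
```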

mjsteckel commented 2 years ago

We implement carrier-drop on the AF24 links. We're waiting for carrier-drop to stabilize in the AF60 firmware before deploying it widely...

Minor quibble - I don't think the remote radio management IP should be used, especially if out-of-band management is enabled. Instead, I think it makes more sense to use the remote router IP for the p2p link.

The timeout periods for a link being down or coming back up probably need to be configurable (somewhat similar to the carrier-drop feature on the AF24 and AF5X-HD).

I think this could work for parallel p2p links. Not sure about more complicated topologies.

Probably need a way to specify dependent links that need to be modified when a link state changes.

All of this is good, but it is limited to a link being up or down. I'm not sure it is possible/reasonable to consider cases where a link's capacity changes significantly. I'm sure it's theoretically possible, but unsure if it is practically possible or worth the effort.

rchac commented 2 years ago

That makes sense. I'd really like to try that - with ping targeting the remote router IP for both the primary p2p link and the backup link. If the remote end of the primary p2p fails, LibreQoS would switch the bandwidth ceiling for that site to the pre-defined "alternate" cap in Sites.csv, then refresh shapers. If pings fail to the remote end of both the primary and backup p2p, we'll know something else is going on within the network and that there is no need to change shapers. That way, site maintenance won't trigger lots of shaper refreshing.
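That decision could look something like the sketch below, assuming the same hypothetical ping() helper as above; the return values just name which ceiling to apply.

```python
# Sketch of the primary/backup decision described above: step down to the
# alternate ceiling only when the primary p2p is unreachable but the backup
# still answers; if both are unreachable, assume a wider outage or site
# maintenance and leave the shapers alone. ping() is a hypothetical helper.
def evaluate_site(primary_ip: str, backup_ip: str) -> str:
    if ping(primary_ip):
        return "normal"      # primary link healthy: keep the normal ceiling
    if ping(backup_ip):
        return "alternate"   # primary down, backup up: traffic failed over, step down
    return "no-change"       # both down: probably not a rain fade, don't touch shapers
```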

I agree carrier drop on the primary p2p would be important here. The alternative is doing SNMP polling of the radio capacity of the PtPs. The radios we most commonly use in these scenarios are the AF60LR, which have both SNMP and carrier drop. For your network scenario, would you personally prefer SNMP or ping-based checks? Just want to get an idea of which path I should take to make this most useful to operators.

mjsteckel commented 2 years ago

The recent AF60 firmware update 2.6 finally(!) includes SNMP but I have my doubts about the data returned. Not sure if it's the SNMP implementation or if the MIB has problems.

Regardless, the ping approach seems simpler and is likely easier/cleaner. Probably want to use a TCP ping, as ICMP ping will probably cause more false positives (ping failures), especially if a link is saturated.
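One way to do that kind of TCP "ping" is a plain connect attempt with a short timeout; a minimal sketch, where port 22 is only an assumption about what the remote router answers on:

```python
# A "TCP ping": try a TCP connect with a short timeout rather than ICMP echo,
# which is more likely to be dropped or deprioritized when a link is saturated.
import socket


def tcp_ping(host: str, port: int = 22, timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```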

dtaht commented 2 years ago

One of the really cool discoveries of https://forum.openwrt.org/t/cake-w-adaptive-bandwidth/108848/ was that many devices respond accurately to ICMP message 13 (Timestamp Request), with an NTP-like timestamp.
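For reference, a rough illustration of probing with that message type using Scapy (assuming it is installed and the script has raw-socket privileges); this is just a sketch of ICMP type 13/14, not the cake-autorate script itself:

```python
# Send an ICMP Timestamp Request (type 13); the reply (type 14) carries
# milliseconds-since-midnight timestamps that can be compared against the
# local clock. Assumes Scapy and raw-socket privileges.
from scapy.all import IP, ICMP, sr1


def icmp_timestamp_probe(host: str, timeout: float = 1.0):
    reply = sr1(IP(dst=host) / ICMP(type=13), timeout=timeout, verbose=0)
    if reply is None or reply[ICMP].type != 14:
        return None
    return reply[ICMP].ts_rx - reply[ICMP].ts_ori  # remote receive minus our originate
```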

mjsteckel commented 2 years ago

A lot of good stuff there at the link but I'm not sure I should thank you or curse you given the length of the thread. :-)

Bit apples & oranges though... A WISP operator has a much more concrete understanding of their network state (the real time capacity of wireless links, or whether they are up or down). The end-user of a service can only respond to changing network capacity by observing connection and probe metrics. Sort of being able to see behind the curtain vs only seeing shadows on the curtain.

dtaht commented 2 years ago

I agree that a clueful operator is vastly superior!!!!! But they are few and far between. I was mostly pointing you at a testable script.

I also use smokeping with the irtt plugin sometimes.

mjsteckel commented 2 years ago

It's all good and I appreciate the link!

Another approach I've pondered that might be suitable for an operator in this situation is to track OSPF adjacency-change events. If OSPF updated the routing, something changed. Might be a wireless link going down, might be a router that failed, etc. Either way, an OSPF route-change event likely means I want to have something like LibreQoS run an update. (I was reluctant to mention using OSPF for this earlier given my PTSD from dealing with EdgeOS OSPF issues...)
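A hedged sketch of what that could look like: tail a syslog feed that receives router logs and treat each adjacency-change message as a cue to refresh LibreQoS. The regex matches typical Cisco/FRR-style "%OSPF-5-ADJCHG" / neighbor-state lines and, like the entry point, is an assumption about the actual environment:

```python
# Watch a syslog file for OSPF adjacency-change messages and re-run the
# shaper whenever one appears. Log pattern and entry point are assumptions.
import re
import subprocess

ADJ_CHANGE = re.compile(r"OSPF.*(ADJCHG|AdjChg|Neighbor.*(Full|Down))")


def watch_syslog(logfile: str = "/var/log/syslog") -> None:
    with subprocess.Popen(["tail", "-F", logfile],
                          stdout=subprocess.PIPE, text=True) as tail:
        for line in tail.stdout:
            if ADJ_CHANGE.search(line):
                # Topology changed somewhere; refresh the shaper tree.
                subprocess.run(["python3", "LibreQoS.py"])  # assumed entry point
```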

dtaht commented 2 years ago

Due to memory constraints and the need to run two OSPF daemons to do IPv6 properly, back when I only had 23MB of memory on my routers, I started exploring the Babel protocol in 2007, and have worked on it, periodically, ever since. I don't know how much OSPF has improved since then, and although Babel has mostly evolved in positive directions (I love source-specific routing as one example), I don't know its suitability in your market. It has a BIRD, FRR, and standalone daemon.

syadnom commented 2 years ago

Might I suggest a vendor-agnostic approach here, which is simply a target IP address per site and a traceroute to discover topology. Maybe SNMP-walk each site router to get all assigned IP addresses to associate with the traceroutes. That would build out a network topology of site-to-edge data, and it would be dynamic. You could ping each of these target addresses, watch for a TTL change on any of them, and trigger the traceroute modeling. Alternatively, just do it routinely.

Then, build a shaping tree based on that site mapping. I did another post on chains of backhauls that is related to this.

This way it's just measured values and no tearing apart OSPF, BGP, MPLS, whatever. As long as you can get the pings to go down the 'real' path, i.e. NOT inside a VPLS or VLAN, etc. Add a /32 in a dedicated subnet that you distribute in the first IGP, for example. If you can't get SNMP data, put it in the CSV: here are all the assigned backhaul IP addresses at site 7. (I'm just considering that traceroute is going to return the interface IPs, not the loopback IPs... so you need that site-to-backhaul-link IP cross-reference.)
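A rough sketch of that idea, caching the hop list per site and re-modeling when the observed path changes; rebuild_shaping_tree() is a hypothetical hook, and the parsing assumes standard Linux traceroute output:

```python
# Record the hop sequence to a per-site target IP and trigger a re-model when
# the observed path changes. rebuild_shaping_tree() is a hypothetical hook.
import subprocess

known_paths: dict[str, list[str]] = {}


def trace_path(target_ip: str) -> list[str]:
    """Return the hop IPs reported by `traceroute -n -q 1`."""
    out = subprocess.run(["traceroute", "-n", "-q", "1", target_ip],
                         capture_output=True, text=True).stdout
    hops = []
    for line in out.splitlines()[1:]:        # skip the header line
        parts = line.split()
        if len(parts) >= 2 and parts[1] != "*":
            hops.append(parts[1])
    return hops


def check_site(site: str, target_ip: str) -> None:
    """Re-model the site whenever its path to the target changes."""
    path = trace_path(target_ip)
    if known_paths.get(site) != path:
        known_paths[site] = path
        rebuild_shaping_tree(site, path)     # hypothetical: regenerate the hierarchy
```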

interduo commented 2 years ago

If you use NV2 on MikroTik or airMAX from Ubiquiti, this is not the case anymore.

syadnom commented 2 years ago

> That would build out a network topology of site-to-edge data and it would be dynamic. You could ping each of these targe

Not entirely sure how this has anything to do with the thread here.

dtaht commented 1 year ago

What have we not solved on this bug @mjsteckel in v1.3?