binpacking is not optimal - improvements needed

interduo commented 1 year ago

We got 3 types of clients with different type of utilisation:

radio (many customers, not more than 50M),
fiber (most customers with 1gbps max, not constant utilisation),
fiber-other-isp (few customers with 2gbps max, constant utilisation),

Problem: I saw longterm CPU stats - one or some times two core are constantly too much pressed (above 75%) during rush hours, (it's too much because some other customer could get not available his bandwidth) and there are more than 10 cores below 10%.

First idea: Use total bytes counter (or any other if it's better) from tc command to rebalance circuits between CPU cores.

Second idea: There should be a flag in shapedDevices.csv for Circuit (if in any row with this circuitID flag is on for whole circuit). Customers with this flag should be on different cpu core alone. If there is more circuits with this flag than half of cpu core - just put them together in reseved cpu core.

It's also suggested to separate most busy cores between physical CPU if there are more than one.

What do You think about it @rchac @dtaht @thebracket?

rchac commented 1 year ago

Just curious - if you turn binpacking off, how much more or less balanced are CPU cores?

interduo commented 1 year ago

I will check that outside "rush hours".

interduo commented 1 year ago

Turning binpacking off generates less balanced cpu cores load. There are more busy cores and more idle cores.

interduo commented 8 months ago

Lately I found that monster: Intel Core i9-14900K whitch has 61K CPUmarks. It has some performance cores with 6GHz and few cores with less cpu clock speed. Moreover AMD also said somewhere on the conference that in new ryzens would change design.

Binpacking depending only on past customer network data would be not optimal - so I suggest to also catch cpu stats.

thebracket commented 8 months ago

I'm pretty sure that what you're asking for is an algorithm that handles both variable bin sizes and variable circuit weights.

For example, on my test workstation lscpu --all --extended identifies the various core types:

CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ 0 0 0 0 0:0:0:0 yes 4900.0000 800.0000 1975.8340 1 0 0 0 0:0:0:0 yes 4900.0000 800.0000 950.2440 2 0 0 1 4:4:1:0 yes 4900.0000 800.0000 800.0000 3 0 0 1 4:4:1:0 yes 4900.0000 800.0000 3638.5171 4 0 0 2 8:8:2:0 yes 4900.0000 800.0000 2352.7849 5 0 0 2 8:8:2:0 yes 4900.0000 800.0000 4353.0591 6 0 0 3 12:12:3:0 yes 4900.0000 800.0000 3775.8230 7 0 0 3 12:12:3:0 yes 4900.0000 800.0000 800.0000 8 0 0 4 16:16:4:0 yes 5000.0000 800.0000 4844.3970 9 0 0 4 16:16:4:0 yes 5000.0000 800.0000 800.0000 10 0 0 5 20:20:5:0 yes 5000.0000 800.0000 4876.8018 11 0 0 5 20:20:5:0 yes 5000.0000 800.0000 4407.2832 12 0 0 6 24:24:6:0 yes 4900.0000 800.0000 3902.9741 13 0 0 6 24:24:6:0 yes 4900.0000 800.0000 800.0000 14 0 0 7 28:28:7:0 yes 4900.0000 800.0000 800.0000 15 0 0 7 28:28:7:0 yes 4900.0000 800.0000 800.0000 16 0 0 8 36:36:9:0 yes 3800.0000 800.0000 799.3740 17 0 0 9 37:37:9:0 yes 3800.0000 800.0000 800.0000 18 0 0 10 38:38:9:0 yes 3800.0000 800.0000 3056.3379 19 0 0 11 39:39:9:0 yes 3800.0000 800.0000 800.0000

As you can see, my desktop (an i7) has 20 cores. Some peak at 4900, some at 5000, some at 3800. You can also see the current core speeds, but that's not overly relevant because it changes dynamically over time.

So --- assuming that you've left the "extra" cores enabled --- what you really want is to have 20 "bins" into which circuits are assigned. You want to maximize the spread based on the "weight" of each circuit, while limiting the capacity relative to one another. That's actually a tricky problem without doing an "all combinations" algorithm, but "best fit" and "first fit" can both give an approximately optimal answer ( see https://cstheory.stackexchange.com/questions/12386/bin-packing-approximation-with-different-bin-sizes for some theory).

So where do the weights come from? For regular users, the closest we have is "plan size" and optionally some recent usage. Plan size is already in 1.4, recent usage is hoped (but not done yet - just get LTS, it's good for you) for inclusion in 1.5. If you have long-term stats, then you can do a much more accurate "usage at this time of day, day of week, in previous weeks" - which is live in the 1.5 development branch and has been very helpful for live systems at both my ISP and Robert's. (On aggregate, people are remarkably predictable). You can mix and match these statistics, to give a reasonable weight. [Note that both Robert & I use a hierarchy, and we now also binpack hierarchies. It's pretty neat, you can watch the queues move around - primarily business sites tend to be demoted at night, while primarily residential sites are promoted to having their own slots].

Now the big problem is translating "weight units" (how much does a circuit need) into "bin units" (how much capacity is available). We'd need to solve that, and implement a high-speed variable bin-size packing algorithm to go much further. Like I said, the improvements we've already made in the development branch have helped a lot - but I can see why this would help further.

On Tue, Mar 19, 2024 at 5:17 AM Interduo @.***> wrote:

Lately I found that monster: Intel Core i9-14900K whitch has 61K CPUmarks. It has some performance cores with 6GHz and few cores with less cpu clock speed. Moreover AMD also said somewhere on the conference that in new ryzens would change design.

Binpacking depending only on past customer network data would be not optimal - so I suggest to also catch cpu stats.

— Reply to this email directly, view it on GitHub https://github.com/LibreQoE/LibreQoS/issues/424#issuecomment-2006693419, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADRU437VWEHMQOLIBW6GG73YZAGFLAVCNFSM6AAAAAA7WRASGKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMBWGY4TGNBRHE . You are receiving this because you were mentioned.Message ID: @.***>

interduo commented 8 months ago

OK - i will wait for 1.5 first release and then update my prod instance.

Nowadays taking "plan size" into account is useless because one company with 250M would generate more load than single customers with 1GBit network connection.

LibreQoE / LibreQoS

binpacking is not optimal - improvements needed #424