LibreQoE / LibreQoS

A Quality of Experience and Smart Queue Management system for ISPs. Leverage CAKE to improve network responsiveness, enforce bandwidth plans, and reduce bufferbloat.
https://libreqos.io/
GNU General Public License v2.0

Intel XL710 vs Mellanox ConnectX-6 Lx #126

Closed. Belyivulk closed this issue 1 year ago.

Belyivulk commented 1 year ago

Hey Team,

I thought I'd write a quick note here that we tried moving from the Intel XL710 (40G) to the Mellanox ConnectX-6 Lx (25G) NICs and found that the Mellanox performs considerably worse than the Intel. Specifically, we were seeing upwards of 20% of CPU time consumed by software interrupts, versus ~6% for the Intel, which would kneecap scalability.

I had a play around enabling/disabling various offloads and trying different driver revisions, but ultimately had to abandon that idea for the week.

My best guess is that the Intel is doing something in hardware that the Mellanox is not, but I'd be interested to hear your ideas. In an ideal world (due to port count challenges) we would use the Mellanox, if we can get those SIs down to a manageable level.
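Toggling offloads is just the usual ethtool dance, roughly along these lines (netdev0 is a placeholder for the actual interface name, and the exact set of offloads I tried varied):

# Show which offloads are currently enabled
ethtool -k netdev0
# Example: turn off the large aggregation offloads and retest
ethtool -K netdev0 gro off lro off gso off tso off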

interduo commented 1 year ago

Hi, Mellanox is a very good product, even the ConnectX-4.

Look at the tips in https://github.com/rchac/LibreQoS/issues/96 and try setting the interrupt coalescing differently.

Belyivulk commented 1 year ago

In the context of Mellanox, these commands make much more sense to me now:

ethtool -C netdev0 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 122
ethtool -C netdev1 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 122
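Before applying those, it's worth capturing what the driver is currently running with, so there's a baseline to compare against (netdev0 is again a placeholder):

# Show the current interrupt coalescing settings, including the adaptive-rx/tx state
ethtool -c netdev0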

interduo commented 1 year ago

Could you paste before and after results?

Belyivulk commented 1 year ago

Next week, yeap :)

dtaht commented 1 year ago

This is for density or cost or?

I was looking over one NVIDIA product that claimed it had OVS onboard (which we ported fq_codel to 7 years ago)...

Belyivulk commented 1 year ago

Purely port count; we're running low on 40/100G ports (which the Intel NICs use), so it would be nice if we could use the 25G Mellanox cards (plenty of ports available at that speed).

dtaht commented 1 year ago

20% overhead (per core?) doesn't sound all that bad to me, unless fire starts coming out of the computer. DPDK is 100% overhead. What was the CPU in these cases?

Belyivulk commented 1 year ago

Sorry, I should have been clearer. CPU loading (for the queueing) was around 10% per core, but with the interrupts that was being pushed to 30% overall loading per core.

With the Intel NICs I am seeing around 12% overall loading for the same traffic (as it stands right now).
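For reference, the per-core split between normal CPU time and softirq time is visible in the %soft column of mpstat, or the si field in top (mpstat comes from the sysstat package):

# Per-core CPU breakdown, refreshed every second; %soft is softirq time
mpstat -P ALL 1
# Alternatively run top and press 1 to expand the per-CPU view; watch the si field
top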

dtaht commented 1 year ago

I feel a need to do a bit of math. Please check me. 25Gbit/s is 520ns per large packet, 24ns for small?

At the 62us interrupt rate suggested by @interduo, that's 2583 small packets on the rx ring before the CPU gets an interrupt, or 112 large ones. So a large rx ring is just fine by my lights. That's still a lot of packets on the rx ring per interrupt, however, and a higher interrupt frequency would be fine by me if sustainable. What's the default?
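A quick way to redo that arithmetic (the frame sizes are assumptions: 1538 bytes on the wire for a full-MTU frame including Ethernet framing, 84 bytes for a minimum-size frame, so the exact figures shift a little with your assumptions):

# Back-of-the-envelope: serialization times at 25 Gbit/s and packets per 62 us interval
awk 'BEGIN {
    rate  = 25e9          # link rate, bits per second
    big   = 1538 * 8      # full-MTU frame on the wire, bits
    small = 84 * 8        # minimum Ethernet frame on the wire, bits
    ival  = 62 / 1e6      # coalescing interval, seconds
    printf "per packet: %.0f ns large, %.0f ns small\n", big / rate * 1e9, small / rate * 1e9
    printf "per interrupt: %.0f large or %.0f small packets\n", ival * rate / big, ival * rate / small
}'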

And 30% of CPU is partly the wrong way to measure it; what I care about is cache utilization, locks, and context-switch times. Servicing fewer packets at a higher interrupt rate takes better advantage of the cache.

I would start worrying when a given core cracks 50%.

...

My specific question was: what CPU type were you using?

The Xeon Gold is "only" 18 cores. The AMD EPYC Rome 7702 has 64 cores and an astounding 256MB of cache. Got any of those? :P

Belyivulk commented 1 year ago

I can't easily switch from Intel to AMD, so we won't go there. We're on the Intel Xeon Gold 6154.

I suppose the point I was making was that switching from Intel to NVIDIA saw the CPU usage go up considerably (and from what I could tell in top, it was largely software interrupts). If it scaled like that, then by peak time we'd be using 100% of all 18 cores, whereas with the Intel NICs we float between 20 and 30%.

Finally, you have access to this box; so you're welcome to have a poke around :)

Belyivulk commented 1 year ago

Okay, so don't do this on Intel NICs:

"ethtool -C netdev0 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 122 ethtool -C netdev1 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 122"

On the face of it, CPU usage goes down. In practice, it destroys the user experience.
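Undoing it amounts to turning adaptive coalescing back on, roughly like this (netdev0/netdev1 are placeholders as before, and the usecs values may also need to be restored to whatever the driver was using originally):

# Re-enable adaptive interrupt coalescing on both shaping interfaces
ethtool -C netdev0 adaptive-rx on adaptive-tx on
ethtool -C netdev1 adaptive-rx on adaptive-tx on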

dtaht commented 1 year ago

I had figured it would. Thx for the confirmation. Batching is a bad idea. Do less work, more often, a mantra.

interduo commented 1 year ago

Okay, so don't do this on Intel NICs:

ethtool -C netdev0 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 122
ethtool -C netdev1 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 122

On the face of it, CPU usage goes down. In practice, it destroys the user experience.

Why? Could you share more detail on that? How did you check it?

How much network throughput do you have?

Belyivulk commented 1 year ago

If you let me know what testing you'd like, I can do that specific test. I implemented the change ~10 days ago and noted that CPU loading decreased (which was nice). There was no observable impact, so I left it as is.

In the past 3-4 days we saw an increase in "peak time" latency complaints (gamers), which were all but nil prior, with users reporting "in the past few weeks". I've been quite busy, so I hadn't really been using the net at peak time until last night, when my wife and I went to play Fortnite. The game had a large update which both PCs started, and that redlined the connection to my house (which is limited to 50/6 Mbit via LibreQoS). I noted that I was getting 30 Mbit while my wife was struggling for 10 (and we had someone else streaming in the house). The stream began to buffer, so I tried to load WhatsApp (on my laptop) and, as you'd expect, that experience was painfully slow. I tried a few other sites; painfully slow.

I jumped into LibreQoS and undid that change. CPU jumped ~4% immediately (from 20% overall to 24% overall) and everything became almost instantly snappier and otherwise perfect (with the game update still downloading). So that's how the issue came to my attention and how I believe I've resolved it.

To answer the other question: at that time we were doing 8.4 Gbit/s. We have ~5000 subscribers routed through the box at present, and we shape by Site, AP, and then subscriber, with ack-filter on (per dtaht's suggestion).
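For context, ack-filter here refers to the CAKE qdisc option; at the tc level a standalone shaper with it enabled looks roughly like this (the device name and 50 Mbit rate are purely illustrative, and LibreQoS builds its shaping hierarchy itself rather than via a one-off command like this):

# Illustrative only: a CAKE instance with ACK filtering at 50 Mbit
tc qdisc replace dev eth1 root cake bandwidth 50mbit besteffort ack-filter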

dtaht commented 1 year ago

This is what I get for not monitoring more closely. :/ Maybe it will show up in the data. My concern was that you'd be dropping packets in the rx ring with such a large interval, and there are other issues.
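Checking for that directly is just standard ethtool (netdev0 is a placeholder; the exact counter names vary by driver):

# NIC/driver statistics; look for rx ring / fifo / buffer drop counters
ethtool -S netdev0 | grep -i -E 'drop|miss|discard'
# Current versus maximum rx/tx ring sizes
ethtool -g netdev0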

dtaht commented 1 year ago

Belyivulk - I did punt this to v1.4, but ya know, if you are into having more "hold my beer, I got this" kind of moments during biz hours in the USA....

Belyivulk commented 1 year ago

I don't understand, lol. I intend to revisit the Mellanox NICs in the AMD test unit that's arriving in a couple of weeks :)

dtaht commented 1 year ago

You talked about your AMD experience as being mostly negative, but I was inclined to blame the card more than the CPUs. Do you still have that? Any chance you could test the same card in the AMD box next time? With the new bridge?

Belyivulk commented 1 year ago

Heya, sorry, no, I don't have the box anymore. I do have the NICs and can re-test those at a later date.