LibreQoE / LibreQoS

A Quality of Experience and Smart Queue Management system for ISPs. Leverage CAKE to improve network responsiveness, enforce bandwidth plans, and reduce bufferbloat.
https://libreqos.io/
GNU General Public License v2.0

Realtime kernels and spectre mitigations #141

Open dtaht opened 2 years ago

dtaht commented 2 years ago

I generally run realtime kernels wherever I can (leveraging Ubuntu Studio in my case). My principal application (Ardour) requires it in order to have good flow for audio mixing. With RT I can usually do 2.7ms of latency reliably for audio mixing. RT has historically been used for many device control applications, and I've sometimes worried a lot that my data on packet processing was skewed flatter because I'm always testing on an RT kernel. Anyway, testing on an RT kernel on bare metal might show an improvement in IRQ handling and other long-tail p99 latencies, and I do rather highly recommend using it on your desktop.
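To make "long-tail p99" concrete for any before/after comparison, here is a minimal sketch of a nearest-rank percentile. It assumes you already have latency samples in milliseconds from some other tool; the function name `percentile` and the sample values are illustrative only, not something LibreQoS provides.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers.

    pct is a value in (0, 100], e.g. 99 for p99.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds (made up for illustration).
latencies_ms = [0.4, 0.5, 0.5, 0.6, 0.7, 0.9, 1.1, 2.7, 3.0, 9.5]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

Comparing p99 rather than the mean is what would show whether an RT kernel actually trims the tail.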

Somewhat related to that: the plethora of Spectre vulnerability mitigations are not needed on a bare metal system, and some can be disabled at boot, others compiled out. Spectre is primarily a virtual-machine-breaching problem. One of the most recent vulnerabilities cut performance by over 30% with the initial round of mitigations.

So it does seem possible to produce a more tuned kernel for what LibreQoS is doing. But it would help to measure more first; this is just a note for future use.

interduo commented 2 years ago

Remember to set all of these, not only mitigations=off: noibrs noibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off tsx=on tsx_async_abort=off mitigations=off

If you use virtualization, set these kernel options both in the hypervisor and in the guest VM.
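Before and after changing boot options, it may help to check what the running kernel actually reports. The sketch below reads the standard Linux sysfs directory `/sys/devices/system/cpu/vulnerabilities` and `/proc/cmdline`. The function names and the `MITIGATION_FLAGS` set are illustrative assumptions; the flag list is simply copied from the comment above and is not checked against any particular kernel version.

```python
from pathlib import Path

VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")
CMDLINE = Path("/proc/cmdline")

# Flags copied from the comment above; not every kernel recognizes all of them.
MITIGATION_FLAGS = {
    "noibrs", "noibpb", "nopti", "nospectre_v2", "nospectre_v1", "l1tf=off",
    "nospec_store_bypass_disable", "no_stf_barrier", "mds=off", "tsx=on",
    "tsx_async_abort=off", "mitigations=off",
}


def read_mitigation_status(vuln_dir=VULN_DIR):
    """Map each vulnerability name to the status string the kernel reports."""
    status = {}
    if not vuln_dir.is_dir():
        return status  # not Linux, or the kernel does not expose this directory
    for entry in sorted(vuln_dir.iterdir()):
        if entry.is_file():
            try:
                status[entry.name] = entry.read_text().strip()
            except OSError:
                status[entry.name] = "unreadable"
    return status


def active_mitigation_flags(cmdline_path=CMDLINE):
    """Return the flags from MITIGATION_FLAGS present on the kernel command line."""
    try:
        tokens = cmdline_path.read_text().split()
    except OSError:
        return []
    return [token for token in tokens if token in MITIGATION_FLAGS]
```

Calling `read_mitigation_status()` in both the hypervisor and the guest, and comparing the two dictionaries, is one way to confirm that the options took effect in both places.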

The penalty for not turning them off depends on the CPU generation: the newer the CPU, the smaller the penalty. We did a performance check, and over a week of CPU usage (average, not max) there was a ~3% difference (on a Proxmox VM), although the network throughput was slightly different.

@rchac maybe You could add this to wiki in the performance tips?

rchac commented 2 years ago

As mentioned by @tohojo here, it's important we consider the security implications before recommending disabling mitigations by default. @interduo true: often, newer BIOS and CPU firmware have the mitigations baked into hardware, to the point where disabling them in the kernel either has no impact or actually reduces performance. A 3% difference may not be enough to justify the risk. That said, I agree there are cases where turning them off could dramatically improve performance in VM hosts where LibreQoS is the only guest.

We should probably keep doing measurements before/after on additional recent CPUs to see what impact it has, and evaluate the threat model. InfluxDB and the Flask API seem like the main potential targets - but these are usually only accessible to ISP employees anyway. I just want to be careful in prescribing turning off mitigations.
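For the before/after measurements, a minimal sketch of how the comparison could be expressed, assuming CPU usage samples (in percent) are collected over comparable periods and traffic levels. The function name and the sample numbers are hypothetical and are not measured results.

```python
from statistics import mean


def relative_change_pct(before, after):
    """Percentage change of the mean of `after` relative to the mean of `before`."""
    base = mean(before)
    if base == 0:
        raise ValueError("mean of 'before' samples must be non-zero")
    return (mean(after) - base) * 100.0 / base


# Hypothetical average CPU usage samples in percent (not real measurements).
cpu_with_mitigations = [40.0, 42.0, 38.0]
cpu_without_mitigations = [39.0, 40.5, 37.0]
change = relative_change_pct(cpu_with_mitigations, cpu_without_mitigations)
```

A negative result means lower average CPU usage after the change. Because throughput can differ between runs, it is probably worth recording it alongside CPU usage so the two periods can be judged comparable.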