Closed peterneutron closed 1 year ago
I would recommend not running irqbalance inside a VM. Normally it should be fine, but depending on how you have CPU pinning configured from your host to your guest, it's possible that your guest may affine an interrupt to a physical CPU that isn't mapped into your guest CPUs at all, leading to a loss of softirq handling and the hang you describe. Just let the host handle interrupt affinity.
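As a rough way to look for this from inside the guest, one can compare each IRQ's affinity list with the CPUs the guest reports as online. The sketch below assumes the standard Linux files `/proc/irq/<n>/smp_affinity_list` and `/sys/devices/system/cpu/online`; the function names are made up for illustration, and the check cannot see how the host pins guest CPUs, so a clean result does not rule the problem out.

```python
import os

def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0-3,5' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def irqs_with_offline_cpus(proc_irq="/proc/irq",
                           online_path="/sys/devices/system/cpu/online"):
    """Return {irq: offline_cpus} for IRQs affined to any CPU not online."""
    with open(online_path) as f:
        online = parse_cpu_list(f.read())
    suspicious = {}
    for name in os.listdir(proc_irq):
        path = os.path.join(proc_irq, name, "smp_affinity_list")
        if not name.isdigit() or not os.path.isfile(path):
            continue
        try:
            with open(path) as f:
                cpus = parse_cpu_list(f.read())
        except OSError:
            continue  # some IRQ entries may not be readable
        offline = cpus - online
        if offline:
            suspicious[int(name)] = offline
    return suspicious

if __name__ == "__main__":
    for irq, cpus in sorted(irqs_with_offline_cpus().items()):
        print(f"IRQ {irq} includes CPUs that are not online: {sorted(cpus)}")
```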
Thanks for the advice. The base M1 has 8 cores in total, split into 4 high-performance and 4 energy-efficiency cores. QEMU exposes only the 4 HP cores. Is it still possible that the VM tries to access the other four? And why wouldn't it cause any problems in any version below 1.9.x?
Honestly, I don't know, I've never looked at qemu on macos before. But I certainly can't diagnose a problem in irqbalance when the symptom doesn't occur in irqbalance. Long story short, all irqbalance does is write affinity values to the proc/irq/
As for why it only happens on post-v1.9.0 versions, I suspect it's because of several fixes that went into the balancing algorithm. Prior to 1.9.0, several irqs on various non-x86 arches were never getting selected for rebalancing.
Thanks for taking the time to look at this. I will close this now and will report back if anything comes up on my end.
@nhorman I found the commit that causes this issue on my setup and, to my surprise, it seems to have nothing to do with the AARCH64-related commits since 1.8.0.
Commit: 2a66a666d3e202dec5b1a4309447e32d5f292871
What I did was essentially build irqbalance 1.8.0 with every commit applied one by one up until 1.9.2. At the moment I'm running 1.9.2 with only this commit removed, which seems to have fixed the hanging network adapters. Can you make any sense of this?
That has everything to do with AARCH64, in that the code you are referring to affects all arches. As I noted above, prior to that change, several irqs were never getting selected for rebalancing, which hid the hang from you.
Then why doesn't it happen when I disable irqbalance and manually change affinity via /proc/irq?
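For reference, the manual change mentioned here amounts to writing a hexadecimal CPU bitmask to `/proc/irq/<n>/smp_affinity`. A minimal sketch follows; the helper names are hypothetical, and it assumes fewer than 32 CPUs so that the mask fits in one group with no comma separators. Writing the file needs root, and the kernel may reject masks it considers invalid.

```python
def cpus_to_mask(cpus):
    """Convert an iterable of CPU numbers into a hex bitmask string."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

def set_irq_affinity(irq, cpus, proc_irq="/proc/irq"):
    """Write the bitmask for `cpus` to <proc_irq>/<irq>/smp_affinity."""
    path = f"{proc_irq}/{irq}/smp_affinity"
    with open(path, "w") as f:
        f.write(cpus_to_mask(cpus))

# Example (placeholder IRQ number, run as root):
#   set_irq_affinity(45, {0, 1})   # writes "3"
```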
I don't know, @peterneutron, but it's not something I'm going to be able to help you with. Irqbalance's interface to the kernel is exactly the same as the one you are writing to manually. There may be a timing issue at play here that triggers the hang, which you can use irqbalance to reproduce, but if your system is hanging as a result of whatever that magic order of operations is, the root cause cannot be irqbalance. If the adapter in question stops responding to interrupts, you're going to need to instrument the kernel driver (or write a systemtap script) to figure out what's going on.
I thank you again and will leave it at that, because instrumenting the kernel driver or writing a systemtap script is way out of my league.
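Short of instrumenting the driver, a first check is whether the interrupt counters for the affected interface stop increasing once it hangs. The sketch below compares two snapshots of `/proc/interrupts`. It assumes the usual layout (a header row of CPU columns, then one row per IRQ with per-CPU counts), and the function names and the 5-second interval are illustrative. Idle IRQs also show up as unchanged, so the output only narrows things down.

```python
import time

def parse_interrupts(text):
    """Return {irq_label: total_count} summed over all CPU columns."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        total = 0
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break  # reached the controller/description columns
            total += int(field)
        totals[label.strip()] = total
    return totals

def unchanged_irqs(before, after):
    """List IRQ labels whose total count did not change between snapshots."""
    return sorted(irq for irq, count in after.items()
                  if irq in before and before[irq] == count)

if __name__ == "__main__":
    with open("/proc/interrupts") as f:
        first = parse_interrupts(f.read())
    time.sleep(5)
    with open("/proc/interrupts") as f:
        second = parse_interrupts(f.read())
    print("No new interrupts on:", ", ".join(unchanged_irqs(first, second)))
```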
Host: M1 Mac mini
Host OS: macOS 13.0.1 (22A400)
QEMU: 7.1.0
Guest: Arch Linux ARM (virtualized)
Kernel: 5.19.8
Network: Bridged (2 interfaces)
Affected irqbalance version: >= 1.9.0
Last working irqbalance version: <= 1.8.0
Summary: Every version of irqbalance >= 1.9.0 hangs one of the two interfaces, at random, after an arbitrary amount of time.
Steps taken: Cross checked with different combinations of QEMU and the kernel, issue still persists. Checked service/systemd/network/kernel logs but couldn't make out any related entries.
I know this is a niche case and my ability to debug this is limited, but maybe someone is able to point me in the right direction.