Irqbalance / irqbalance

The irqbalance source tree - The new official site for irqbalance
http://irqbalance.github.io/irqbalance/
GNU General Public License v2.0
576 stars 139 forks source link

Non-PCI IRQs not balanced. #326

Open sean-anderson-seco opened 1 month ago

sean-anderson-seco commented 1 month ago

I am running irqbalance on an embedded system, where the vast majority of interrupts are from platform devices. irqbalance makes an attempt to classify IRQs based on their name (guess_arm_irq_hints). However, this is generally ineffective because:

Because of all these reasons, the heuristics used by irqbalance generally just result in the default of IRQ_TYPE_LEGACY/IRQ_OTHER. Although this does not correctly identify the type of interrupt, it might have been fine except that all CPUs are in a single package and cache domain. Because IRQ_OTHER interrupts default to BALANCE_PACKAGE, no balancing is performed at all! This may be remedied using a script

# cat balance.sh 
#!/bin/sh
echo balance_level=core
# irqbalance -ol ./balance.sh

but the default behavior remains useless and confusing (since you could get the same result without running IRQbalance at all).

I think the best way to fix this would be to make the balance level for IRQ_OTHER default to the highest balance level with things to balance between. E.g. on a system with one package and two cache domains with four CPUs each, it would default to BALANCE_CACHE. On a system with one package, one cache domain, and three CPUs it would default to BALANCE_CORE.

I think it also would be nice to be able to specify the balance level for IRQ_OTHER interrupts via a command-line option, but this would still leave undesirable default behavior.


# irqbalance -do              
This machine seems not NUMA capable.
Prevent irq assignment to these isolated CPUs: 00000000
Prevent irq assignment to these adaptive-ticks CPUs: 00000000
Banned CPUs: 00000000
Package 0:  numa_node -1 cpu mask is 0000000f (load 0)
        Cache domain 0:  numa_node is -1 cpu mask is 0000000f  (load 0) 
                CPU number 3  numa_node is -1 (load 0)
                CPU number 1  numa_node is -1 (load 0)
                CPU number 2  numa_node is -1 (load 0)
                CPU number 0  numa_node is -1 (load 0)
IRQ arch_timer(11) guessed as class 0
IRQ zynqmp_ipi(14) guessed as class 0
IRQ fd070000.memory-controller(15) guessed as class 0
IRQ ams-irq(16) guessed as class 0
IRQ arm-pmu(18) guessed as class 0
IRQ arm-pmu(19) guessed as class 0
IRQ arm-pmu(20) guessed as class 0
IRQ arm-pmu(21) guessed as class 0
IRQ xuartps(22) guessed as class 0
IRQ zynqmp-dma(27) guessed as class 0
IRQ zynqmp-dma(28) guessed as class 0
IRQ zynqmp-dma(29) guessed as class 0
IRQ zynqmp-dma(30) guessed as class 0
IRQ zynqmp-dma(31) guessed as class 0
IRQ zynqmp-dma(32) guessed as class 0
IRQ zynqmp-dma(33) guessed as class 0
IRQ zynqmp-dma(34) guessed as class 0
IRQ zynqmp-dma(35) guessed as class 0
IRQ zynqmp-dma(36) guessed as class 0
IRQ zynqmp-dma(37) guessed as class 0
IRQ zynqmp-dma(38) guessed as class 0
IRQ zynqmp-dma(39) guessed as class 0
IRQ zynqmp-dma(40) guessed as class 0
IRQ zynqmp-dma(41) guessed as class 0
IRQ zynqmp-dma(42) guessed as class 0
IRQ fd4c0000.dma-controller(43) guessed as class 0
IRQ gpmmu, ppmmu0, ppmmu1, gp, pp0, pp1(46) guessed as class 0
IRQ ff050000.spi(47) guessed as class 0
IRQ ff0f0000.spi(48) guessed as class 0
IRQ cdns-i2c(49) guessed as class 0
IRQ sl28cpld-gpio-irq(50) guessed as class 0
IRQ sl28cpld-gpio-irq, sl28cpld-gpio-irq(51) guessed as class 0
IRQ abx8xx(52) guessed as class 0
IRQ mmc0(53) guessed as class 0
IRQ pwrseq:clkreq(54) guessed as class 0
IRQ nwl_pcie:misc(56) guessed as class 0
IRQ PCIe PME, aerdrv(59) guessed as class 0
IRQ fd4a0000.display(60) guessed as class 0
IRQ sfp-ge0-mod-def0(61) guessed as class 0
IRQ sfp-ge0-los(62) guessed as class 0
IRQ sfp-ge0-tx-fault(63) guessed as class 0
IRQ sfp-ge1-mod-def0(64) guessed as class 0
IRQ sfp-ge1-los(65) guessed as class 0
IRQ sfp-ge1-tx-fault(66) guessed as class 0
IRQ sfp-ge2-mod-def0(67) guessed as class 0
IRQ sfp-ge2-los(68) guessed as class 0
IRQ sfp-ge2-tx-fault(69) guessed as class 0
IRQ sfp-ge3-mod-def0(70) guessed as class 0
IRQ sfp-ge3-los(71) guessed as class 0
IRQ sfp-ge3-tx-fault(72) guessed as class 0
IRQ net0, net0(73) guessed as class 0
IRQ net1, net1(74) guessed as class 0
IRQ net2, net2(75) guessed as class 0
IRQ net3, net3(76) guessed as class 0
IRQ net4(77) guessed as class 0
IRQ net4(78) guessed as class 0
IRQ net7, net6, net5, net4(79) guessed as class 0
No directory /sys/devices/platform/axienet-80200000:02/: No such file or directory
IRQ axienet-80200000:02(80) guessed as class 0
No directory /sys/devices/platform/axienet-80200000:04/: No such file or directory
IRQ axienet-80200000:04(81) guessed as class 0
No directory /sys/devices/platform/axienet-80200000:05/: No such file or directory
IRQ axienet-80200000:05(82) guessed as class 0
No directory /sys/devices/platform/axienet-80200000:06/: No such file or directory
IRQ axienet-80200000:06(83) guessed as class 0
IRQ net5(84) guessed as class 0
IRQ net5(85) guessed as class 0
IRQ net6(86) guessed as class 0
IRQ net6(87) guessed as class 0
IRQ net7(88) guessed as class 0
IRQ net7(89) guessed as class 0
IRQ dwc3-otg(90) guessed as class 0
IRQ xhci-hcd:usb1(92) guessed as class 0
IRQ touch(93) guessed as class 0
IRQ swdisp(94) guessed as class 0
IRQ brtup(95) guessed as class 0
IRQ brtdn(96) guessed as class 0
IRQ src(97) guessed as class 0
IRQ fault-ack(98) guessed as class 0
IRQ fault-clr(99) guessed as class 0
IRQ sw1(100) guessed as class 0
IRQ sw2(101) guessed as class 0
IRQ sw3(102) guessed as class 0
IRQ sw4(103) guessed as class 0
IRQ power(104) guessed as class 0
IRQ cpu-reset(105) guessed as class 0
IRQ wdt-reset(106) guessed as class 0
Adding IRQ 59 to database
Adding IRQ 11 to database
Adding IRQ 14 to database
Adding IRQ 15 to database
Adding IRQ 16 to database
Adding IRQ 18 to database
Adding IRQ 19 to database
Adding IRQ 20 to database
Adding IRQ 21 to database
Adding IRQ 22 to database
Adding IRQ 27 to database
Adding IRQ 28 to database
Adding IRQ 29 to database
Adding IRQ 30 to database
Adding IRQ 31 to database
Adding IRQ 32 to database
Adding IRQ 33 to database
Adding IRQ 34 to database
Adding IRQ 35 to database
Adding IRQ 36 to database
Adding IRQ 37 to database
Adding IRQ 38 to database
Adding IRQ 39 to database
Adding IRQ 40 to database
Adding IRQ 41 to database
Adding IRQ 42 to database
Adding IRQ 43 to database
Adding IRQ 46 to database
Adding IRQ 47 to database
Adding IRQ 48 to database
Adding IRQ 49 to database
Adding IRQ 50 to database
Adding IRQ 51 to database
Adding IRQ 52 to database
Adding IRQ 53 to database
Adding IRQ 54 to database
Adding IRQ 56 to database
Adding IRQ 60 to database
Adding IRQ 61 to database
Adding IRQ 62 to database
Adding IRQ 63 to database
Adding IRQ 64 to database
Adding IRQ 65 to database
Adding IRQ 66 to database
Adding IRQ 67 to database
Adding IRQ 68 to database
Adding IRQ 69 to database
Adding IRQ 70 to database
Adding IRQ 71 to database
Adding IRQ 72 to database
Adding IRQ 73 to database
Adding IRQ 74 to database
Adding IRQ 75 to database
Adding IRQ 76 to database
Adding IRQ 77 to database
Adding IRQ 78 to database
Adding IRQ 79 to database
Adding IRQ 80 to database
Adding IRQ 81 to database
Adding IRQ 82 to database
Adding IRQ 83 to database
Adding IRQ 84 to database
Adding IRQ 85 to database
Adding IRQ 86 to database
Adding IRQ 87 to database
Adding IRQ 88 to database
Adding IRQ 89 to database
Adding IRQ 90 to database
Adding IRQ 92 to database
Adding IRQ 93 to database
Adding IRQ 94 to database
Adding IRQ 95 to database
Adding IRQ 96 to database
Adding IRQ 97 to database
Adding IRQ 98 to database
Adding IRQ 99 to database
Adding IRQ 100 to database
Adding IRQ 101 to database
Adding IRQ 102 to database
Adding IRQ 103 to database
Adding IRQ 104 to database
Adding IRQ 105 to database
Adding IRQ 106 to database
NUMA NODE NUMBER: -1
LOCAL CPU MASK: ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff

-----------------------------------------------------------------------------
Package 0:  numa_node -1 cpu mask is 0000000f (load 0)
        Cache domain 0:  numa_node is -1 cpu mask is 0000000f  (load 0) 
                CPU number 3  numa_node is -1 (load 0)
                CPU number 1  numa_node is -1 (load 0)
                CPU number 2  numa_node is -1 (load 0)
                CPU number 0  numa_node is -1 (load 0)
          Interrupt 59 node_num is -1 (legacy/0:0) 
  Interrupt 106 node_num is -1 (other/0:0) 
  Interrupt 105 node_num is -1 (other/0:0) 
  Interrupt 104 node_num is -1 (other/0:0) 
  Interrupt 103 node_num is -1 (other/0:0) 
  Interrupt 102 node_num is -1 (other/0:0) 
  Interrupt 101 node_num is -1 (other/0:0) 
  Interrupt 100 node_num is -1 (other/0:0) 
  Interrupt 99 node_num is -1 (other/0:0) 
  Interrupt 98 node_num is -1 (other/0:0) 
  Interrupt 97 node_num is -1 (other/0:0) 
  Interrupt 96 node_num is -1 (other/0:0) 
  Interrupt 95 node_num is -1 (other/0:0) 
  Interrupt 94 node_num is -1 (other/0:0) 
  Interrupt 93 node_num is -1 (other/0:0) 
  Interrupt 92 node_num is -1 (other/0:0) 
  Interrupt 90 node_num is -1 (other/0:0) 
  Interrupt 89 node_num is -1 (other/0:36052) 
  Interrupt 88 node_num is -1 (other/0:131726) 
  Interrupt 87 node_num is -1 (other/0:0) 
  Interrupt 86 node_num is -1 (other/0:0) 
  Interrupt 85 node_num is -1 (other/0:0) 
  Interrupt 84 node_num is -1 (other/0:0) 
  Interrupt 83 node_num is -1 (other/0:0) 
  Interrupt 82 node_num is -1 (other/0:0) 
  Interrupt 81 node_num is -1 (other/0:0) 
  Interrupt 80 node_num is -1 (other/0:0) 
  Interrupt 79 node_num is -1 (other/0:0) 
  Interrupt 78 node_num is -1 (other/0:0) 
  Interrupt 77 node_num is -1 (other/0:0) 
  Interrupt 76 node_num is -1 (other/0:0) 
  Interrupt 75 node_num is -1 (other/0:0) 
  Interrupt 74 node_num is -1 (other/0:642899) 
  Interrupt 73 node_num is -1 (other/0:0) 
  Interrupt 72 node_num is -1 (other/0:0) 
  Interrupt 71 node_num is -1 (other/0:0) 
  Interrupt 70 node_num is -1 (other/0:0) 
  Interrupt 69 node_num is -1 (other/0:0) 
  Interrupt 68 node_num is -1 (other/0:0) 
  Interrupt 67 node_num is -1 (other/0:0) 
  Interrupt 66 node_num is -1 (other/0:0) 
  Interrupt 65 node_num is -1 (other/0:0) 
  Interrupt 64 node_num is -1 (other/0:0) 
  Interrupt 63 node_num is -1 (other/0:0) 
  Interrupt 62 node_num is -1 (other/0:0) 
  Interrupt 61 node_num is -1 (other/0:0) 
  Interrupt 60 node_num is -1 (other/0:0) 
  Interrupt 56 node_num is -1 (other/0:0) 
  Interrupt 54 node_num is -1 (other/0:0) 
  Interrupt 53 node_num is -1 (other/0:312) 
  Interrupt 52 node_num is -1 (other/0:0) 
  Interrupt 51 node_num is -1 (other/0:0) 
  Interrupt 50 node_num is -1 (other/0:0) 
  Interrupt 49 node_num is -1 (other/0:80) 
  Interrupt 48 node_num is -1 (other/0:0) 
  Interrupt 47 node_num is -1 (other/0:0) 
  Interrupt 46 node_num is -1 (other/0:418) 
  Interrupt 43 node_num is -1 (other/0:1156) 
  Interrupt 42 node_num is -1 (other/0:0) 
  Interrupt 41 node_num is -1 (other/0:0) 
  Interrupt 40 node_num is -1 (other/0:0) 
  Interrupt 39 node_num is -1 (other/0:0) 
  Interrupt 38 node_num is -1 (other/0:0) 
  Interrupt 37 node_num is -1 (other/0:0) 
  Interrupt 36 node_num is -1 (other/0:0) 
  Interrupt 35 node_num is -1 (other/0:0) 
  Interrupt 34 node_num is -1 (other/0:0) 
  Interrupt 33 node_num is -1 (other/0:0) 
  Interrupt 32 node_num is -1 (other/0:0) 
  Interrupt 31 node_num is -1 (other/0:0) 
  Interrupt 30 node_num is -1 (other/0:0) 
  Interrupt 29 node_num is -1 (other/0:0) 
  Interrupt 28 node_num is -1 (other/0:0) 
  Interrupt 27 node_num is -1 (other/0:0) 
  Interrupt 22 node_num is -1 (other/0:58) 
  Interrupt 21 node_num is -1 (other/0:0) 
  Interrupt 20 node_num is -1 (other/0:0) 
  Interrupt 19 node_num is -1 (other/0:0) 
  Interrupt 18 node_num is -1 (other/0:0) 
  Interrupt 16 node_num is -1 (other/0:0) 
  Interrupt 15 node_num is -1 (other/0:0) 
  Interrupt 14 node_num is -1 (other/0:0) 
  Interrupt 11 node_num is -1 (other/0:8261) 
nhorman commented 1 month ago

Its probably just simpler to map IRQ_OTHER to the cache level universally, you can try with this patch:

--- a/classify.c
+++ b/classify.c
@@ -26,7 +26,7 @@ char *classes[] = {
 };

 static int map_class_to_level[8] =
-{ BALANCE_PACKAGE, BALANCE_CACHE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE };
+{ BALANCE_CACHE, BALANCE_CACHE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE };

If it works for you ,feel free to open a PR for it.

Though, This isn't really a problem about picking the best balance level, its more a inherent problem with identifying what these platform devices are (or more specifically, what their interrupt volume is going to be), and theres no good solution for that, save for a policy script

Take a look at your debug output. You have about 100 irqs there. two of them have over 100k events on them (likely ethernet devices). two have a few thousand (probably some gpio bus or some such), and 3 have a few hundred (maybe serial ports or other low throughput io). The rest have no volume at all. What irqbalance should do is place the first two on their own cores, place the second two at the cache level, and honestly, depending on the use case, just leave the rest of them alone, or affine them to the package level, because selecting anything more specific is effectively a no-op, as their not impacting cpu usage at all.

Even if the patch above makes your life a bit easier, you're going to want to write a policy script that isolates the high volume interrupts above, and either balances them at the appropriate level, or masks them from irqbalance entirely and manually sets their affinity to the core you want, depending on how specific you want to be.

sean-anderson-seco commented 1 month ago

Its probably just simpler to map IRQ_OTHER to the cache level universally, you can try with this patch:

This will not solve the issue, since (as shown in the above output) there is only one cache domain.

Though, This isn't really a problem about picking the best balance level, its more a inherent problem with identifying what these platform devices are (or more specifically, what their interrupt volume is going to be), and theres no good solution for that, save for a policy script

It doesn't really matter what they are. Since this system has only one package and cache level, BALANCE_PACKAGE and BALANCE_CACHE are identical to BALANCE_NONE. If all IRQs are labeled BALANCE_NONE, this is the same as not running irqbalance at all. So I don't think it matters what kind of IRQs they are; on these kinds of systems we want to use BALANCE_CORE (unless excluded by policy).

Take a look at your debug output. You have about 100 irqs there. two of them have over 100k events on them (likely ethernet devices). two have a few thousand (probably some gpio bus or some such), and 3 have a few hundred (maybe serial ports or other low throughput io). The rest have no volume at all. What irqbalance should do is place the first two on their own cores, place the second two at the cache level, and honestly, depending on the use case, just leave the rest of them alone, or affine them to the package level, because selecting anything more specific is effectively a no-op, as their not impacting cpu usage at all.

Agreed.

Even if the patch above makes your life a bit easier, you're going to want to write a policy script that isolates the high volume interrupts above, and either balances them at the appropriate level, or masks them from irqbalance entirely and manually sets their affinity to the core you want, depending on how specific you want to be.

I don't want to manually set the affinity, because the final usage is unknown. It's entirely possible for the user to connect with different Ethernet ports than what I used in the example above. In that case, a static affinity assignment would be counterproductive, since it could inadvertently place the "active" IRQs on the same cores.

nhorman commented 1 month ago

right, what you need is some mechanism to identify the type of device that interrupt is connected to, and platform devices on arm give you nothing or very little to make that determination. What you want is a script like this:

#bin/bash
PATH_TO_DEVICE=$1

DEVICE_TYPE=$(some command to determine if this is an ethernet or other high volume device)

if [ "$DEVICE_TYPE" == 'High volume device" ]
then
   echo balance_level=core
else
    echo balance_level=package
fi

Just selecting a default of "first topology level that has more than one element" seems like it might be reasonable in this situation, but I'm having a hard time seeing how it makes anything less confusing. You have lots of interrupts whos profile can't be categorized, thats the real problem. I get that having them all balanced at the level of PACKAGE is a bit confusing, but I would find it more confusing to see a high volume interrupt balanced at some level other than CORE. I get in your scenario, it works out to be equivalent since you only have N cores that share a single cache domain, but in other scenarios that may not be the case.

sean-anderson-seco commented 1 month ago

right, what you need is some mechanism to identify the type of device that interrupt is connected to

Why do you need this? What does it matter if a device is "high-volume" or not?

nhorman commented 1 month ago

because setting affinity for irqs that don't need to be balanced to as fine grained a level takes up space the apic table for lots of architectures. Lots of systems out there have thousands of interrupts that get almost no assertions, and you can only affine so many irqs to a cpu.

sean-anderson-seco commented 1 month ago

OK, but on this device there is no APIC table. Every interrupt can be individually enabled for any particular core using dedicated registers. And the default Linux behavior with the affinity set to all cores is to just pick a core to send all the interrupts to [1]. So there is not really a point in trying to determine the function of the interrupt, since we can just balance them all for "free".

[1] As I understand it, the GIC's interrupt distribution capabilities are not very good.

nhorman commented 1 month ago

yes, on your system. Irqbalance runs on a wide range of systems, so I'm really hesitant to do thing like this that will work for you, but may have a significant negative impact on other systems.

I'll tell you what. I have a wide variety of systems here to test on (though I don't currently have an arm based one on hand). If you want to write a PR for this, I'll happily test it on what I have here. If there is no negative impact, I'll pull it in.

It should be fairly straightforward to do, I think: 1) Add a define for BALANCE_SPECIAL (akin to BALANCE_CORE/CACHE/PACKAGE), name it whatever you think is appropriate 2) In find_best_object_for_irq, add a check for BALANCE_SPECIAL on the irq_info->level. If the irq is BALANCE_SPECIAL, skip the case statement check and execute find_best_object 3) in the call to for_each_object(...find_best_object), gate that on another check for BALANCE_SPECIAL, and if its true, in stead set info->level to the type of the child object if the number of children is greater than 1.