No space left on device

balrog-kun commented 5 months ago

In activate_mapping() -ENOSPC is treated as a transient error. However we recently see tens of these on every loop (every 10s) until irqbalance is restarted. I imagine the reason they disappear after restart is because there are some interrupts that are only triggered during boot.

But this leads me to question whether there is a mechanism in irqbalance that would prevent too may IRQs being piled on one core. In theory after enough IRQs are moved to a core, the load on it should go up and it should become less atractive to the logic in irqbalance. However if these IRQs generate little load, is there anything that prevents irqbalance getting stuck trying to move everything to a few cores as I'm seeing?

I checked 2 machines that showed this problem, one had 288 cores and 2400-2800 interrupts in /proc/irq/ depending on drivers, the other had 256 cores and ~1370 interrupts. I inspected some structures in the kernel (not my area) that enforced the per-core limits and found it was arch/x86/kernel/apic/vector.c making use strucures and utilities in kernel/irq/matrix.c. Each core's cpumap->available was initially at 203 but when irq_set_affinity() was returning -ENOSPC, cpumap->available was at 0 for those cores.

nhorman commented 5 months ago

yeah, theres not a lot we can do about this situation. It was discovered a few years ago, And theres no real fix for it. See: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=743dac494d61d991967ebcfab92e4f80dc7583b3

Setting affinity returns ENOSPC on the open descriptor when the APIC that you're using doesn't have enough free space to store a new affinity mask. It happens when a given system has a large number of Interrupt sources and we can't affine them all. You should also see warnings in your systemd journal indicating as much.

Your best workaround is to create a policy script that selects which interrupts are least important to you, and ignore them. The remainder can be affined automatically

balrog-kun commented 5 months ago

Thanks, I guess for now we're going to use --banmod on one specific driver that registers the majority of all the irqs on those systems. FTR I haven't see the "Affinity broken due to vector space exhaustion" message on either machine. It should be printed when the vector is activated, when is that?

As far as what could be done, cpumap->available is currently completely internal to kernel/irq/matrix.c. It is shown in a human-readable format if CONFIG_GENERIC_IRQ_DEBUGFS is enabled. I wonder if it could be exposed in a more useful way so that irqbalance could know that a given core is at its capacity and assign the IRQs to the closest candidate.

In our case the main issue apparently was a performance impact of all the irqbalance errors going into the logs, not fact that the irqs affinity was suboptimal.

nhorman commented 5 months ago

vectors usually get activated prior to return from the set affinity function, but on some apics, the hardware IIRC requires occasional deferral until various hardware conditions are met, check me on that though.

I'm not sure about the availability map, for a few reassons: 1) I'm not 100% sure it actually reflects the availability of a cpu in terms of having space in the LAPIC to accept a new entry 2) Its something of an NP complete problem. That is to say, the status of APIC capacity changes with respect to the migrations we perform, so we will never be able to optimize the mapping

Its interesting that you didn't see the kernel error messages, I would have expected that. I presume your log is set to see WARN level messages? It should be by default. That suggests that there some other condition triggering an ENOSPC return on writes to the irq affinity file. do you see any other references in the log to CPU and vector failures? I'm looking through the various APIC drivers and there are a few conditions under which ENOSPC will be returned for certain bits of hardware

balrog-kun commented 5 months ago

I confirmed, using kgdb, that cm->available is at 0 and it is irq_matrix_alloc returning -ENOSPC.

I understand that when an error is returned from the write() into smp_affinity, the write should have no side effects, meaning that the irq's affinity hasn't changed. This could explain why there's no warning about the effective_affinity not matching smp_affinity but I didn't confirm this. The commit says there's a silent fallback to a wider set of CPUs, where should this happen? Would userspace still see the -ENOSPC in this case?

nhorman commented 5 months ago

You'll have to dig into the kernel code to find that

balrog-kun commented 5 months ago

With irqbalance running and showing the ENOSPC errors, I did the following, mimicking the write that irqbalance was trying to run:

[root@fl41ca201es0502 irqbalance]# cat /proc/irq/4284/smp_affinity
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0**8**000000
[root@fl41ca201es0502 irqbalance]# cat /proc/irq/4284/effective_affinity
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0**8**000000
[root@fl41ca201es0502 irqbalance]# echo 0000**1**000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 > /proc/irq/4284/smp_affinity
-bash: echo: write error: No space left on device
[root@fl41ca201es0502 irqbalance]# cat /proc/irq/4284/effective_affinity
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0**8**000000
[root@fl41ca201es0502 irqbalance]# cat /proc/irq/4284/smp_affinity
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0**8**000000

(the **s were added by me)

So there was no side effect of the failed write, this would by why the affinity mask still matches the effective mask and there's no warning.

It seems that we hit the issue mostly only if that one PCIe driver module (which I now realize registers about 10x the number of interrupts the whole system has without it), is inserted after irqbalance start, and not if irqbalance is restarted after the module load. I wonder if this is because there are a lot of interrupts that irqbalance is not touching, but that are by default assigned to just those cores that irqbalance selected for the few hundred interrupts that it does touch.

I believe that only the 1s in effective_affinity count towards the 256 bits limit in vector.c, not those in smp_affinity since that would trigger the issue even more easily.

balrog-kun commented 5 months ago

If that was the case, the least pretty workaround would be to restart irqbalance after this module is inserted. The slightly better workaround would be to assign all those newly added interrupts to some cores but avoiding touching those that irqbalance is rebalancing anyway and avoiding those cores that irqbalance is rebalancing to, so a little complicated for a shell script. The fix would be for irqbalance to do this, not sure if there's any major technical difficulty with this, I'll see how difficult it would be to do.

balrog-kun commented 5 months ago

I used a script to count interrupts that are on a each core according to effective_affinity and tally numbers of cores by "load" (count of interrupts), used like this: cat /proc/irq/*/effective_affinity | ./per-cpu.py

I compared the scenario of driver module loading while irqbalance is running vs. not running, repeated a few times. It seems that irqbalance unnecessarily does something on this hotplug-like event that worsens the balance:

Scenario 1:

driver module unloaded and irqbalance not running: 1 core(s) have 30 interrupts 1 core(s) have 3 interrupts 38 core(s) have 2 interrupts 151 core(s) have 1 interrupts 97 core(s) have 0 interrupts
after module loads and binds to PCIe devices: 1 core(s) have 46 interrupts 1 core(s) have 11 interrupts 39 core(s) have 10 interrupts 136 core(s) have 9 interrupts 75 core(s) have 8 interrupts 1 core(s) have 6 interrupts 19 core(s) have 5 interrupts 16 core(s) have 4 interrupts
after irqbalance starts: 1 core(s) have 34 interrupts 25 core(s) have 13 interrupts 66 core(s) have 12 interrupts 53 core(s) have 11 interrupts 2 core(s) have 7 interrupts 23 core(s) have 6 interrupts 80 core(s) have 5 interrupts 36 core(s) have 4 interrupts 2 core(s) have 3 interrupts

Scenario 2:

driver module is unloaded and irqbalance is running: 1 core(s) have 30 interrupts 4 core(s) have 3 interrupts 28 core(s) have 2 interrupts 162 core(s) have 1 interrupts 93 core(s) have 0 interrupts
after module loads and binds to PCIe devices (doesn't improve over time): 2 core(s) have 200 interrupts 4 core(s) have 199 interrupts 1 core(s) have 33 interrupts 8 core(s) have 27 interrupts 8 core(s) have 26 interrupts 2 core(s) have 25 interrupts 1 core(s) have 24 interrupts 2 core(s) have 23 interrupts 2 core(s) have 22 interrupts 1 core(s) have 21 interrupts 1 core(s) have 9 interrupts 2 core(s) have 7 interrupts 2 core(s) have 6 interrupts 1 core(s) have 5 interrupts 21 core(s) have 4 interrupts 89 core(s) have 3 interrupts 81 core(s) have 2 interrupts 45 core(s) have 1 interrupts 15 core(s) have 0 interrupts
after irqbalance restart: 1 core(s) have 34 interrupts 23 core(s) have 13 interrupts 71 core(s) have 12 interrupts 49 core(s) have 11 interrupts 1 core(s) have 10 interrupts 4 core(s) have 7 interrupts 19 core(s) have 6 interrupts 80 core(s) have 5 interrupts 40 core(s) have 4 interrupts

nhorman commented 5 months ago

what are the interrupt counts like on those newly registered interrupts? irqblance estimates the load on the cpu of the cumulative number of interrupts affined to each core. If the 200 interrupts on those first two cores in scenario 2 don't generate any load, irqbalance isn't going to see alot of need in rebalancing them

If you want to ensure a blank slate when rebalancing after a module insert, the best solution might be a system-level approach - i.e. add a udev rule to restart irqbalance whenever a module load event occurs

balrog-kun commented 5 months ago

what are the interrupt counts like on those newly registered interrupts?

Without irqbalance: 1 core(s) have 16 interrupts 6 core(s) have 9 interrupts 245 core(s) have 8 interrupts 2 core(s) have 5 interrupts 34 core(s) have 4 interrupts

irqbalance started after module load: 12 core(s) have 12 interrupts 132 core(s) have 11 interrupts 9 core(s) have 5 interrupts 131 core(s) have 4 interrupts 3 core(s) have 3 interrupts 1 core(s) have 2 interrupts

module loaded after irqbalance started: 4 core(s) have 199 interrupts 1 core(s) have 75 interrupts 2 core(s) have 73 interrupts 1 core(s) have 72 interrupts 1 core(s) have 71 interrupts 1 core(s) have 66 interrupts 1 core(s) have 60 interrupts 1 core(s) have 58 interrupts 1 core(s) have 7 interrupts 1 core(s) have 6 interrupts 5 core(s) have 5 interrupts 93 core(s) have 4 interrupts 140 core(s) have 3 interrupts 2 core(s) have 1 interrupts 34 core(s) have 0 interrupts

FTR this were the commands:

ls /sys/bus/pci/drivers/ice/*/msi_irqs/ | sort | uniq > ice-irqs
for i in `cat ice-irqs`; do cat /proc/irq/$i/effective_affinity 2> /dev/null; done | ./per-cpu.py

irqblance estimates the load on the cpu of the cumulative number of interrupts affined to each core. If the 200 interrupts on those first two cores in scenario 2 don't generate any load, irqbalance isn't going to see alot of need in rebalancing them

My understanding is that the reason there are 200 interupts on them is because irqbalance has moved them there. Any idea why this big a difference between the interrupts being added before and after irqbalance start?

If you want to ensure a blank slate when rebalancing after a module insert, the best solution might be a system-level approach - i.e. add a udev rule to restart irqbalance whenever a module load event occurs

I was trying to do this exactly, with some delay between the module loading and irqbalance restart because it may take some time for the interrupts to be registered. But it seemed like irqbalance code itself can (should?) be made to achieve the same thing more efficiently.

nhorman commented 5 months ago

I'm sorry, I'm not being clear - when I say what are the interrupt counts like, I'm asking how many times have those interrupts been asserted, and how much cpu time did those irqs consume?

Yes irqbalance moved them to a given core, bug if the interrupts are registered but never consume cpu time, it doesn't really matter where they are

balrog-kun commented 5 months ago

Here's how many times they were asserted 20 seconds after the module was inserted: 1 irqs triggered 11256 times 1 irqs triggered 7677 times 1 irqs triggered 1163 times 1 irqs triggered 1160 times 1 irqs triggered 857 times 1 irqs triggered 736 times 1 irqs triggered 466 times 1 irqs triggered 459 times 1 irqs triggered 239 times 1 irqs triggered 202 times 1 irqs triggered 151 times 1 irqs triggered 146 times 1 irqs triggered 137 times 1 irqs triggered 81 times 1 irqs triggered 70 times 1 irqs triggered 60 times 1 irqs triggered 52 times 1 irqs triggered 48 times 1 irqs triggered 47 times 1 irqs triggered 33 times 2 irqs triggered 31 times 1 irqs triggered 30 times 1 irqs triggered 28 times 1 irqs triggered 24 times 1 irqs triggered 22 times 2 irqs triggered 19 times 1 irqs triggered 17 times 2 irqs triggered 16 times 2 irqs triggered 15 times 3 irqs triggered 14 times 7 irqs triggered 13 times 6 irqs triggered 12 times 14 irqs triggered 11 times 6 irqs triggered 10 times 9 irqs triggered 9 times 16 irqs triggered 8 times 12 irqs triggered 7 times 5 irqs triggered 6 times 13 irqs triggered 5 times 30 irqs triggered 4 times 12 irqs triggered 3 times 38 irqs triggered 2 times 102 irqs triggered 1 times 1870 irqs triggered 0 times

I don't have the times, I could get the approximation that irqbalance uses but I think we get the picture from the above numbers.

Yes irqbalance moved them to a given core, bug if the interrupts are registered but never consume cpu time, it doesn't really matter where they are

Why move them then? It does matter in respect to this 200 irqs on cpu limit, I wonder if the load calculation in irqbalance could have this added as a factor. The easiest way would be to set the load to maximum when it's close to the limit. The limit is apparently 200 on x86 due to how this is calcualted (256 - number of managed interrupts) but until it can be autodetected it could be set to a sane default like 100 and overridden with on command line. That would allow distributions to set the right value without individual machine user having to care.

nhorman commented 5 months ago

What I'm looking for is the number of time each specific irq was triggered. The output of /proc/interrupts should provide that

balrog-kun commented 5 months ago

I uploaded gzipped /proc/interrupts from about 20s after inserting the module to https://privatebin.io/?ca1aa4053d196458#9Xu9esx15wU2Mcc7mtBDZJ989eZj5yrc8xZGziKmuTEw

My previous comment was a more readable summary of /proc/interrupts assuming you didn't care about the names/numbers of individual IRQs...

nhorman commented 5 months ago

Thank you. I assume its your network card, driven by the ice driver thats in question here? If you look at the proc/interrupts file, for any irq that part of the ice driver (ice-ensXnpY-TxRx-Z), you can see that the number of times those irqs are triggered are zero accross all cpus (or very close to it). IRQbalance is not seeing any cpu load from those irqs, so its leaving them alone until such time as it can determine if they are causing undue load on a given cpu. If you restart irqbalance, it will reset the irq database and give you a more even distribution, but its not really needed after a module reload.

If you start sending high volumes of traffic to one of your ice driver driven network interfaces, you'll see the number of irq triggers go up on the associated queue, and irqbalance will attempt to move it to a less loaded cpu (if one exists)

balrog-kun commented 5 months ago

Thanks for the analysis. Yes, the "ice-*" are the newly added interrupts. (In the summary, where I wrote "1870 irqs triggered 0 times", I included only the interrupts from this driver.)

What I'm seeing though is that irqbalance is in fact moving some or all of these interrupts. I'll try to instrument irqbalance to prove it, but you can see in one of my earlier comments that when I insert the ice module without irqbalance running, the core with the most interrupts on it, has 46. If irqbalance is running, 6 of the cores each have 199 or 200, so at least ~1100 interrupts have been immediately moved by irqbalance.

balrog-kun commented 4 months ago

So I've gone through the placement.c code yesterday and I can see what's happening and I think a few things are wrong.

First of all @nhorman said that IRQs that haven't triggered yet, or have 0 load, should be left alone by irqbalance. But they're not, they're rebalanced both on start and on a hotplug-like event. I think it would make sense not to touch them as basically it's pointless. I'll see if this is easy to fix. Regarding the ENOSPC, it should also work around the issue in my case although it's not a proper fix.
I confirmed that the reason that irqbalance tries to put so many of the "hotplugged" IRQs on just a few cores, or cache domains, is that due to normal randomess some cores or cache domains will have a lower load and the load does not grow when low-load IRQs are moved to them, or it grows way too little to have any effect. So they keep being the best candidate for the next IRQ. The main problem here is that the kernel limits the number of IRQs allowed on a core while irqbalance does not. This sounds like it has to backfire eventually, and with irqbalance being installed by default by a lot of distros this seems to be a significant issue. Instead of improving performance which I believe is irqbalance's goal it worsens performace by flooding the logs. This could be fixed by adding a configurable limit that could be enforced in placement.c:find_best_object() or be factored into d->load.
This is aggravated by activate_mapping() treating ENOSPC as a transient error which it is not. It leaves info->moved at 1 on error and because some of these IRQs may never see any load, they'll never be put back onto rebalance_irq_list and every few secs activate_mapping() will reattempt moving them basically forever. We could easily check if moved is still at 1 after activate_mapping, force rebalancing of these IRQs, remove them from assigned_obj->interrupts and subtract the load from assigned_obj->load and parents. Ideally though, when we hit ENOSPCfor a given core or for a given cache domain, we should save the number of interrupts that we were able to assign to that core without error and treat that as rough maximum in the future. Or at least set a flag on the topo_obj, e.g. assigned_obj->full, that would be cleared only when anything is moved away from the topo_obj.
Why then doesn't this happen on irqbalance start? Basically it lucks out. During the first iteration of placement.c:calculate_placement() every IRQ's load is 0, even if trigger count is non-zero (info->irq_count - info->last_irq_count.) The load on all topo_obj's is 0 too. So placement.c:find_best_object() sees the loads are equal when comparing candidates and falls back to using the number of interrupts assigned so for. This means that all IRQs are initially spread evenly, over all cores and cache domains regardless of the load. All cores have the exact same number of interrupts (+/- 1) on them. This doesn't make much sense either, I wonder if the first call to calculate_placement should be slightly delayed until loads are available.
This also make you wonder whether the classification (IRQ_ETH, IRQ_SCSI, etc.) has any positive effect the way that the classes are used in placement.c. On the face of it the classification feature makes a lot of sense. But placement.c only uses the class to decide the order in which the IRQs are rebalanced. So basically the CPUs and/or cache domains are treated as buckets. Interrupts are sorted by their class and then by load within classes. In that order, each IRQ is put in the least loaded bucket but without regard to what class IRQs are already in that bucket. That bucket is now fuller but as other buckets fill, it becomes the least full again. The next class IRQs are randomly added to those same buckets and will mix. I don't think sorting rebalance_irq_list has the desired effect or any real effect, and that is the only way that the classes are used. Instead perhaps find_best_object should be selecting objects with the highest number of IRQs in the same class, as long as the load on the object is under a threshold. Since rebalancing IRQs shouldn't have a great effect on the total load across all objects (only on IRQ latency), we know the average load we exepect on all object (i.e total load divided by number of objects) and that could be used as a threshold. Might be worth moving to a new github issue.

Irqbalance / irqbalance

No space left on device #303