Irqbalance / irqbalance

The irqbalance source tree - The new official site for irqbalance
http://irqbalance.github.io/irqbalance/
GNU General Public License v2.0
576 stars 139 forks source link

irqbalance.service fails to start at boot time or manually. #313

Closed sincorchetes closed 2 months ago

sincorchetes commented 3 months ago

Distribution: Clear Linux with the latest patches irqbalance version: 1.9.3 kernel: 6.8.10-1434.native systemd: systemd opts: +PAM +AUDIT -SELINUX -APPARMOR +IMA -SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 -IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK -PCRE2 +PWQUALITY +P11KIT -QRENCODE +TPM2 +BZIP2 -LZ4 -XZ +ZLIB +ZSTD -BPF_FRAMEWORK +XKBCOMMON +UTMP -SYSVINIT default-hierarchy=hybrid

○ irqbalance.service - irqbalance daemon
     Loaded: loaded (/usr/lib/systemd/system/irqbalance.service; disabled; preset: disabled)
     Active: inactive (dead) since Thu 2024-05-30 11:29:15 CEST; 7s ago
       Docs: man:irqbalance(1)
             https://github.com/Irqbalance/irqbalance
    Process: 19668 ExecStart=/usr/sbin/irqbalance --oneshot (code=exited, status=0/SUCCESS)
   Main PID: 19668 (code=exited, status=0/SUCCESS)

May 30 11:29:15 cr-lan irqbalance[19668]: Cannot change IRQ 165 affinity: Input/output error
May 30 11:29:15 cr-lan irqbalance[19668]: IRQ 165 affinity is now unmanaged
May 30 11:29:15 cr-lan irqbalance[19668]: Cannot change IRQ 171 affinity: Input/output error
May 30 11:29:15 cr-lan irqbalance[19668]: IRQ 171 affinity is now unmanaged
May 30 11:29:15 cr-lan irqbalance[19668]: Cannot change IRQ 168 affinity: Input/output error
May 30 11:29:15 cr-lan irqbalance[19668]: IRQ 168 affinity is now unmanaged
May 30 11:29:15 cr-lan irqbalance[19668]: Cannot change IRQ 166 affinity: Input/output error
May 30 11:29:15 cr-lan irqbalance[19668]: IRQ 166 affinity is now unmanaged
May 30 11:29:15 cr-lan systemd[1]: irqbalance.service: Deactivated successfully.
May 30 11:29:15 cr-lan systemd[1]: Finished irqbalance.service.

Log of journalctl

May 28 23:59:51 cr-lan systemd[1]: Starting irqbalance.service...
May 28 23:59:51 cr-lan irqbalance[620]: Prevent irq assignment to these isolated CPUs: 00000000
May 28 23:59:51 cr-lan irqbalance[620]: Prevent irq assignment to these adaptive-ticks CPUs: 00000000
May 28 23:59:51 cr-lan irqbalance[620]: Prevent irq assignment to these thermal-banned CPUs: 00000000
May 28 23:59:51 cr-lan irqbalance[620]: Banned CPUs: 00000000
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 122 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 154 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 152 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 150 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 149 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 155 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 153 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 151 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 148 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 124 to database
May 28 23:59:51 cr-lan irqbalance[620]: Invalid PCI class code 16711680
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 146 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 125 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 123 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 136 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 126 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 144 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 134 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 142 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 132 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 140 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 130 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 139 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 129 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 137 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 127 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 135 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 143 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 133 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 141 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 131 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 138 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 128 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 164 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 170 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 169 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 167 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 147 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 165 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 171 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 168 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 166 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 145 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 162 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 160 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 159 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 157 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 163 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 161 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 158 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 156 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 1 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 8 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 9 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 14 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 120 to database
May 28 23:59:51 cr-lan irqbalance[620]: Adding IRQ 121 to database
May 28 23:59:51 cr-lan irqbalance[620]: IRQ 154 is removed from interrupts_db.
May 28 23:59:51 cr-lan irqbalance[620]: IRQ 152 is removed from interrupts_db.
May 28 23:59:51 cr-lan irqbalance[620]: IRQ 150 is removed from interrupts_db.
May 28 23:59:51 cr-lan irqbalance[620]: IRQ 149 is removed from interrupts_db.
May 28 23:59:51 cr-lan irqbalance[620]: IRQ 155 is removed from interrupts_db.
May 28 23:59:51 cr-lan irqbalance[620]: IRQ 153 is removed from interrupts_db.
May 28 23:59:51 cr-lan irqbalance[620]: IRQ 151 is removed from interrupts_db.
May 28 23:59:51 cr-lan irqbalance[620]: IRQ 162 is removed from interrupts_db.
May 28 23:59:51 cr-lan irqbalance[620]: IRQ 160 is removed from interrupts_db.
May 28 23:59:51 cr-lan irqbalance[620]: IRQ 159 is removed from interrupts_db.
May 28 23:59:51 cr-lan irqbalance[620]: IRQ 157 is removed from interrupts_db.
May 28 23:59:51 cr-lan irqbalance[620]: IRQ 163 is removed from interrupts_db.
May 28 23:59:51 cr-lan irqbalance[620]: IRQ 161 is removed from interrupts_db.
May 28 23:59:51 cr-lan irqbalance[620]: IRQ 158 is removed from interrupts_db.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: received group id (3).
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 28 23:59:51 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 29 00:00:01 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 29 00:00:01 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 29 00:00:01 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 29 00:00:01 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 29 00:00:01 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 29 00:00:01 cr-lan irqbalance[620]: thermal: no CPU capacity change.
May 29 00:00:01 cr-lan irqbalance[620]: -----------------------------------------------------------------------------
May 29 00:00:01 cr-lan irqbalance[620]: Adding IRQ 16 to database
May 29 00:00:01 cr-lan irqbalance[620]: Hotplug dev irq: 16 finished.
May 29 00:00:01 cr-lan irqbalance[620]: Adding IRQ 27 to database
May 29 00:00:01 cr-lan irqbalance[620]: Hotplug dev irq: 27 finished.
May 29 00:00:01 cr-lan irqbalance[620]: Adding IRQ 172 to database
May 29 00:00:01 cr-lan irqbalance[620]: Hotplug dev irq: 172 finished.
May 29 00:00:01 cr-lan irqbalance[620]: Adding IRQ 173 to database
May 29 00:00:01 cr-lan irqbalance[620]: Adding IRQ 174 to database
May 29 00:00:01 cr-lan irqbalance[620]: Hotplug dev irq: 174 finished.
May 29 00:00:01 cr-lan irqbalance[620]: Adding IRQ 175 to database
May 29 00:00:01 cr-lan irqbalance[620]: Hotplug dev irq: 175 finished.
May 29 00:00:01 cr-lan irqbalance[620]: Adding IRQ 176 to database
May 29 00:00:01 cr-lan irqbalance[620]: Hotplug dev irq: 176 finished.
May 29 00:00:01 cr-lan irqbalance[620]: Cannot change IRQ 164 affinity: Input/output error
May 29 00:00:01 cr-lan irqbalance[620]: IRQ 164 affinity is now unmanaged
May 29 00:00:01 cr-lan irqbalance[620]: Cannot change IRQ 167 affinity: Input/output error
May 29 00:00:01 cr-lan irqbalance[620]: IRQ 167 affinity is now unmanaged
May 29 00:00:01 cr-lan irqbalance[620]: Cannot change IRQ 165 affinity: Input/output error
May 29 00:00:01 cr-lan irqbalance[620]: IRQ 165 affinity is now unmanaged
May 29 00:00:01 cr-lan irqbalance[620]: Cannot change IRQ 171 affinity: Input/output error
May 29 00:00:01 cr-lan irqbalance[620]: IRQ 171 affinity is now unmanaged
May 29 00:00:01 cr-lan irqbalance[620]: Cannot change IRQ 168 affinity: Input/output error
May 29 00:00:01 cr-lan irqbalance[620]: IRQ 168 affinity is now unmanaged
May 29 00:00:01 cr-lan irqbalance[620]: Cannot change IRQ 166 affinity: Input/output error
May 29 00:00:01 cr-lan irqbalance[620]: IRQ 166 affinity is now unmanaged
May 29 00:00:01 cr-lan systemd[1]: irqbalance.service: Deactivated successfully.
May 29 00:00:01 cr-lan systemd[1]: Finished irqbalance.service.
-- Boot 703a98af755343d69eb10d352a90f265 --

lscpu

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  Model name:             11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
    CPU family:           6
    Model:                140
    Thread(s) per core:   2
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             1
    CPU(s) scaling MHz:   38%
    CPU max MHz:          4700.0000
    CPU min MHz:          400.0000
    BogoMIPS:             5606.40
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology 
                          nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_l
                          m abm 3dnowprefetch cpuid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx 
                          smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req vnm
                          i avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear ibt flush_l1d arch_capabilities
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    192 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
  L2:                     5 MiB (4 instances)
  L3:                     12 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7
Vulnerabilities:          
  Gather data sampling:   Mitigation; Microcode
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
  Srbds:                  Not affected
  Tsx async abort:        Not affected
nhorman commented 3 months ago

I'm not sure what problem you are describing here. I see you have 6 irqs that could not have their affinity set, which stems from an I/O error returned from the kernel when their smp_affinity proc files are written. I can't tell you why those irqs return errors when writing, but I suspect the APIC doesn't allow affinity to be set for those for some reason.

can you manually set the affinity for those irqs? If not, then this is either a kernel issue, or working as designed based on what the hardware is using those irqs for

balrog-kun commented 3 months ago

If the issue is that systemctl says inactive (dead), that's because of the --oneshot parameter. Not because of the IO errors, which are apparently normal.

nhorman commented 3 months ago

right, but, as you said, thats an artifact of the --oneshot argument provided, which I presume was added to the irqbalance unit file in the clear linux distribution (the sample service file included in this repo doesn't add that option). systemd handles services that exit after running differently, I think based on the Type field of the unit file. As such, I would think that they should have also added Type=oneshot or some such to the unit file as well to report its status correctly.

sincorchetes commented 3 months ago

So, is it normal behavior then?

This is the service file.

[Unit]
Description=irqbalance daemon
Documentation=man:irqbalance(1)
Documentation=https://github.com/Irqbalance/irqbalance
ConditionVirtualization=!container
ConditionCPUs=>1

[Service]
EnvironmentFile=-/usr/lib/irqbalance/defaults.env
EnvironmentFile=-/path/to/irqbalance.env
ExecStart=/usr/sbin/irqbalance --oneshot
Type=oneshot
ReadOnlyPaths=/
ReadWritePaths=/proc/irq
RestrictAddressFamilies=AF_UNIX AF_NETLINK
RuntimeDirectory=irqbalance/

[Install]
WantedBy=multi-user.target

This directory/file is not exists:

EnvironmentFile=-/usr/lib/irqbalance/defaults.env
EnvironmentFile=-/path/to/irqbalance.env
nhorman commented 3 months ago

well, I don't 100% know, its really a question for systemd people.

From the standpoint of what irqbalance is doing, yes, its absolutely correct. --oneshot as an argument tells irqbalance to make a pass over all the irqs, assign affinity to cpus/packages/nodes/etc according to whatever policy you have configured, and exit, which appears to be what its doing.

From a systemd perspective, a oneshot type of unit should understand that the execed application should be forked, allowed to run, and then exit, which is what appears to be happening

As for what systemd reports as the status of such a unit, I would have thought it would have a status other than inactive/dead, but I could be wrong about that, you would have to ask systemd developers what the expected behavior is there.

liuchao173 commented 2 months ago

It is normal behavior. For the oneshot service, the status after the execution is inactive (dead), which does not mean that the startup fails.