ColinIanKing / stress-ng

This is the stress-ng upstream project git repository. stress-ng will stress test a computer system in various selectable ways. It was designed to exercise various physical subsystems of a computer as well as the various operating system kernel interfaces.
https://github.com/ColinIanKing/stress-ng
GNU General Public License v2.0

The summary of prio-inv is inconsistent with the failure message #437

Closed woodrow-shen closed 1 month ago

woodrow-shen commented 1 month ago

Hi @ColinIanKing, we've hit an issue when running only prio-inv on our RISC-V platform:

Stress-ng version: V0.18.00, kernel version: 6.7.9

root@sifive-fpga:~# stress-ng --seq 0 -t 15 --pathological --verbose --times --tz --prio-inv 4
stress-ng: debug: [1100] invoked with 'stress-ng --seq 0 -t 15 --pathological --verbose --times --tz --prio-inv 4' by user 0 'root'
stress-ng: debug: [1100] stress-ng 0.18.00 gea5b6cd62237
stress-ng: debug: [1100] system: Linux 6.7.9 #1 SMP Fri Sep 27 11:52:50 UTC 2024 riscv64, gcc 13.2.0, glibc 2.37, little endian
stress-ng: debug: [1100] RAM total: 3.8G, RAM free: 3.7G, swap free: 0.0
stress-ng: debug: [1100] temporary file path: '/home/root', filesystem type: ext2 (1184894 blocks available)
stress-ng: debug: [1100] 4 processors online, 4 processors configured
stress-ng: info:  [1100] setting to a 15 secs run per stressor
stress-ng: debug: [1100] cache allocate: using cache maximum level L2
stress-ng: debug: [1100] CPU data cache: L1: 32K, L2: 1024K
stress-ng: debug: [1100] cache allocate: shared cache buffer size: 1024K
stress-ng: info:  [1100] dispatching hogs: 4 prio-inv
stress-ng: debug: [1100] starting stressors
stress-ng: debug: [1101] prio-inv: [1101] started (instance 0 on CPU 2)
stress-ng: debug: [1102] prio-inv: [1102] started (instance 1 on CPU 3)
stress-ng: debug: [1100] 4 stressors started
stress-ng: debug: [1103] prio-inv: [1103] started (instance 2 on CPU 1)
stress-ng: debug: [1104] prio-inv: [1104] started (instance 3 on CPU 0)
stress-ng: fail:  [1102] prio-inv: mutex priority inheritance appears incorrect, low priority process has far more run time (1.93 secs) than high priority process (0.00 secs)
stress-ng: debug: [1101] prio-inv: [1101] exited (instance 0 on CPU 3)
stress-ng: debug: [1102] prio-inv: [1102] exited (instance 1 on CPU 0)
stress-ng: debug: [1103] prio-inv: [1103] exited (instance 2 on CPU 2)
stress-ng: debug: [1104] prio-inv: [1104] exited (instance 3 on CPU 1)
stress-ng: debug: [1100] prio-inv: [1101] terminated (success)
stress-ng: debug: [1100] prio-inv: [1102] terminated (success)
stress-ng: debug: [1100] prio-inv: [1103] terminated (success)
stress-ng: debug: [1100] prio-inv: [1104] terminated (success)
stress-ng: debug: [1100] metrics-check: all stressor metrics validated and sane
stress-ng: info:  [1100] thermal zone temperatures not available
stress-ng: info:  [1100] for a 15.23s run time:
stress-ng: info:  [1100]      60.90s available CPU time
stress-ng: info:  [1100]       5.57s user time   (  9.15%)
stress-ng: info:  [1100]      52.34s system time ( 85.94%)
stress-ng: info:  [1100]      57.91s total time  ( 95.09%)
stress-ng: info:  [1100] load average: 8.05 4.79 2.51
stress-ng: info:  [1100] skipped: 0
stress-ng: info:  [1100] passed: 4: prio-inv (4)
stress-ng: info:  [1100] failed: 0
stress-ng: info:  [1100] metrics untrustworthy: 0
stress-ng: info:  [1100] successful run completed in 15.23 secs

As you can see, prio-inv reported the failure "mutex priority inheritance appears incorrect, low priority process has far more run time (1.93 secs) than high priority process (0.00 secs)". We're still investigating this on our side, but in the meantime we'd like to check whether the failure is reasonable, given that the final summary still reports the run as passed. We're also building the master branch to verify this, and I'm raising it here for clarification in advance.

Thanks, Woodrow

ColinIanKing commented 1 month ago

This should not be a failure message, but instead a warning. The heuristics for determining priority inversion failures are based on some scheduler run time stats which are not 100% reliable.
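
To illustrate the kind of heuristic being described, here is a minimal standalone sketch (not the actual stress-ng code; the helper names, nice values and threshold are made up for illustration) that compares the CPU time accumulated by two children of different priority, using the per-child rusage that wait4() returns. As the comments note, such a comparison can mis-fire because CPU time accounting is coarse and a mostly-blocked process legitimately accrues almost nothing:

/*
 * Illustrative sketch only, not stress-ng's implementation: compare the
 * CPU time two child processes of different priority accumulated, using
 * the per-child rusage filled in by wait4(). A heuristic like this can
 * mis-fire because the kernel's run time accounting is coarse and a
 * blocked process legitimately accrues almost no CPU time.
 */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <sys/resource.h>

static double rusage_cpu_secs(const struct rusage *ru)
{
	/* user + system CPU time in seconds */
	return (double)ru->ru_utime.tv_sec + (double)ru->ru_utime.tv_usec / 1e6 +
	       (double)ru->ru_stime.tv_sec + (double)ru->ru_stime.tv_usec / 1e6;
}

static pid_t spawn_spinner(int niceness, int seconds)
{
	pid_t pid = fork();

	if (pid == 0) {
		time_t end = time(NULL) + seconds;

		(void)nice(niceness);		/* lower priority if niceness > 0 */
		while (time(NULL) < end)
			;			/* busy loop to accumulate CPU time */
		_exit(EXIT_SUCCESS);
	}
	return pid;
}

int main(void)
{
	struct rusage ru_hi, ru_lo;
	int status;

	pid_t hi = spawn_spinner(0, 5);		/* "high" priority child */
	pid_t lo = spawn_spinner(10, 5);	/* "low" priority child */

	(void)wait4(hi, &status, 0, &ru_hi);
	(void)wait4(lo, &status, 0, &ru_lo);

	double t_hi = rusage_cpu_secs(&ru_hi);
	double t_lo = rusage_cpu_secs(&ru_lo);

	printf("high priority: %.2f secs, low priority: %.2f secs\n", t_hi, t_lo);

	/* heuristic check: warn (rather than fail) if the low priority
	 * child got far more CPU time than the high priority one;
	 * the 2x threshold is arbitrary here */
	if (t_lo > 2.0 * t_hi)
		fprintf(stderr, "warning: low priority child ran much longer\n");

	return 0;
}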

Fix committed:

commit e870dd24cb6af5dcc131298243f70db1d0baec44 (HEAD -> master)
Author: Colin Ian King <colin.i.king@gmail.com>
Date:   Wed Oct 9 23:10:37 2024 +0100

    stress-prio-inv: make priority inheritance error a warning
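
Conceptually the change is along these lines (a hypothetical sketch with placeholder names, not the real diff; see the commit above for the actual change): a failure makes the stressor instance report a failing status that ends up counted in the end-of-run summary, whereas a warning is only logged and the instance still exits successfully, i.e. "passed":

/*
 * Hypothetical illustration of failure-vs-warning handling; function and
 * variable names are placeholders, not stress-ng internals.
 */
#include <stdio.h>
#include <stdlib.h>

static int check_prio_inheritance(double low_cpu_secs, double high_cpu_secs)
{
	if (low_cpu_secs > high_cpu_secs * 4.0) {
		/* before: treated as a test failure
		 *   fprintf(stderr, "fail: priority inheritance appears incorrect\n");
		 *   return EXIT_FAILURE;
		 */

		/* after: downgraded to a warning, the run still passes */
		fprintf(stderr, "warning: priority inheritance appears incorrect, "
			"run time stats may be unreliable\n");
	}
	return EXIT_SUCCESS;
}

int main(void)
{
	/* values taken from the failure message in the log above */
	return check_prio_inheritance(1.93, 0.00);
}
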
woodrow-shen commented 1 month ago

@ColinIanKing Thanks for the update.

JimmyHoSF commented 1 month ago

Hi @ColinIanKing, do you mean that getrusage is not reliable? In our case, the high-priority process does not show any CPU usage (0.00 secs). Could this also be a case of the run time stats being miscalculated?
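
For what it's worth, a trivial standalone example (nothing to do with stress-ng itself) shows that a process which spends its time blocked can legitimately report ~0.00 secs of CPU time from getrusage, even after seconds of wall-clock time, which may be part of what we're seeing here:

/*
 * Illustrative only: a process that spends its time blocked (here simply
 * sleeping, but blocking on a mutex behaves the same way) accrues almost
 * no CPU time, so getrusage() reports ~0.00 secs even though the process
 * existed for several seconds of wall-clock time. Very short bursts of
 * work can also be lost to the granularity of CPU time accounting.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
	struct rusage ru;

	sleep(5);	/* blocked, not running: no CPU time accrues */

	if (getrusage(RUSAGE_SELF, &ru) == 0) {
		double cpu = (double)ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6 +
			     (double)ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
		printf("CPU time after 5 secs of sleeping: %.2f secs\n", cpu);
	}
	return 0;
}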