ColinIanKing / stress-ng

This is the stress-ng upstream project git repository. stress-ng will stress test a computer system in various selectable ways. It was designed to exercise various physical subsystems of a computer as well as the various operating system kernel interfaces.
https://github.com/ColinIanKing/stress-ng
GNU General Public License v2.0

Add a new option to sync stressors #397

Closed bra-fsn closed 5 months ago

bra-fsn commented 6 months ago

Currently the stress-ng parent starts the required number of stressors (specified with the --cpu X option), and those processes start working immediately. On a machine with a large number of CPUs this produces an imbalanced load: at the start of the test there is an interval in which not all stressors are running (some are already working, some are still starting up, and others have yet to be started), and at the end the opposite happens: some stressors have already finished while others are still working.

I propose a different option:

  1. stress-ng should start all stressors
  2. stressors should do their initialization work, signal that they are ready to work and wait for the start signal
  3. the parent process should wait for all stressors to come online
  4. if all stressors are online (initialized, waiting for the signal to start their work), the semaphore should be released and stressors should start their work
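
The four steps above amount to a start barrier. A minimal sketch of the idea, not stress-ng's implementation: threads stand in for the stressor processes, and `run_synchronised`, `stressor`, and the sleep-based "initialization" are hypothetical names and placeholders chosen for illustration.

```python
import threading
import time


def run_synchronised(n_workers: int) -> dict:
    """Start n_workers and hold them at a barrier until all are initialised."""
    # One extra party for the parent, so nobody is released before
    # every worker AND the parent have reached the barrier.
    barrier = threading.Barrier(n_workers + 1)
    ready_times = [0.0] * n_workers
    start_times = [0.0] * n_workers

    def stressor(idx: int) -> None:
        # Step 2: uneven initialisation work (placeholder sleep).
        time.sleep(0.001 * idx)
        ready_times[idx] = time.monotonic()
        # Step 2: signal "ready" and wait for the start signal.
        barrier.wait()
        start_times[idx] = time.monotonic()
        # The real stress work would begin here.

    workers = [threading.Thread(target=stressor, args=(i,)) for i in range(n_workers)]
    for w in workers:  # Step 1: start all stressors.
        w.start()
    # Steps 3 and 4: the parent joins the barrier; once all parties
    # have arrived, everyone is released at the same moment.
    barrier.wait()
    for w in workers:
        w.join()
    return {"ready": ready_times, "start": start_times}


if __name__ == "__main__":
    result = run_synchronised(8)
    print("last ready :", max(result["ready"]))
    print("first start:", min(result["start"]))
```

With a barrier, no worker records its start time before the slowest worker has finished initialising. A process-based version would need a barrier or semaphore placed in shared memory instead.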

ColinIanKing commented 5 months ago

Added new --sync-start option. This required quite a lot of synchronization re-working.

bra-fsn commented 5 months ago

Thanks!

I thought that synchronised workers would give more consistent results regardless of the test run time. However, I see quite the opposite. These runs are on a 192-core machine; the columns are the run time in seconds, bogo ops/s (real time), and CPU used per instance (%):

for i in $(seq 20); do echo -n "$i "; nice -n -20 /tmp/stress-ng --sync-start --metrics --cpu $(nproc) --cpu-method div16 -t $i | awk '/metrc.*cpu/ {print $9" "$11}'; done
1 477869.73 80.43
2 554823.79 93.46
3 520433.70 87.50
4 565221.56 95.04
5 556384.03 93.87
6 566873.13 95.28
7 584442.37 98.21
8 559508.90 94.05
9 586184.74 98.53
10 575939.72 96.80
11 577198.74 97.00
12 588203.51 98.85
13 592025.10 99.51
14 586132.33 98.51
15 577482.38 97.06
16 580788.74 97.62
17 580224.94 97.52
18 589357.12 99.11
19 582089.07 97.84
20 589228.90 99.02

Without the new option:

for i in $(seq 20); do echo -n "$i "; nice -n -20 /tmp/stress-ng --metrics --cpu $(nproc) --cpu-method div16 -t $i | awk '/metrc.*cpu/ {print $9" "$11}'; done
1 585719.85 98.49
2 593026.71 99.68
3 594105.76 99.85
4 594626.59 99.93
5 594701.02 99.94
6 594629.94 99.92
7 594876.16 99.96
8 594485.13 99.92
9 594928.30 99.97
10 594936.09 99.97
11 594941.18 99.98
12 594759.50 99.98
13 594608.77 99.97
14 594314.40 99.97
15 594053.59 99.97
16 594358.35 99.97
17 594617.06 99.98
18 594737.07 99.97
19 594230.50 99.98
20 593840.85 99.94
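
To put a number on "more consistent", one option is to compare the relative spread of the bogo ops/s column in the two tables. The sketch below copies the values from the runs above; the helper name `relative_spread` and the choice of (max - min) / mean as the metric are assumptions made for illustration.

```python
from statistics import mean

# bogo ops/s (real time) for run times of 1..20 seconds, copied from the tables above.
SYNC_START = [
    477869.73, 554823.79, 520433.70, 565221.56, 556384.03,
    566873.13, 584442.37, 559508.90, 586184.74, 575939.72,
    577198.74, 588203.51, 592025.10, 586132.33, 577482.38,
    580788.74, 580224.94, 589357.12, 582089.07, 589228.90,
]
DEFAULT = [
    585719.85, 593026.71, 594105.76, 594626.59, 594701.02,
    594629.94, 594876.16, 594485.13, 594928.30, 594936.09,
    594941.18, 594759.50, 594608.77, 594314.40, 594053.59,
    594358.35, 594617.06, 594737.07, 594230.50, 593840.85,
]


def relative_spread(samples):
    """Return (max - min) / mean of the samples, as a percentage."""
    return 100.0 * (max(samples) - min(samples)) / mean(samples)


if __name__ == "__main__":
    print(f"--sync-start: {relative_spread(SYNC_START):.2f}% spread")
    print(f"default     : {relative_spread(DEFAULT):.2f}% spread")
```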

For reference, this is a full output:

stress-ng: info:  [11777] setting to a 20 secs run per stressor
stress-ng: info:  [11777] dispatching hogs: 192 cpu
stress-ng: metrc: [11777] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [11777]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [11777] cpu            11900410     20.00   3838.93      0.14    595019.41        3099.82        99.98          1536
stress-ng: info:  [11777] skipped: 0
stress-ng: info:  [11777] passed: 192: cpu (192)
stress-ng: info:  [11777] failed: 0
stress-ng: info:  [11777] metrics untrustworthy: 0
stress-ng: info:  [11777] successful run completed in 20.07 secs

And this is with 1 core only:

stress-ng: info:  [11970] setting to a 20 secs run per stressor
stress-ng: info:  [11970] dispatching hogs: 1 cpu
stress-ng: metrc: [11970] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [11970]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [11970] cpu               62030     20.00     20.00      0.00      3101.49        3101.48       100.00          1536
stress-ng: info:  [11970] skipped: 0
stress-ng: info:  [11970] passed: 1: cpu (1)
stress-ng: info:  [11970] failed: 0
stress-ng: info:  [11970] metrics untrustworthy: 0
stress-ng: info:  [11970] successful run completed in 20.00 secs

What did I get wrong? 🤔
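
For reading the two full outputs above, it may help to reproduce the derived columns by hand. The sketch below assumes that "CPU used per instance" is (usr + sys) / (real time × instances); that formula is inferred from the printed numbers, not taken from the stress-ng sources, and both function names are hypothetical.

```python
def cpu_used_per_instance(usr_secs, sys_secs, real_secs, instances):
    """Assumed formula: CPU time consumed as a percentage of the wall time available per instance."""
    return 100.0 * (usr_secs + sys_secs) / (real_secs * instances)


def ops_per_sec_per_instance(bogo_ops_per_sec, instances):
    """Split the aggregate real-time bogo ops/s evenly across instances."""
    return bogo_ops_per_sec / instances


if __name__ == "__main__":
    # 192-instance run: usr 3838.93 s, sys 0.14 s, real 20.00 s, 595019.41 bogo ops/s
    print(cpu_used_per_instance(3838.93, 0.14, 20.00, 192))
    print(ops_per_sec_per_instance(595019.41, 192))
    # 1-instance run: usr 20.00 s, sys 0.00 s, real 20.00 s, 3101.49 bogo ops/s
    print(cpu_used_per_instance(20.00, 0.00, 20.00, 1))
    print(ops_per_sec_per_instance(3101.49, 1))
```

Dividing the aggregate rate by the instance count gives a per-core figure that can be compared directly with the single-core run.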