ColinIanKing / stress-ng

This is the stress-ng upstream project git repository. stress-ng will stress test a computer system in various selectable ways. It was designed to exercise various physical subsystems of a computer as well as the various operating system kernel interfaces.
https://github.com/ColinIanKing/stress-ng
GNU General Public License v2.0

trying to bisect a fork issue on sparc64 #115

Closed: mator closed this issue 3 years ago

mator commented 3 years ago

Hello!

I'm trying to bisect a fork issue on the sparc64 platform that was introduced recently (?). Currently it looks like this:

$ git describe --tag
V0.12.06-13-ge91e93f6
$ ./stress-ng --fork 1  --timeout 10s --metrics-brief
stress-ng: info:  [1725645] dispatching hogs: 1 fork
stress-ng: info:  [1725645] successful run completed in 1.15s
stress-ng: info:  [1725645] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [1725645]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [1725645] fork                825      0.00      0.00      0.00         0.00           0.00
stress-ng: fail:  [1725645] fork instance 0 corrupted bogo-ops counter, 825 vs 0
stress-ng: fail:  [1725645] fork instance 0 hash error in bogo-ops counter and run flag, 603565027 vs 0
stress-ng: fail:  [1725645] metrics-check: stressor metrics corrupted, data is compromised

A "good" version looks like this:

$ git describe --tag 9c5f94ce735dd7ac2be88e24688754a3c0c61130
V0.11.21
$ ./stress-ng --fork 1  --timeout 10s --metrics-brief
stress-ng: info:  [1751827] dispatching hogs: 1 fork
stress-ng: info:  [1751827] successful run completed in 10.00s
stress-ng: info:  [1751827] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [1751827]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [1751827] fork               7727     10.00      3.67      6.78       772.68       739.43

So far, my bisect log looks like this:

$ git bisect log | grep -c "git bisect skip"
62

Here, git bisect skip is used whenever stress-ng fails to compile, for example with the following error (eb910081 is a random non-compilable commit):

$ git checkout eb910081
$ make
make makeconfig -j1
make[1]: Entering directory '/1/mator/stress-ng-1'
make[1]: Leaving directory '/1/mator/stress-ng-1'
make stress-ng
make[1]: Entering directory '/1/mator/stress-ng-1'
CC core-shim.c
core-shim.c: In function ‘shim_sbrk’:
core-shim.c:1002:9: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
 1002 |  return (void *)shim_enosys(0, increment);
      |         ^
core-shim.c: In function ‘shim_process_madvise’:
core-shim.c:1659:26: error: ‘__process_madvise’ undeclared (first use in this function); did you mean ‘shim_process_madvise’?
 1659 |  return (ssize_t)syscall(__process_madvise, pidfd,
      |                          ^~~~~~~~~~~~~~~~~
      |                          shim_process_madvise
core-shim.c:1659:26: note: each undeclared identifier is reported only once for each function it appears in
core-shim.c:1665:1: warning: control reaches end of non-void function [-Wreturn-type]
 1665 | }
      | ^
make[1]: *** [Makefile:367: core-shim.o] Error 1
make[1]: Leaving directory '/1/mator/stress-ng-1'
make: *** [Makefile:351: all] Error 2

I'm going to finish bisecting, but I wonder: the bisect would just end up at a compilable version and not at the actual commit that introduced the fork issue. Any advice so far?

PS: the bisect ended with:

$ git bisect skip
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
... [ a full screen of commit IDs ] ...
We cannot bisect more!

ColinIanKing commented 3 years ago

It may be worth replacing __process_madvise with __NR_process_madvise in that failed build to see whether it fixes the build failure, so you can continue the bisect without needing to skip.
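
A minimal sketch of how that suggestion could be automated with git bisect run, offered as an illustration rather than as anything from the stress-ng tree. The helper names bisect_step and patch_shim are hypothetical. The sketch assumes the only build breakage is the __process_madvise typo in core-shim.c and that a bad run can be recognised by the "stress-ng: fail:" lines shown earlier in this thread. It relies on the documented git bisect run convention that exit code 0 means good, 125 means skip, and any other code from 1 to 127 means bad.

```shell
# Hypothetical helpers for a `git bisect run` script; all names are assumptions.

# Rewrite the misspelt syscall number so that commits such as eb910081 build.
# Uses a temporary file instead of `sed -i` to stay portable.
patch_shim() {
    [ -f "$1" ] || return 0
    sed 's/syscall(__process_madvise,/syscall(__NR_process_madvise,/' "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"
}

# Run a build command and a test command, and translate the result into
# the exit codes `git bisect run` understands.
bisect_step() {
    build_cmd=$1
    test_cmd=$2
    # A commit that still cannot be built cannot be judged: skip it.
    sh -c "$build_cmd" >/dev/null 2>&1 || return 125
    # Capture the test output; the exit status of the test itself is ignored.
    out=$(sh -c "$test_cmd" 2>&1) || true
    case $out in
        *"stress-ng: fail:"*) return 1 ;;   # corrupted metrics: bad commit
    esac
    return 0                                # no failure lines: good commit
}

# Example driver, left commented out so the functions can be sourced safely:
#   patch_shim core-shim.c
#   bisect_step 'make clean; make' \
#       './stress-ng --fork 1 --timeout 10s --metrics-brief'
#   status=$?
#   git checkout -- core-shim.c   # drop the temporary patch before the next step
#   exit "$status"
```

With the driver lines enabled and the file saved as an executable script (say bisect-fork.sh, a hypothetical name), the bisect could be started with git bisect start, the known bad and good commits marked, and then left to run with git bisect run ./bisect-fork.sh. Commits that still fail to build for other reasons are skipped rather than misreported as bad.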

ColinIanKing commented 3 years ago

I've been running this on a sparc64 Debian QEMU installation with a 4.15 kernel and noticed that fork() sometimes returns the wrong PID: the returned PID matches that of the existing stressor, which then gets killed, causing the stress test to stop prematurely. This only happens to me when I run with 2 or more stressors. It's most bizarre.

ColinIanKing commented 3 years ago

I've pushed a fix for the issues I'm seeing on SPARC64. Perhaps you could pull the latest tip and rebuild and let me know if this helps. What kernel are you using?

mator commented 3 years ago

@ColinIanKing seems to be fixed:

$ git desc
V0.12.06-20-g3466c47c

$ ./stress-ng -v --fork 2  --timeout 10s --metrics-brief
stress-ng: debug: [170539] system: Linux ttip 5.12.0-rc5 #204 SMP Mon Mar 29 10:19:44 MSK 2021 sparc64
stress-ng: debug: [170539] RAM total: 33.5G, RAM free: 31.8G, SWAP free: 768.7M
stress-ng: debug: [170539] 24 processors online, 256 processors configured
stress-ng: info:  [170539] dispatching hogs: 2 fork
stress-ng: debug: [170539] cache allocate: using defaults, can't determine cache details from sysfs
stress-ng: debug: [170539] cache allocate: default cache size: 2048K
stress-ng: debug: [170539] starting stressors
stress-ng: debug: [170539] 2 stressors started
stress-ng: debug: [170540] stress-ng-fork: started [170540] (instance 0)
stress-ng: debug: [170541] stress-ng-fork: started [170541] (instance 1)
stress-ng: debug: [170540] stress-ng-fork: exited [170540] (instance 0)
stress-ng: debug: [170541] stress-ng-fork: exited [170541] (instance 1)
stress-ng: debug: [170539] process [170540] terminated
stress-ng: debug: [170539] process [170541] terminated
stress-ng: info:  [170539] successful run completed in 10.00s
stress-ng: info:  [170539] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [170539]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [170539] fork              20208     10.00     13.09      6.39      2020.79        1037.37
stress-ng: debug: [170539] metrics-check: all stressor metrics validated and sane

Thanks.

PS: This strace commit could probably explain some sparc64 fork issues: https://github.com/strace/strace/commit/c4cff2a7a66629bd95fda9bada84a639c59cda3c

ColinIanKing commented 3 years ago

Yep, that strace commit explains it. I was using syscall(__NR_fork) on a random set of the fork calls, hence getting the parent PID, and that explains the random killing of the parent stressor. Doh.