ColinIanKing / stress-ng

This is the stress-ng upstream project git repository. stress-ng will stress test a computer system in various selectable ways. It was designed to exercise various physical subsystems of a computer as well as the various operating system kernel interfaces.
https://github.com/ColinIanKing/stress-ng
GNU General Public License v2.0

icache can't be tested because of signal 11 'SIGSEGV' #447

Closed KianTechHub closed 3 weeks ago

KianTechHub commented 1 month ago

./stress-ng --icache 8192

stress-ng: info: [3703] stressor terminated with unexpected signal 11 'SIGSEGV'
stress-ng: info: [3719] stressor terminated with unexpected signal 11 'SIGSEGV'
stress-ng: info: [3771] stressor terminated with unexpected signal 11 'SIGSEGV'
stress-ng: info: [3802] stressor terminated with unexpected signal 11 'SIGSEGV'
stress-ng: info: [3828] stressor terminated with unexpected signal 11 'SIGSEGV'
stress-ng: info: [27138] skipped: 0
stress-ng: info: [27138] passed: 4734: icache (4734)
stress-ng: info: [27138] failed: 3458: icache (3458)
stress-ng: info: [27138] metrics untrustworthy: 0
stress-ng: info: [27138] unsuccessful run completed in 12.22 secs

Why does the icache stressor die with SIGSEGV?

I am using an arm64 Android system. Almost half of all icache instances are reported as failed. My system has 4 cores, and changing the parameters still produces the error.

console:/data/local/tmp # ./stress-ng --icache 4
stress-ng: info: [3916] defaulting to a 1 day run per stressor
stress-ng: info: [3916] dispatching hogs: 4 icache
stress-ng: info: [3917] stressor terminated with unexpected signal 11 'SIGSEGV'
stress-ng: info: [3916] skipped: 0
stress-ng: info: [3916] passed: 3: icache (3)
stress-ng: info: [3916] failed: 1: icache (1)
stress-ng: info: [3916] metrics untrustworthy: 0
stress-ng: info: [3916] unsuccessful run completed in 0 secs

2|console:/data/local/tmp # ./stress-ng --icache 1
stress-ng: info: [3921] defaulting to a 1 day run per stressor
stress-ng: info: [3921] dispatching hogs: 1 icache
stress-ng: info: [3922] stressor terminated with unexpected signal 11 'SIGSEGV'
stress-ng: warn: [3921] metrics-check: all bogo-op counters are zero, data may be incorrect
stress-ng: info: [3921] skipped: 0
stress-ng: info: [3921] passed: 0
stress-ng: info: [3921] failed: 1: icache (1)
stress-ng: info: [3921] metrics untrustworthy: 0
stress-ng: info: [3921] unsuccessful run completed in 0 secs

2|console:/data/local/tmp # ./stress-ng --icache 2 -v
stress-ng: debug: [3927] invoked with './stress-ng --icache 2 -v' by user 0
stress-ng: debug: [3927] stress-ng 0.18.05 ga808c8977db7
stress-ng: debug: [3927] system: Linux localhost 5.4.254+ #1 SMP PREEMPT Sat Sep 7 22:27:16 CST 2024 aarch64, clang 17.0.2, unknown libc version, little endian
stress-ng: debug: [3927] RAM total: 3.0G, RAM free: 1.1G, swap free: 2.3G
stress-ng: debug: [3927] temporary file path: '/data/local/tmp', filesystem type: f2fs (2177311 blocks available)
stress-ng: debug: [3927] CPUs have 3 idle states: BUSY, WFI, cpu-sleep-0
stress-ng: debug: [3927] 4 processors online, 4 processors configured
stress-ng: info: [3927] defaulting to a 1 day run per stressor
stress-ng: debug: [3927] cache allocate: using defaults, cannot determine cache level details
stress-ng: debug: [3927] cache allocate: shared cache buffer size: 2048K
stress-ng: info: [3927] dispatching hogs: 2 icache
stress-ng: debug: [3927] starting stressors
stress-ng: debug: [3927] 2 stressors started
stress-ng: debug: [3928] icache: [3928] started (instance 0 on CPU 0)
stress-ng: debug: [3929] icache: [3929] started (instance 1 on CPU 1)
stress-ng: info: [3928] stressor terminated with unexpected signal 11 'SIGSEGV'
stress-ng: debug: [3927] icache: [3928] aborted via a termination signal
stress-ng: debug: [3927] icache: [3928] terminated (killed by signal)
stress-ng: debug: [3929] icache: [3929] exited (instance 1 on CPU 1)
stress-ng: debug: [3927] icache: [3929] terminated (success)
stress-ng: debug: [3927] metrics-check: all stressor metrics validated and sane
stress-ng: info: [3927] skipped: 0
stress-ng: info: [3927] passed: 1: icache (1)
stress-ng: info: [3927] failed: 1: icache (1)
stress-ng: info: [3927] metrics untrustworthy: 0
stress-ng: info: [3927] unsuccessful run completed in 0 secs

ColinIanKing commented 1 month ago

I've just pushed a few more commits to the repository that should be able to catch and debug where the SIGSEGV is occurring. Would you mind pulling these new changes, rebuilding, and re-testing so we can figure out where the issue is occurring?

KianTechHub commented 1 month ago

OK, I will do it. As soon as I find something, I will report back.

KianTechHub commented 1 month ago

By the way, I modified the code: I added an #include to core-helper.c and another to stress-workload.c.

Without these changes the compiler reports an error. I am building statically with the Android NDK.

KianTechHub commented 1 month ago

I also commented out the following two lines in the build-generated config.h. If they are left enabled, the build may fail with errors about undefined symbols:

//#define HAVE_PTHREAD_PRIO_INHERIT
//#define HAVE_PTHREAD_PRIO_NONE

KianTechHub commented 1 month ago

This is the new failing test run after pulling the latest code, at commit 4a11ac95d6549284df416326a80c4e9db0030740:

console:/data/local/tmp # ./stress-ng --icache 2 -v
stress-ng: debug: [10770] invoked with './stress-ng --icache 2 -v' by user 0
stress-ng: debug: [10770] stress-ng 0.18.05 g4a11ac95d654
stress-ng: debug: [10770] system: Linux localhost 5.4.254+ #1 SMP PREEMPT Wed May 8 09:34:33 CST 2024 aarch64, clang 17.0.2, unknown libc version, little endian
stress-ng: debug: [10770] RAM total: 3.0G, RAM free: 1.9G, swap free: 2.3G
stress-ng: debug: [10770] temporary file path: '/data/local/tmp', filesystem type: f2fs (2185794 blocks available)
stress-ng: debug: [10770] CPUs have 3 idle states: BUSY, WFI, cpu-sleep-0
stress-ng: debug: [10770] 4 processors online, 4 processors configured
stress-ng: info: [10770] defaulting to a 1 day run per stressor
stress-ng: debug: [10770] cache allocate: using defaults, cannot determine cache level details
stress-ng: debug: [10770] cache allocate: shared cache buffer size: 2048K
stress-ng: info: [10770] dispatching hogs: 2 icache
stress-ng: debug: [10770] starting stressors
stress-ng: debug: [10770] 2 stressors started
stress-ng: debug: [10771] icache: [10771] started (instance 0 on CPU 3)
stress-ng: debug: [10772] icache: [10772] started (instance 1 on CPU 0)
stress-ng: debug: [10771] caught SIGSEGV, address 0x0000000000000f04 (SEGV_MAPERR)
stress-ng: debug: [10771] stress-ng: info: 0x0000000000000f00 not readable
stress-ng: debug: [10771] stress-ng: info: 0x0000000000000f10 not readable
stress-ng: debug: [10771] stress-ng: info: 0x0000000000000f20 not readable
stress-ng: error: [10770] icache: [10771] terminated with an error, exit status=2 (stressor failed)
stress-ng: debug: [10770] icache: [10771] terminated (stressor failed)
stress-ng: debug: [10772] caught SIGSEGV, address 0x0000000000000f04 (SEGV_MAPERR)
stress-ng: debug: [10772] stress-ng: info: 0x0000000000000f00 not readable
stress-ng: debug: [10772] stress-ng: info: 0x0000000000000f10 not readable
stress-ng: debug: [10772] stress-ng: info: 0x0000000000000f20 not readable
stress-ng: error: [10770] icache: [10772] terminated with an error, exit status=2 (stressor failed)
stress-ng: debug: [10770] icache: [10772] terminated (stressor failed)
stress-ng: warn: [10770] metrics-check: all bogo-op counters are zero, data may be incorrect
stress-ng: debug: [10770] metrics-check: all stressor metrics validated and sane
stress-ng: info: [10770] skipped: 0
stress-ng: info: [10770] passed: 0
stress-ng: info: [10770] failed: 2: icache (2)
stress-ng: info: [10770] metrics untrustworthy: 0
stress-ng: info: [10770] unsuccessful run completed in 0 secs
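
For readers unfamiliar with the "caught SIGSEGV, address ... (SEGV_MAPERR)" and "not readable" lines in the log above, here is a minimal C sketch of how a process can catch SIGSEGV and report the faulting address and reason. It is not stress-ng's actual handler; the names `segv_handler`, `probe_read` and `report_probe` are hypothetical and chosen only for illustration. It assumes a POSIX system that supports `sigaction()` with `SA_SIGINFO`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static sigjmp_buf probe_env;               /* where to resume after a fault */
static volatile sig_atomic_t fault_code;   /* si_code of the last fault */
static void *volatile fault_addr;          /* si_addr of the last fault */

/* SA_SIGINFO handler: record the fault details, then unwind to the probe */
static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    fault_addr = info->si_addr;
    fault_code = info->si_code;
    siglongjmp(probe_env, 1);
}

/*
 * Try to read one byte at addr. Returns 0 if the byte is readable,
 * the fault's si_code (e.g. SEGV_MAPERR for an unmapped address,
 * SEGV_ACCERR for a protection fault) if it is not, or -1 if the
 * handler could not be installed.
 */
static int probe_read(const volatile uint8_t *addr)
{
    struct sigaction sa, old;
    int ret = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) < 0)
        return -1;

    if (sigsetjmp(probe_env, 1) == 0) {
        uint8_t value = *addr;             /* volatile read, may fault */
        (void)value;
    } else {
        ret = fault_code;                  /* we got here via the handler */
    }
    (void)sigaction(SIGSEGV, &old, NULL);  /* restore the previous handler */
    return ret;
}

/* Print a one-line report similar in spirit to the debug log above */
static void report_probe(const volatile uint8_t *addr)
{
    int code = probe_read(addr);

    if (code == 0)
        printf("%p readable\n", (void *)(uintptr_t)addr);
    else
        printf("%p not readable, fault at %p (%s)\n",
               (void *)(uintptr_t)addr, fault_addr,
               code == SEGV_MAPERR ? "SEGV_MAPERR" :
               code == SEGV_ACCERR ? "SEGV_ACCERR" : "other");
}
```

Calling `report_probe()` on an address such as `(const volatile uint8_t *)(uintptr_t)0xf04` would normally report an unmapped address on Linux, since the lowest pages of the address space are not mapped for ordinary processes. A fault address that low usually points at a near-NULL pointer being dereferenced or jumped to, rather than at the buffer under test.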

ColinIanKing commented 1 month ago

That's very unexpected. Can you comment out the following two lines in function stress_icache_func() in stress-icache.c, rebuild and re-test and see if the icache flushing is causing the issue:

                    //shim_flush_icache((char *)page, (char *)page + 64);
                    *vaddr = val;
                    //shim_flush_icache((char *)page, (char *)page + 64);
ColinIanKing commented 1 month ago

> By the way, I modified the code: I added an #include to core-helper.c and another to stress-workload.c.
>
> Without these changes the compiler reports an error. I am building statically with the Android NDK.

Thanks for letting me know, I've added these changes to stress-ng.

ColinIanKing commented 1 month ago

> I also commented out the following two lines in the build-generated config.h. If they are left enabled, the build may fail with errors about undefined symbols:
>
> //#define HAVE_PTHREAD_PRIO_INHERIT
> //#define HAVE_PTHREAD_PRIO_NONE

I've fixed this and pushed it to the repo.

KianTechHub commented 1 month ago

> > I also commented out the following two lines in the build-generated config.h. If they are left enabled, the build may fail with errors about undefined symbols:
> >
> > //#define HAVE_PTHREAD_PRIO_INHERIT
> > //#define HAVE_PTHREAD_PRIO_NONE
>
> I've fixed this and pushed it to the repo.

All compilation errors have been resolved, and it can be successfully compiled without any modifications.

KianTechHub commented 1 month ago

> > By the way, I modified the code: I added an #include to core-helper.c and another to stress-workload.c.
> >
> > Without these changes the compiler reports an error. I am building statically with the Android NDK.
>
> Thanks for letting me know, I've added these changes to stress-ng

All compilation errors have been resolved, and it can be successfully compiled without any modifications.

KianTechHub commented 1 month ago

> That's very unexpected. Can you comment out the following two lines in function stress_icache_func() in stress-icache.c, rebuild and re-test and see if the icache flushing is causing the issue:
>
>                     //shim_flush_icache((char *)page, (char *)page + 64);
>                     *vaddr = val;
>                     //shim_flush_icache((char *)page, (char *)page + 64);

Test results: Some behavior changes occurred, and it seems to have improved compared to before the changes. Some tests were successful. However, during testing, there were instances where the program seemed to freeze, and I had to use Ctrl+C to exit the program.

./stress-ng --icache 4 -v

stress-ng: debug: [2042] invoked with './stress-ng --icache 4 -v' by user 0
stress-ng: debug: [2042] stress-ng 0.18.05 g1cb7016a5151
stress-ng: debug: [2042] system: Linux localhost 5.4.254+ #1 SMP PREEMPT Wed May 8 09:34:33 CST 2024 aarch64, clang 17.0.2, unknown libc version, little endian
stress-ng: debug: [2042] RAM total: 3.0G, RAM free: 1.5G, swap free: 2.3G
stress-ng: debug: [2042] temporary file path: '/data/local/tmp', filesystem type: f2fs (2185584 blocks available)
stress-ng: debug: [2042] CPUs have 3 idle states: BUSY, WFI, cpu-sleep-0
stress-ng: debug: [2042] 4 processors online, 4 processors configured
stress-ng: info: [2042] defaulting to a 1 day run per stressor
stress-ng: debug: [2042] cache allocate: using defaults, cannot determine cache level details
stress-ng: debug: [2042] cache allocate: shared cache buffer size: 2048K
stress-ng: info: [2042] dispatching hogs: 4 icache
stress-ng: debug: [2042] starting stressors
stress-ng: debug: [2043] icache: [2043] started (instance 0 on CPU 1)
stress-ng: debug: [2044] icache: [2044] started (instance 1 on CPU 3)
stress-ng: debug: [2042] 4 stressors started
stress-ng: debug: [2045] icache: [2045] started (instance 2 on CPU 0)
stress-ng: debug: [2046] icache: [2046] started (instance 3 on CPU 3)
stress-ng: debug: [2045] caught SIGSEGV, address 0x0000000000000f04 (SEGV_MAPERR)
stress-ng: debug: [2045] stress-ng: info: 0x0000000000000f00 not readable
stress-ng: debug: [2045] stress-ng: info: 0x0000000000000f10 not readable
stress-ng: debug: [2045] stress-ng: info: 0x0000000000000f20 not readable
stress-ng: debug: [2043] caught SIGSEGV, address 0x0000000000000f04 (SEGV_MAPERR)
stress-ng: debug: [2043] stress-ng: info: 0x0000000000000f00 not readable
stress-ng: debug: [2043] stress-ng: info: 0x0000000000000f10 not readable
stress-ng: debug: [2043] stress-ng: info: 0x0000000000000f20 not readable
stress-ng: error: [2042] icache: [2043] terminated with an error, exit status=2 (stressor failed)
stress-ng: debug: [2042] icache: [2043] terminated (stressor failed)
stress-ng: debug: [2046] caught SIGSEGV, address 0x0000000000000f04 (SEGV_MAPERR)
stress-ng: debug: [2046] stress-ng: info: 0x0000000000000f00 not readable
stress-ng: debug: [2046] stress-ng: info: 0x0000000000000f10 not readable
stress-ng: debug: [2046] stress-ng: info: 0x0000000000000f20 not readable

^C //After pressing Enter continuously without effect, I pressed Ctrl+C, and the program then continued.

stress-ng: debug: [2044] icache: [2044] exited (instance 1 on CPU 3)
stress-ng: debug: [2042] icache: [2044] terminated (success)
stress-ng: error: [2042] icache: [2045] terminated with an error, exit status=2 (stressor failed)
stress-ng: debug: [2042] icache: [2045] terminated (stressor failed)
stress-ng: error: [2042] icache: [2046] terminated with an error, exit status=2 (stressor failed)
stress-ng: debug: [2042] icache: [2046] terminated (stressor failed)
stress-ng: debug: [2042] metrics-check: all stressor metrics validated and sane
stress-ng: info: [2042] skipped: 0
stress-ng: info: [2042] passed: 1: icache (1)
stress-ng: info: [2042] failed: 3: icache (3)
stress-ng: info: [2042] metrics untrustworthy: 0
stress-ng: info: [2042] unsuccessful run completed in 8.63 secs

ColinIanKing commented 1 month ago

OK, can you also comment out:

                        icache_func();
                        //(void)shim_cacheflush((char *)page, page_size, SHIM_ICACHE);

...then rebuild and retest to see if the ICACHE flush is causing the issue.

KianTechHub commented 1 month ago

Result with the shim_cacheflush call commented out:

console:/data/local/tmp # ./stress-ng --icache 8 -v
stress-ng: debug: [2847] invoked with './stress-ng --icache 8 -v' by user 0
stress-ng: debug: [2847] stress-ng 0.18.05 g1cb7016a5151
stress-ng: debug: [2847] system: Linux localhost 5.4.254+ #1 SMP PREEMPT Wed Oct 23 17:20:23 CST 2024 aarch64, clang 17.0.2, unknown libc version, little endian
stress-ng: debug: [2847] RAM total: 3.0G, RAM free: 1.1G, swap free: 2.3G
stress-ng: debug: [2847] temporary file path: '/data/local/tmp', filesystem type: f2fs (2187892 blocks available)
stress-ng: debug: [2847] CPUs have 3 idle states: BUSY, WFI, cpu-sleep-0
stress-ng: debug: [2847] 4 processors online, 4 processors configured
stress-ng: info: [2847] defaulting to a 1 day run per stressor
stress-ng: debug: [2847] cache allocate: using defaults, cannot determine cache level details
stress-ng: debug: [2847] cache allocate: shared cache buffer size: 2048K
stress-ng: info: [2847] dispatching hogs: 8 icache
stress-ng: debug: [2847] starting stressors
stress-ng: debug: [2848] icache: [2848] started (instance 0 on CPU 3)
stress-ng: debug: [2849] icache: [2849] started (instance 1 on CPU 2)
stress-ng: debug: [2850] icache: [2850] started (instance 2 on CPU 0)
stress-ng: debug: [2851] icache: [2851] started (instance 3 on CPU 2)
stress-ng: debug: [2852] icache: [2852] started (instance 4 on CPU 0)
stress-ng: debug: [2847] 8 stressors started
stress-ng: debug: [2854] icache: [2854] started (instance 6 on CPU 1)
stress-ng: debug: [2853] icache: [2853] started (instance 5 on CPU 1)
stress-ng: debug: [2855] icache: [2855] started (instance 7 on CPU 1)

^C //After pressing Enter continuously without effect, I pressed Ctrl+C, and the program then continued.

^Cstress-ng: debug: [2850] icache: [2850] exited (instance 2 on CPU 0)
stress-ng: debug: [2854] icache: [2854] exited (instance 6 on CPU 3)
stress-ng: debug: [2849] icache: [2849] exited (instance 1 on CPU 2)
stress-ng: debug: [2855] icache: [2855] exited (instance 7 on CPU 1)
stress-ng: debug: [2852] icache: [2852] exited (instance 4 on CPU 0)
stress-ng: debug: [2853] icache: [2853] exited (instance 5 on CPU 1)
stress-ng: debug: [2851] icache: [2851] exited (instance 3 on CPU 2)
stress-ng: debug: [2848] icache: [2848] exited (instance 0 on CPU 3)
stress-ng: debug: [2847] icache: [2848] terminated (success)
stress-ng: debug: [2847] icache: [2849] terminated (success)
stress-ng: debug: [2847] icache: [2850] terminated (success)
stress-ng: debug: [2847] icache: [2851] terminated (success)
stress-ng: debug: [2847] icache: [2852] terminated (success)
stress-ng: debug: [2847] icache: [2853] terminated (success)
stress-ng: debug: [2847] icache: [2854] terminated (success)
stress-ng: debug: [2847] icache: [2855] terminated (success)
stress-ng: debug: [2847] metrics-check: all stressor metrics validated and sane
stress-ng: info: [2847] skipped: 0
stress-ng: info: [2847] passed: 8: icache (8)
stress-ng: info: [2847] failed: 0
stress-ng: info: [2847] metrics untrustworthy: 0
stress-ng: info: [2847] successful run completed in 4.63 secs

ColinIanKing commented 1 month ago

So this can be one of two things:

  1. If config.h has HAVE_BUILTIN___CLEAR_CACHE defined, then there is an issue with __builtin___clear_cache().
  2. Otherwise, there is an issue with the cacheflush() system call when flushing the instruction cache.

I don't believe this is a stress-ng issue. I think this is an instruction cache flushing issue in the above function or system call.
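
One way to check the first case outside stress-ng is a tiny standalone program that does nothing but write to an anonymous page and call `__builtin___clear_cache()` on it. The sketch below is only an illustration under stated assumptions: the helper name `clear_cache_smoke_test` is hypothetical and not part of stress-ng, it assumes a GCC- or clang-compatible compiler, and it exercises only the builtin, not the cacheflush() system call.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE            /* for MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one anonymous page, then repeatedly modify
 * it and flush the instruction cache for that range with the compiler
 * builtin. Returns 0 if all rounds completed, -1 on setup failure.
 * A crash inside the loop would point at the builtin or the platform
 * rather than at the caller.
 */
static int clear_cache_smoke_test(int rounds)
{
    long page_size = sysconf(_SC_PAGESIZE);
    char *page;
    int i;

    if (page_size <= 0)
        return -1;

    page = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    for (i = 0; i < rounds; i++) {
        page[0] = (char)i;                        /* dirty the page */
        __builtin___clear_cache(page, page + page_size);
    }

    (void)munmap(page, (size_t)page_size);
    return 0;
}
```

Built with the same toolchain and flags as stress-ng (here, a static Android NDK build) and called from a small `main()`, a run that also dies with SIGSEGV would support the view that the problem lies in the builtin or the platform. A clean run would not rule out the cacheflush() path.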

KianTechHub commented 1 month ago

OK, I'm not an expert on the icache and I don't have any more findings. Anyway, thank you for your help and troubleshooting.

If more testing and verification is needed, I'm happy to cooperate.

ColinIanKing commented 1 month ago

I can only suggest we add some debug output to the cacheflush shim function to see what's happening. In source file core-shim.c, in function shim_cacheflush(), can you add the pr_inf() debug lines as shown below? The new debug code is after the /* Add debug ... */ comments:

#elif defined(HAVE_BUILTIN___CLEAR_CACHE)
        /* More portable builtin */
        (void)cache;

        /* Add debug clear cache call */
        pr_inf("__builtin___clear_cache(%p,%p)\n", (void *)addr, (void *)(addr + nbytes));
        __builtin___clear_cache((void *)addr, (void *)(addr + nbytes));
        return 0;
#elif defined(__NR_cacheflush) &&       \
      defined(HAVE_SYSCALL)
        /* potentially incorrect args, needs per-arch fixing */

        /* Add debug cacheflush call */
        pr_inf("cacheflush(%p,%d,%d)\n", (void *)addr, nbytes, cache);
        return (int)syscall(__NR_cacheflush, addr, nbytes, cache);
#else
        return (int)shim_enosys(0, addr, nbytes, cache);
#endif
ColinIanKing commented 3 weeks ago

Since this appears to be a kernel or arch specific issue and not a stress-ng issue, I'm going to close this. If you believe this is incorrect, please feel free to reopen this issue.