IO500 / io500

IO500 Storage Benchmark source code
MIT License
98 stars 31 forks source link

Segmentation fault while trying to run single phase test #23

Closed mrashid2 closed 3 years ago

mrashid2 commented 3 years ago

I am facing this error in the scenario where I tried to run only a single phase of the test. For example, when I tried to run only ior-easy phase(both read and write) what I did is that for other phases of the benchmark test except ior-easy, I have set the noRun variables to TRUE. After executing the command (stonewall-time set to 90 seconds)to run the benchmark, I got the following error:

ERROR INVALID (src/phase_dbg.c)stonewall-time != 300s
IO500 version io500-sc20_v3-6-gd25ea80d54c7
[RESULT-invalid]       ior-easy-write        0.346598 GiB/s : time 90.190 seconds
[RESULT-invalid]    mdtest-easy-write        0.000000 kIOPS : time 0.000 seconds
ERROR INVALID (src/main.c)Runtime of phase (0.000056) is below stonewall time. This shouldn't happen!
[RESULT-invalid]       ior-hard-write        0.000000 GiB/s : time 0.000 seconds
ERROR INVALID (src/main.c)Runtime of phase (0.000054) is below stonewall time. This shouldn't happen!
[RESULT-invalid]    mdtest-hard-write        0.000000 kIOPS : time 0.000 seconds
ERROR INVALID (src/main.c)Runtime of phase (0.000045) is below stonewall time. This shouldn't happen!
[RESULT-invalid]                 find        0.000000 kIOPS : time 0.000 seconds
[RESULT]        ior-easy-read        2.520806 GiB/s : time 12.406 seconds
[RESULT-invalid]     mdtest-easy-stat        0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid]        ior-hard-read        0.000000 GiB/s : time 0.000 seconds
[RESULT-invalid]     mdtest-hard-stat        0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid]   mdtest-easy-delete        0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid]     mdtest-hard-read        0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid]   mdtest-hard-delete        0.000000 kIOPS : time 0.000 seconds
[SCORE-invalid] Bandwidth 0.000000 GiB/s : IOPS 0.000000 kiops : TOTAL 0.000000

The result files are stored in the directory: ./results/2021.04.20-02.44.45
Segmentation fault

After setting the verbosity level to 10 in the configuration, following logs were reported:

; [I] Creating dir ./results/2021.04.20-02.40.07/
; [I] Creating dir ./datafiles/2021.04.20-02.40.07/
ERROR INVALID (src/phase_dbg.c)stonewall-time != 300s
; [I] Creating dir ./datafiles/2021.04.20-02.40.07/ior-easy/
; [I] Creating dir ./datafiles/2021.04.20-02.40.07/ior-hard/
IO500 version io500-sc20_v3-6-gd25ea80d54c7
[RESULT-invalid]       ior-easy-write        0.346252 GiB/s : time 90.180 seconds
[RESULT-invalid]    mdtest-easy-write        0.000000 kIOPS : time 0.000 seconds
ERROR INVALID (src/main.c)Runtime of phase (0.000081) is below stonewall time. This shouldn't happen!
[RESULT-invalid]       ior-hard-write        0.000000 GiB/s : time 0.000 seconds
ERROR INVALID (src/main.c)Runtime of phase (0.000037) is below stonewall time. This shouldn't happen!
[RESULT-invalid]    mdtest-hard-write        0.000000 kIOPS : time 0.000 seconds
ERROR INVALID (src/main.c)Runtime of phase (0.000031) is below stonewall time. This shouldn't happen!
[RESULT-invalid]                 find        0.000000 kIOPS : time 0.000 seconds
[RESULT]        ior-easy-read        2.516911 GiB/s : time 12.410 seconds
[RESULT-invalid]     mdtest-easy-stat        0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid]        ior-hard-read        0.000000 GiB/s : time 0.000 seconds
[RESULT-invalid]     mdtest-hard-stat        0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid]   mdtest-easy-delete        0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid]     mdtest-hard-read        0.000000 kIOPS : time 0.000 seconds
[RESULT-invalid]   mdtest-hard-delete        0.000000 kIOPS : time 0.000 seconds
; [I]  MD = (0.00000000 * 0.00000000 * 0.00000000 * 0.00000000 * 0.00000000 * 0.00000000 * 0.00000000 * 0.00000000)^0.125000
; [I]  BW = (0.34625199 * 0.00000000 * 2.51691134 * 0.00000000)^0.250000
[SCORE-invalid] Bandwidth 0.000000 GiB/s : IOPS 0.000000 kiops : TOTAL 0.000000

The result files are stored in the directory: ./results/2021.04.20-02.40.07
; [I] Removing file ./datafiles/2021.04.20-02.40.07/ior-easy/ior_file_easy.00000000
; [I] Removing dir ./datafiles/2021.04.20-02.40.07/ior-easy
; [I] Removing dir ./datafiles/2021.04.20-02.40.07/mdtest-easy
; [I] Removing file ./datafiles/2021.04.20-02.40.07/ior-hard/file
Fatal error: glibc detected an invalid stdio handle
[node0:157537] *** Process received signal ***
[node0:157537] Signal: Aborted (6)
[node0:157537] Signal code:  (-6)
[node0:157537] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7fd14514e630]
[node0:157537] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7fd144da7387]
[node0:157537] [ 2] /lib64/libc.so.6(abort+0x148)[0x7fd144da8a78]
[node0:157537] [ 3] /lib64/libc.so.6(+0x78ed7)[0x7fd144de9ed7]
[node0:157537] [ 4] /lib64/libc.so.6(__libc_fatal+0x1e)[0x7fd144de9fbe]
[node0:157537] [ 5] /lib64/libc.so.6(+0x79333)[0x7fd144dea333]
[node0:157537] [ 6] /lib64/libc.so.6(fflush+0xf8)[0x7fd144ddf718]
[node0:157537] [ 7] ./io500[0x405f7d]
[node0:157537] [ 8] ./io500[0x409a4e]
[node0:157537] [ 9] ./io500[0x405bff]
[node0:157537] [10] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fd144d93555]
[node0:157537] [11] ./io500[0x403c09]
[node0:157537] *** End of error message ***
Aborted
JulianKunkel commented 3 years ago

Thanks for bug reporting. As we will release the new version soon, I won't backport the fix. I believe the following will resolve the issue (and will be part of the new release).

diff --git a/src/main.c b/src/main.c
index 93c2dca..dd53b42 100644
--- a/src/main.c
+++ b/src/main.c
@@ -139,8 +139,12 @@ static double calc_score(double scores[IO500_SCORE_LAST], int extended){
         p += sprintf(p, "%.8f", t);
       }
     }
-    DEBUG_INFO("%s)^%f\n", score_string, 1.0/numbers);
-    score = pow(score, 1.0/numbers);
+    if(numbers == 0){
+      score = 0;
+    }else{      
+      DEBUG_INFO("%s)^%f\n", score_string, 1.0/numbers);
+      score = pow(score, 1.0/numbers);
+    }
     scores[g] = score;
     PRINT_PAIR(io500_phase_str[g], "%f\n", score);
mrashid2 commented 3 years ago

The issue seems not to be producing anymore. Thanks for looking into the issue. It seems that https://github.com/IO500/io500 is not the most the up-to-date repo. However, in the IO500 project page (https://www.vi4io.org/std/io500/start#communication_contribution), it points to the IO500/io500 repo as the source code. Whereas one can get the most up-to-date code from VI4IO/io500. In my opinion, this is kind of inconsistent.

One request, if you have the most updated summary of what the new phase of the benchmark do (i.e. ior-rnd, mdworkbench), please share with me.

ghost commented 3 years ago

I also ran into the "single-phase, segfault" issue, and tracked it slightly further. For me, it was exhibiting classic memory corruption symptoms - sometimes segfaulting, sometimes hanging, etc.

After running through valgind, I discovered that logging is causing a call to fprintf() and passing a stale file pointer.

Basically, in this case, ior-easy runs and creates a log file. It passes the FILE* for this log file to ior_run(), which sets a global variable out_logfile within IOR to this file pointer. After ior_run() finishes, ior-easy then closes that file handle when it is done, but IOR still has a copy of the FILE* in it's out_logfile global variable.

During cleanup, all phases get their cleanup called. During ior-hard's cleanup, it tries to delete files which were never created (because the phase never ran). This causes IOR's POSIX_Delete() method to attempt to print a warning to the out_logfile file handle, which is no longer valid, causing the memory corruption.

Below is some valgrind output:

==23302== Invalid read of size 4
==23302==    at 0x87BB2BB: __fprintf_chk (fprintf_chk.c:30)
==23302==    by 0x17F33D: fprintf (stdio2.h:97)
==23302==    by 0x17F33D: POSIX_Delete (aiori-POSIX.c:645)
==23302==    by 0x1109ED: u_purge_file (util.c:68)
==23302==    by 0x114D62: cleanup (phase_ior_hard.c:39)
==23302==    by 0x1105DB: main (main.c:414)
==23302==  Address 0xbaeb120 is 0 bytes inside a block of size 552 free'd
==23302==    at 0x4C32F78: free (vg_replace_malloc.c:530)
==23302==    by 0x87073DD: fclose@@GLIBC_2.2.5 (iofclose.c:77)
==23302==    by 0x111506: u_res_file_close (util.c:244)
==23302==    by 0x114930: ior_process_read (phase_ior.c:31)
==23302==    by 0x114610: run (phase_ior_easy_read.c:50)
==23302==    by 0x10FBF2: main (main.c:333)
==23302==  Block was alloc'd at
==23302==    at 0x4C31DCB: malloc (vg_replace_malloc.c:299)
==23302==    by 0x8707F29: __fopen_internal (iofopen.c:65)
==23302==    by 0x8707F29: fopen@@GLIBC_2.2.5 (iofopen.c:89)
==23302==    by 0x11146A: u_res_file_prep (util.c:234)
==23302==    by 0x1145F2: run (phase_ior_easy_read.c:49)
==23302==    by 0x10FBF2: main (main.c:333)
JulianKunkel commented 3 years ago

Hi, it appears you are running the io500-sc20 branch (or older), as this would fit the line numbers you are reporting. The bug you mentioned had been fixed in the current main. Is there a reason to run the old code?

ghost commented 3 years ago

My apologies, you are correct. It does appear to be fixed from the main branch.

I'm using the io500-sc20 branch, as that is the branch specifically called out to use from the io500 website's Submission Rules page: https://io500.org/rules-submission. The prose and the example line are in conflict.

  1. Submissions are made using the latest version of the IO500 application in GitHub and all binaries should be built according to the included build instructions.

    $ git clone https://github.com/IO500/io500.git -b io500-sc20