cburstedde / p4est

The "p4est" forest-of-octrees library
www.p4est.org/
GNU General Public License v2.0

Ubuntu 24.04 with libmpich-dev fails test_loadsave2 #304

Closed scivision closed 4 months ago

scivision commented 6 months ago

The "test_loadsave2" test passes on Ubuntu 24.04 with OpenMPI, and on other operating systems (Ubuntu 22.04, macOS, etc.) with either OpenMPI or MPICH.

However, as discovered in #303 and confirmed on a laptop running Ubuntu 24.04, "test_loadsave2" fails with MPICH and GCC-12, GCC-13, or GCC-14.

Zlib and MPICH were enabled/used.

Is this just a flaky test, or is a code update needed?

$ mpiexec -n 2 test/test_loadsave2

[libsc] This is libsc 2.8.6.999
[libsc] CPP                      /usr/bin/mpicc -E
[libsc] CPPFLAGS
[libsc] CC                       /usr/bin/mpicc
[libsc] CFLAGS                    -flto=auto;-ffat-lto-objects
[libsc] LDFLAGS                  -Wl,-Bsymbolic-functions
[libsc] LIBS                     /usr/lib/x86_64-linux-gnu/libz.so m
[p4est] This is p4est 0.0.0
[p4est] CPP                      /usr/bin/mpicc -E
[p4est] CPPFLAGS
[p4est] CC                       /usr/bin/mpicc
[p4est] CFLAGS                    -flto=auto;-ffat-lto-objects
[p4est] LDFLAGS                  -Wl,-Bsymbolic-functions
[p4est] LIBS                        m
[p4est] Using file names p4est.p4c and p4est.p4p
[p4est] Into p4est_new with min quadrants 0 level 0 uniform 0
[p4est] New p4est with 6 trees on 1 processors
[p4est] Initial level 0 potential global quadrants 6 per tree 1
[p4est] Done p4est_new with 6 total quadrants
[p4est] Into p4est_refine with 6 total quadrants, allowed level 29
[p4est] Done p4est_refine with 116163 total quadrants
[p4est] Into p4est_inflate
[p4est] Done p4est_inflate
[libsc] This is libsc 2.8.6.999
[libsc] CPP                      /usr/bin/mpicc -E
[libsc] CPPFLAGS
[libsc] CC                       /usr/bin/mpicc
[libsc] CFLAGS                    -flto=auto;-ffat-lto-objects
[libsc] LDFLAGS                  -Wl,-Bsymbolic-functions
[libsc] LIBS                     /usr/lib/x86_64-linux-gnu/libz.so m
[p4est] This is p4est 0.0.0
[p4est] CPP                      /usr/bin/mpicc -E
[p4est] CPPFLAGS
[p4est] CC                       /usr/bin/mpicc
[p4est] CFLAGS                    -flto=auto;-ffat-lto-objects
[p4est] LDFLAGS                  -Wl,-Bsymbolic-functions
[p4est] LIBS                        m
[p4est] Using file names p4est.p4c and p4est.p4p
[p4est] Into p4est_new with min quadrants 0 level 0 uniform 0
[p4est] New p4est with 6 trees on 1 processors
[p4est] Initial level 0 potential global quadrants 6 per tree 1
[p4est] Done p4est_new with 6 total quadrants
[p4est] Into p4est_refine with 6 total quadrants, allowed level 29
[p4est] Into p4est_save p4est.p4p
[p4est] Done p4est_refine with 116163 total quadrants
[p4est] Into p4est_inflate
[p4est] Done p4est_inflate
[p4est] Into p4est_save p4est.p4p
[p4est] Done p4est_save
[p4est] Into p4est_load p4est.p4p
[libsc 0] Abort: invalid format
[libsc 0] Abort: /home/box/code_other/p4est/src/p4est.c:3783
[libsc 0] Abort: Obtained 9 stack frames
[libsc 0] Stack 0: test_loadsave2(+0x792d) [0x561c4fda292d]
[libsc 0] Stack 1: test_loadsave2(+0x3f3b) [0x561c4fd9ef3b]
[libsc 0] Stack 2: test_loadsave2(+0x7ac2) [0x561c4fda2ac2]
[libsc 0] Stack 3: test_loadsave2(+0x17aa3) [0x561c4fdb2aa3]
[libsc 0] Stack 4: test_loadsave2(+0x1189c) [0x561c4fdac89c]
[libsc 0] Stack 5: test_loadsave2(+0x2ecc) [0x561c4fd9decc]
[libsc 0] Stack 6: libc.so.6(+0x2a1ca) [0x7f82368301ca]
[libsc 0] Stack 7: libc.so.6(__libc_start_main+0x8b) [0x7f823683028b]
[libsc 0] Stack 8: test_loadsave2(+0x33e5) [0x561c4fd9e3e5]
[p4est] Done p4est_save
[p4est] Into p4est_load p4est.p4p
[libsc 0] Abort: invalid format
[libsc 0] Abort: /home/box/code_other/p4est/src/p4est.c:3783
[libsc 0] Abort: Obtained 9 stack frames
[libsc 0] Stack 0: test_loadsave2(+0x792d) [0x55da28c9b92d]
[libsc 0] Stack 1: test_loadsave2(+0x3f3b) [0x55da28c97f3b]
[libsc 0] Stack 2: test_loadsave2(+0x7ac2) [0x55da28c9bac2]
[libsc 0] Stack 3: test_loadsave2(+0x17aa3) [0x55da28cabaa3]
[libsc 0] Stack 4: test_loadsave2(+0x1189c) [0x55da28ca589c]
[libsc 0] Stack 5: test_loadsave2(+0x2ecc) [0x55da28c96ecc]
[libsc 0] Stack 6: libc.so.6(+0x2a1ca) [0x7fe7b62431ca]
[libsc 0] Stack 7: libc.so.6(__libc_start_main+0x8b) [0x7fe7b624328b]
[libsc 0] Stack 8: test_loadsave2(+0x33e5) [0x55da28c973e5]
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

cburstedde commented 6 months ago

Thanks for the report, pinging @tim-griesbach to check if there may be any uninitialized value problem in loading.

tim-griesbach commented 5 months ago

> Thanks for the report, pinging @tim-griesbach to check if there may be any uninitialized value problem in loading.

Since I do not have access to a machine running Ubuntu 24.04, I cannot reproduce the error locally. Nonetheless, I checked test_loadsave2 locally using valgrind, and the program is valgrind clean on my machine.

Hence, I investigated the problem using the CI and found that the md5sum of p4est.p4p differs between re-runs in the CI for Ubuntu 24.04 with libmpich-dev. Moreover, I printed the results of MPI_File_get_position in save_ext, and these also differ between re-runs. However, for some runs the positions are correct, but then the program crashes due to another problem (a segfault).
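The re-run comparison described above can be sketched as a small script. This is only an illustration, not the procedure actually used in the CI: the helper name `md5_after_runs` and its parameters are hypothetical, and it assumes the test is started from the build directory and writes p4est.p4p into the current directory, as the log in the report suggests.

```python
import hashlib
import subprocess
from pathlib import Path


def md5_after_runs(cmd, output_file, runs=5):
    """Run cmd repeatedly and return the MD5 hex digest of output_file after each run.

    The exit status of cmd is ignored on purpose: the test may abort after the
    file has already been written. A run that leaves no file is recorded as None.
    """
    path = Path(output_file)
    digests = []
    for _ in range(runs):
        # Remove any stale file so each digest reflects this run only.
        path.unlink(missing_ok=True)
        subprocess.run(
            cmd,
            check=False,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if path.exists():
            digests.append(hashlib.md5(path.read_bytes()).hexdigest())
        else:
            digests.append(None)
    return digests
```

Calling `md5_after_runs(["mpiexec", "-n", "2", "test/test_loadsave2"], "p4est.p4p")` and checking whether `len(set(result)) == 1` would then show whether the saved file is identical across runs on a given machine.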

Given my current observations, the issue seems to be caused by strange MPI behavior, but I am not sure what causes MPICH to behave this way.

Due to the issue described in https://github.com/cburstedde/libsc/issues/191, I cannot use valgrind in the CI in combination with Ubuntu 24.04. @scivision Since you were able to reproduce the CI error locally, can you run test_loadsave2 with valgrind?

cburstedde commented 5 months ago

What happens if we remove the gcc version numbers from the CI and use whatever is the default for ubuntu-22/24/latest?

cburstedde commented 4 months ago

It seems to be fine again. In fact, test_loadsave is the one test that trips MPI I/O issues reliably. I have seen it fail transiently many times over the years.

Still pinging @tim-griesbach for double-checking that this does not have to do with the recent merge on saving a p4est in a more standard-conforming way wrt. libc I/O.

tim-griesbach commented 4 months ago

> Still pinging @tim-griesbach for double-checking that this does not have to do with the recent merge on saving a p4est in a more standard-conforming way wrt. libc I/O.

Yes, I double-checked the recent changes in saving a p4est. They do not change the md5sum of the created file. I also compared the file positions used for writing and reading, and these do not change with the more standard-conforming code either. Therefore, the two issues causing the test to fail (cf. my report above) are not caused by the recent code changes.
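When two saved files are expected to match, as in the before/after comparison above, a byte-level check can complement the md5sum: if the digests ever differ, the offset of the first mismatch hints at which part of the file is affected. The following is a minimal sketch of that idea, not the method used in this thread; `first_difference` is a hypothetical helper, and the two paths would be copies of p4est.p4p kept from separate runs or builds.

```python
from pathlib import Path


def first_difference(path_a, path_b):
    """Return the offset of the first differing byte, or None if the files match.

    If one file is a strict prefix of the other, the length of the shorter
    file is returned, since that is where the contents stop agreeing.
    """
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    for offset, (byte_a, byte_b) in enumerate(zip(a, b)):
        if byte_a != byte_b:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

Reading both files fully into memory keeps the sketch short; for very large forests a chunked comparison would be the more careful choice.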

cburstedde commented 4 months ago

Closing as not-a-bug.