[Memory access related errors] Error in `./bin/xspecfem3D': double free or corruption (!prev)

planetarianPKU commented 9 months ago

Dear SPECFEM3D Team,

A note up front:

I think the error may be caused by my cluster rather than the code, because I have been using SPECFEM3D normally on this cluster for 3 years and recently compiled my code without problems on another cluster. But since this is so bizarre, I'll record it. I also don't understand why it runs normally in DEBUG mode.

I have been using SPECFEM3D on large-scale clusters for 3 years and am familiar with the xspecfem3D forward simulation code. Recently I needed to recompile my 2018 version of SPECFEM3D. I was surprised to find that the recompiled xspecfem3D reported memory errors like the ones below. This almost never happened before, at least not with code I had not changed myself. I changed the mesh partitioning for a test run and got:

Error in `./bin/xspecfem3D': free(): invalid next size (normal): 0x0000000001ed9380
Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x0000000002576ae0
======= Backtrace: =========
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2b3dc2198299]
./bin/xspecfem3D[0x6a2c60]
./bin/xspecfem3D[0x640609]
./bin/xspecfem3D[0x63f4c1]
./bin/xspecfem3D[0x5bad4c]
./bin/xspecfem3D[0x5ca605]
/lib64/libc.so.6(+0x81299)[0x2b01c5475299]
./bin/xspecfem3D[0x6a2c60]
./bin/xspecfem3D[0x640609]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x63f4c1]
./bin/xspecfem3D[0x5bad4c]
./bin/xspecfem3D[0x5ca605]
./bin/xspecfem3D[0x406062]
./bin/xspecfem3D[0x406062]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3dc2139555]
./bin/xspecfem3D[0x405f69]

or like this:

Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x00000000014cbcc0
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2ab901558299]
./bin/xspecfem3D[0x69edd0]
./bin/xspecfem3D[0x63c779]
./bin/xspecfem3D[0x63b631]
./bin/xspecfem3D[0x5ba6bc]
./bin/xspecfem3D[0x5c9890]
./bin/xspecfem3D[0x406062]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab9014f9555]
./bin/xspecfem3D[0x405f69]
======= Memory map: ========
00400000-007c3000 r-xp 00000000 00:28 12876224895012393020 /home/sunjn/software/graduation/2024/origin/specfem3d-master/EXAMPLES/meshfem3D_examples/test_C002/bin/xspecfem3D
009c2000-009c4000 r--p 003c2000 00:28 12876224895012393020 /home/sunjn/software/graduation/2024/origin/specfem3d-master/EXAMPLES/meshfem3D_examples/test_C002/bin/xspecfem3D
009c4000-009f1000 rw-p 003c4000 00:28 12876224895012393020 /home/sunjn/software/graduation/2024/origin/specfem3d-master/EXAMPLES/meshfem3D_examples/test_C002/bin/xspecfem3D
009f1000-00a93000 rw-p 00000000 00:00 0
0122a000-014e0000 rw-p 00000000 00:00 0 [heap]
2ab8ff26b000-2ab8ff28d000 r-xp 00000000 fd:00 1834 /usr/lib64/ld-2.17.so
2ab8ff28d000-2ab8ff297000 rw-p 00000000 00:00 0
2ab8ff297000-2ab8ff298000 rw-s 003f0000 00:05 43018 /dev/infiniband/uverbs4
2ab8ff298000-2ab8ff299000 rw-s 003f0000 00:05 43015 /dev/infiniband/uverbs1
2ab8ff299000-2ab8ff29a000 rw-s 003f0000 00:05 43016 /dev/infiniband/uverbs2

or like this:

xspecfem3D:73154 terminated with signal 11 at PC=63bdc9 SP=7ffdad54a840. Backtrace:

xspecfem3D:73152 terminated with signal 11 at PC=63bdc9 SP=7fff54347740. Backtrace:
Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x00000000012ec4c0
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2ab904fcc299]
./bin/xspecfem3D[0x69edd0]
./bin/xspecfem3D[0x63c779]
./bin/xspecfem3D[0x63b631]
./bin/xspecfem3D[0x5ba6bc]
./bin/xspecfem3D[0x5c9890]
./bin/xspecfem3D[0x406062]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab904f6d555]
./bin/xspecfem3D[0x405f69]
======= Memory map: ========
./bin/xspecfem3D[0x63bdc9]
./bin/xspecfem3D[0x63c8c2]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63b631]
./bin/xspecfem3D[0x5ba6bc]

To find out why this happens, I first tested both the 2018 and 2023 versions of the source code; the results were similar. Then I added many print statements before and after each call in src/specfem3D/specfem3D.F90 to monitor where it crashed. Finally I found that the processes always crashed at

specfem3D.F90: call setup_sources_receivers()
setup_sources_receivers.F90: call setup_search_kdtree()
/shared/search_kdtree.f90: call create_kdtree(npoints,points_data,points_index,kdtree, &
                                              depth,1,npoints,numnodes,maxdepth)

where create_kdtree is a recursively defined subroutine:

recursive subroutine create_kdtree(npoints,points_data,points_index,node, &
                                   depth,ibound_lower,ibound_upper,numnodes,maxdepth)
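As a possible shortcut to the print-statement search, the raw addresses in a glibc backtrace can usually be mapped back to function names with `addr2line` from GNU binutils. The following is only a minimal sketch under stated assumptions: the executable was built with `-g` (otherwise `addr2line` prints `??`), it is loaded at a fixed address (the memory map above shows it at 00400000), and the job's stderr was saved to a file. The helper name `extract_frames` and the file name `crash.log` are hypothetical.

```shell
#!/bin/sh
# Sketch: pull the frame addresses that belong to one executable out of a
# glibc "======= Backtrace: =========" dump read from stdin.
# extract_frames is a hypothetical helper name, not part of SPECFEM3D.
extract_frames() {
  exe=$1   # executable path exactly as it is printed in the backtrace
  # Keep only "<exe>[0x...]" tokens, then strip everything but the address.
  grep -o "${exe}\[0x[0-9a-f]*]" | sed 's/.*\[\(0x[0-9a-f]*\)]/\1/'
}

# Assumed usage (crash.log is a hypothetical file holding the job's stderr):
#   extract_frames ./bin/xspecfem3D < crash.log | sort -u \
#     | xargs addr2line -f -e ./bin/xspecfem3D
```

Frames from libc are skipped on purpose, since only the addresses inside the application binary can be resolved against it. A run of identical addresses, as in the dumps above, is what a recursive routine would be expected to produce.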

At this point I was very confused, because code that had not been changed was also reporting these errors. I thought it might be because the environment of my cluster had changed over these 3 years. After struggling in vain, I made one final attempt, which was to add debug flags when compiling:

configure:

./configure FC=/home/opt/intel2020u4/compilers_and_libraries_2020.4.304/linux/bin/intel64/ifort CC=icc MPIFC=mpiifort --with-mpi MPI_INC=/home/opt/intel2020u4/compilers_and_libraries_2020.4.304/linux/mpi/intel64/include

in Makefile:

DEBUG_COUPLED_FLAG = -check all -debug -g -fp-stack-check -traceback -ftrapuv -xHost -assume byterecl -assume buffered_io -mcmodel=medium -shared-intel

With that, the program miraculously returned to normal, and I still don't know why. Then I quickly changed some code to output the snapshots I wanted, and it works. I think the error may be caused by my cluster rather than the code, because I have been using SPECFEM3D normally on this cluster for 3 years and recently compiled my code without problems on another cluster. I also don't understand why it runs normally in DEBUG mode.

Now I happily continue to use SPECFEM3D, in DEBUG mode. I am writing this to share my recent experience using SPECFEM3D on my cluster.

Regards

Jingnan Sun

planetarian@pku.edu.cn

danielpeter commented 9 months ago

first, if you can, try to see if the devel branch version works - maybe this has been fixed already.

also, something I noticed recently is that updated MPI libraries can crash when the code is compiled with MPI support (--with-mpi) but then run as a serial executable ./bin/xspecfem3D for single-process (NPROC == 1) simulations. if that is also your case, you will need to launch it through mpirun, like

mpirun -np 1 ./bin/xspecfem3D

and same for the other executables like xmeshfem3D and xgenerate_databases

planetarianPKU commented 9 months ago

Dear Doc. Danielpeter:

Thank you for your very useful advice!

Following your advice, I compiled and ran the devel version you updated yesterday, and it works perfectly with no errors.

Then I recompiled my 2018 version again without the debug flags, and the errors appeared again. I checked my sbatch script, and I did launch everything with mpirun:

mpirun -np $NPROC ./bin/xmeshfem3D
mpirun -np $NPROC ./bin/xgenerate_databases
mpirun -np $NPROC ./bin/xspecfem3D

There are MPI files from proc000000_XX to proc000003_XX in DATABASES_MPI, and I added many myrank prints in specfem3D.F90 to monitor the progress of each process. So I'm sure I ran it as an MPI program.
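The file check described here can be scripted. This is a rough sketch only, assuming the per-rank files follow the `procNNNNNN_*` naming mentioned above; the helper name `count_ranks` is hypothetical, and the directory to pass in is whatever your setup uses. It confirms how many partitions were written, not that the solver itself ran under MPI.

```shell
#!/bin/sh
# Sketch: count the distinct MPI ranks that left procNNNNNN_* files in a
# database directory. count_ranks is a hypothetical helper name.
count_ranks() {
  # Print the six-digit rank of every matching file, de-duplicate, count.
  ls "$1" | sed -n 's/^proc\([0-9]\{6\}\)_.*/\1/p' | sort -u | wc -l | tr -d ' '
}

# Assumed usage inside a job script where NPROC is already set:
#   [ "$(count_ranks DATABASES_MPI)" -eq "$NPROC" ] || echo "rank count mismatch"
```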

Anyway, I will use the 2018 version in DEBUG mode and maybe move my work to the new devel version in the future.

Jingnan

danielpeter commented 9 months ago

no Doc please, just daniel... - we can do a doctor-like surgical operation, described below, if you like :)

and thanks for the feedback. there can indeed be a problem in the old search_kdtree.f90 code for Intel compilers. as stated in the source code file:

  ! note: compiling with intel ifort version 18.0.1/19.1.0 and optimizations like -xHost -O2 or -xHost -O3 flags
  !       can lead to issues with the deallocate(workindex) statement below:
  !         *** Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x00000000024f1610 ***
  !
  !       this might be due to a more aggressive optimization which leads to a change of the instruction set
  !       and the memory being free twice.
  !       a way to avoid this is by removing -xHost from FLAGS_CHECK = .. in Makefile
  !       or to use a pointer array instead of an allocatable array
  !
  ! integer,dimension(:),allocatable :: workindex

it seems you're using an Intel compiler, so as it says, you can either

- compile without the -xHost flag, or
- do a surgical operation and replace your old version of src/shared/search_kdtree.f90 with the new one from the recent devel branch, where this has been fixed
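For the first option, the flag can be removed by hand or with a one-line `sed` filter. The sketch below assumes the Makefile contains a single line starting with `FLAGS_CHECK =`, as the source comment quoted above suggests; the helper name `strip_xhost` is hypothetical.

```shell
#!/bin/sh
# Sketch: drop -xHost from the FLAGS_CHECK line of a Makefile read on stdin,
# leaving every other line untouched. strip_xhost is a hypothetical helper.
strip_xhost() {
  sed '/^FLAGS_CHECK[[:space:]]*=/ s/[[:space:]]*-xHost//'
}

# Assumed usage, run from the SPECFEM3D root before rebuilding:
#   cp Makefile Makefile.bak
#   strip_xhost < Makefile.bak > Makefile
```

Since the Makefile is generated by configure, an edit like this would presumably need to be repeated after re-running configure.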

happy coding :)

planetarianPKU commented 9 months ago

Dear daniel:

Following your advice, I copied search_kdtree.f90 from the 2024 devel version into the 2018 master version, kept the -xHost flag when compiling, and it works.

FLAGS_CHECK = -xHost -fpe0 -ftz -assume buffered_io -assume byterecl -align sequence -std03 -diag-disable 6477 -implicitnone -gen-interfaces -warn all -O3 -check nobounds

Then I compiled the unchanged 2018 version without the -xHost flag, and it works too.

FLAGS_CHECK = -fpe0 -ftz -assume buffered_io -assume byterecl -align sequence -std03 -diag-disable 6477 -implicitnone -gen-interfaces -warn all -O3 -check nobounds

Both builds are equally fast, running in one-fifth of the time needed in DEBUG mode, and the snapshots and receiver waveforms are correct after checking. This saves a lot of time and resolves a confusion that had been bothering me for a long time (there is no problem with the source code, and there seems to be no problem with my tiny modifications, so why do I get an error? Wait, why is the unmodified source code also reporting an error? When did I change it?). Anyway, the problem is fully solved now.

you are very gorgeous and I truly appreciate your help.

Jingnan

SPECFEM / specfem3d

[Memory access related errors] Error in `./bin/xspecfem3D': double free or corruption (!prev) #1674