Open planetarianPKU opened 9 months ago
first, if you can, try to see if the devel branch version works - maybe this has been fixed already.
also, something I noticed recently is that the updated underlying MPI libraries crash when the code is compiled with MPI support (--with-mpi), but then run as a serial executable, ./bin/xspecfem3D, for single NPROC==1 simulations. if that is also your case, you will need to run it with the mpirun launcher around it, like
mpirun -np 1 ./bin/xspecfem3D
and the same for the other executables like xmeshfem3D and xgenerate_databases
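As a rough illustration of that advice, the sketch below prints one mpirun launch line per stage, so that a single-process run goes through the launcher as well. The helper name launch_cmd and the NPROC default are assumptions made for this example, not part of SPECFEM3D.

```shell
#!/bin/sh
# Sketch: build the launch line for each SPECFEM3D stage, always going through
# mpirun, even when NPROC is 1. launch_cmd is a hypothetical helper name.
launch_cmd() {
  # $1 = number of MPI processes, $2 = executable name under ./bin
  printf 'mpirun -np %s ./bin/%s\n' "$1" "$2"
}

NPROC="${NPROC:-1}"   # assumed default for a single-process run
for exe in xmeshfem3D xgenerate_databases xspecfem3D; do
  launch_cmd "$NPROC" "$exe"   # pipe the printed lines to sh to actually run them
done
```

Printing the commands first makes it easy to check the job script before submitting it.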
Dear Doc. Danielpeter:
Thank you for your very useful advice!
Following your advice, I compiled and ran the devel version you updated yesterday, and it works perfectly with no errors, which is good.
Then I recompiled my 2018 version of the code without the DEBUG flag, and it shows the errors again. I also checked my sbatch script, and I did run everything through mpirun:
mpirun -np $NPROC ./bin/xmeshfem3D
mpirun -np $NPROC ./bin/xgenerate_databases
mpirun -np $NPROC ./bin/xspecfem3D
There are MPI files from proc000000_XX to proc000003_XX in DATABASES_MPI, and I added many myrank prints in specfem3d.f90 to monitor the progress of each process. So I'm sure I ran the MPI program.
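The check described here (one set of proc files per rank) could be automated roughly as follows. This is only a sketch: count_ranks is a hypothetical helper, and it assumes the files are named procNNNNNN_ followed by a suffix, as in the listing above.

```shell
#!/bin/sh
# Sketch: count how many distinct MPI ranks left files in a database directory.
# count_ranks is a hypothetical helper; it assumes names of the form procNNNNNN_*.
count_ranks() {
  # $1 = directory to inspect, e.g. DATABASES_MPI
  ls "$1" 2>/dev/null \
    | sed -n 's/^\(proc[0-9][0-9]*\)_.*/\1/p' \
    | sort -u | wc -l | tr -d ' '
}

# Example: compare the count against the NPROC used for the run.
# [ "$(count_ranks DATABASES_MPI)" -eq "$NPROC" ] || echo "rank count mismatch"
```

A count equal to NPROC suggests that every rank wrote its databases, which is what the manual inspection above was confirming.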
Anyway, I will use the 2018 version in DEBUG mode, and maybe move my work to the new devel version in the future.
Jingnan
no Doc please, just daniel... - we can do a doctor-like surgical operation, described below, if you like :)
and thanks for the feedback. there can indeed be a problem in the old search_kdtree.f90 code for Intel compilers. as stated in the source code file:
! note: compiling with intel ifort version 18.0.1/19.1.0 and optimizations like -xHost -O2 or -xHost -O3 flags
! can lead to issues with the deallocate(workindex) statement below:
! *** Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x00000000024f1610 ***
!
! this might be due to a more aggressive optimization which leads to a change of the instruction set
! and the memory being free twice.
! a way to avoid this is by removing -xHost from FLAGS_CHECK = .. in Makefile
! or to use a pointer array instead of an allocatable array
!
! integer,dimension(:),allocatable :: workindex
it seems you're using an Intel compiler, so as it says, you can either try removing the -xHost flag, or replacing src/shared/search_kdtree.f90 with the new one from the recent devel branch, where this has been fixed.
happy coding :)
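A minimal sketch of the first workaround named in the source comment, dropping -xHost from the FLAGS_CHECK line. The helper name strip_xhost is an assumption for this example, and the sample flag line is shortened for illustration.

```shell
#!/bin/sh
# Sketch: drop -xHost from the FLAGS_CHECK line of a Makefile read on stdin.
# strip_xhost is a hypothetical helper name; other lines pass through unchanged.
strip_xhost() {
  sed '/^FLAGS_CHECK/ s/ *-xHost//'
}

# Example: preview the change without touching the file.
# strip_xhost < Makefile | grep '^FLAGS_CHECK'
printf 'FLAGS_CHECK = -xHost -O3 -check nobounds\n' | strip_xhost
# prints: FLAGS_CHECK = -O3 -check nobounds
```

Since make does not track flag changes, the object files would need to be rebuilt after editing the Makefile.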
Dear daniel:
Following your advice, I copied search_kdtree.f90 from the 2024 devel version into the 2018 master version, kept the -xHost flag when compiling, and it works.
FLAGS_CHECK = -xHost -fpe0 -ftz -assume buffered_io -assume byterecl -align sequence -std03 -diag-disable 6477 -implicitnone -gen-interfaces -warn all -O3 -check nobounds
Then I compiled the unchanged 2018 version without the -xHost flag, and it works too.
FLAGS_CHECK = -fpe0 -ftz -assume buffered_io -assume byterecl -align sequence -std03 -diag-disable 6477 -implicitnone -gen-interfaces -warn all -O3 -check nobounds
Code from both methods is equally fast, running in one-fifth of the time needed in DEBUG mode, and the snapshots and receiver waveforms are correct after checking. This saves most of the computing time and resolves a confusion that had been bothering me for a long time (there is no problem with the source code, and there seems to be no problem with my tiny modifications, so why do I get an error? Wait, why is the unmodified source code also reporting an error? When did I change it?). Anyway, that problem is fully solved now.
You are wonderful, and I truly appreciate your help.
Jingnan
Dear SPECFEM3D Team,
To say it up front:
I think the error may be caused by my cluster, not the code, because I have been using SPECFEM3D normally on this cluster for 3 years, and I recently compiled my code normally on another cluster. But since this is so bizarre, I'll record it. I also don't understand why it runs normally in DEBUG mode.
I have been using SPECFEM3D on large-scale clusters for 3 years and am familiar with the xspecfem3D forward simulation code. Recently, for various reasons, I needed to recompile my SPECFEM3D version 2018. I was surprised to find that after recompiling, the xspecfem3D program reported memory errors like the ones below. This almost never happened before, at least when I had not changed the code myself. I changed the division of the mesh for a test run and got:
Error in `./bin/xspecfem3D': free(): invalid next size (normal): 0x0000000001ed9380
Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x0000000002576ae0
======= Backtrace: =========
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2b3dc2198299]
./bin/xspecfem3D[0x6a2c60]
./bin/xspecfem3D[0x640609]
./bin/xspecfem3D[0x63f4c1]
./bin/xspecfem3D[0x5bad4c]
./bin/xspecfem3D[0x5ca605]
/lib64/libc.so.6(+0x81299)[0x2b01c5475299]
./bin/xspecfem3D[0x6a2c60]
./bin/xspecfem3D[0x640609]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x63f4c1]
./bin/xspecfem3D[0x5bad4c]
./bin/xspecfem3D[0x5ca605]
./bin/xspecfem3D[0x406062]
./bin/xspecfem3D[0x406062]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3dc2139555]
./bin/xspecfem3D[0x405f69]
or like this:
Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x00000000014cbcc0
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2ab901558299]
./bin/xspecfem3D[0x69edd0]
./bin/xspecfem3D[0x63c779]
./bin/xspecfem3D[0x63b631]
./bin/xspecfem3D[0x5ba6bc]
./bin/xspecfem3D[0x5c9890]
./bin/xspecfem3D[0x406062]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab9014f9555]
./bin/xspecfem3D[0x405f69]
======= Memory map: ========
00400000-007c3000 r-xp 00000000 00:28 12876224895012393020 /home/sunjn/software/graduation/2024/origin/specfem3d-master/EXAMPLES/meshfem3D_examples/test_C002/bin/xspecfem3D
009c2000-009c4000 r--p 003c2000 00:28 12876224895012393020 /home/sunjn/software/graduation/2024/origin/specfem3d-master/EXAMPLES/meshfem3D_examples/test_C002/bin/xspecfem3D
009c4000-009f1000 rw-p 003c4000 00:28 12876224895012393020 /home/sunjn/software/graduation/2024/origin/specfem3d-master/EXAMPLES/meshfem3D_examples/test_C002/bin/xspecfem3D
009f1000-00a93000 rw-p 00000000 00:00 0
0122a000-014e0000 rw-p 00000000 00:00 0 [heap]
2ab8ff26b000-2ab8ff28d000 r-xp 00000000 fd:00 1834 /usr/lib64/ld-2.17.so
2ab8ff28d000-2ab8ff297000 rw-p 00000000 00:00 0
2ab8ff297000-2ab8ff298000 rw-s 003f0000 00:05 43018 /dev/infiniband/uverbs4
2ab8ff298000-2ab8ff299000 rw-s 003f0000 00:05 43015 /dev/infiniband/uverbs1
2ab8ff299000-2ab8ff29a000 rw-s 003f0000 00:05 43016 /dev/infiniband/uverbs2
or like this:
xspecfem3D:73154 terminated with signal 11 at PC=63bdc9 SP=7ffdad54a840. Backtrace:
xspecfem3D:73152 terminated with signal 11 at PC=63bdc9 SP=7fff54347740. Backtrace:
Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x00000000012ec4c0
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2ab904fcc299]
./bin/xspecfem3D[0x69edd0]
./bin/xspecfem3D[0x63c779]
./bin/xspecfem3D[0x63b631]
./bin/xspecfem3D[0x5ba6bc]
./bin/xspecfem3D[0x5c9890]
./bin/xspecfem3D[0x406062]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab904f6d555]
./bin/xspecfem3D[0x405f69]
======= Memory map: ========
./bin/xspecfem3D[0x63bdc9]
./bin/xspecfem3D[0x63c8c2]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63b631]
./bin/xspecfem3D[0x5ba6bc]
To check why this happens, I first tested the 2018 and 2023 versions of the source code, and the results were similar. Then I added many print statements before and after each call in /src/specfem3D/xspecfem3D.f90 to monitor where it crashed. Finally, I found that the processes always crashed at
specfem3D.F90: call setup_sources_receivers().
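As a possible complement to print-based tracing, the frame addresses in the backtraces above can be collected and handed to addr2line from binutils. The sketch below only extracts the addresses. The helper name bt_addrs and the file name crash.log are assumptions for this example, and addr2line can report function names only if the binary still has its symbol table (line numbers additionally need a build with -g).

```shell
#!/bin/sh
# Sketch: pull the unique frame addresses of xspecfem3D out of a glibc backtrace
# read on stdin. bt_addrs is a hypothetical helper name.
bt_addrs() {
  grep -o 'xspecfem3D\[0x[0-9a-f]*\]' \
    | sed 's/.*\[\(0x[0-9a-f]*\)\]/\1/' \
    | sort -u
}

# Example: feed the addresses to addr2line to look up function names.
# crash.log is an assumed file holding the captured error output.
# addr2line -f -e ./bin/xspecfem3D $(bt_addrs < crash.log)
printf './bin/xspecfem3D[0x63c779] ./bin/xspecfem3D[0x63b631] ./bin/xspecfem3D[0x63c779]\n' | bt_addrs
```

Repeated frames, such as the recursive calls visible in the backtraces, collapse to a single address, which keeps the lookup short.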
At this point I was very confused, because the unchanged code also reported these errors. I thought it might be because the environment of my cluster has changed over these 3 years. After struggling in vain, I made one final attempt, which was to add a DEBUG flag when compiling:
configure:
./configure FC=/home/opt/intel2020u4/compilers_and_libraries_2020.4.304/linux/bin/intel64/ifort CC=icc MPIFC=mpiifort --with-mpi MPI_INC=/home/opt/intel2020u4/compilers_and_libraries_2020.4.304/linux/mpi/intel64/include
in Makefile:
DEBUG_COUPLED_FLAG = -check all -debug -g -fp-stack-check -traceback -ftrapuv -xHost -assume byterecl -assume buffered_io -mcmodel=medium -shared-intel
and the program miraculously returned to normal, and I still don't know why. Then I quickly changed some code to output the snapshots I wanted, and it works. I think the error may be caused by my cluster, not the code, because I have been using SPECFEM3D normally on this cluster for 3 years, and I recently compiled my code normally on another cluster. I also don't understand why it runs normally in DEBUG mode.
Now I continue to happily use SPECFEM3D, in DEBUG mode. I am writing this to share my recent experience of using SPECFEM3D on my cluster.
Regards
Jingnan Sun
planetarian@pku.edu.cn