TinkerTools / tinker-hp

Tinker-HP: High-Performance Massively Parallel Evolution of Tinker on CPUs & GPUs
http://tinker-hp.org/

mmff94 - calculation stuck while executing verlet() #10

Closed: lflis closed this issue 2 years ago

lflis commented 2 years ago

Dear Tinker Team, one of our users reported a problem with stuck jobs. From the stdout perspective, the job is stuck while executing the verlet algorithm somewhere after step 200.

There are 24 CPUs executing on a single node in MPI mode. Backtraces: 22 of the ranks were waiting in the reduceen() phase, while 2 were stuck earlier during gradient() execution, inside the image() subroutine.

Thread 1 (Thread 0x14f59ebb6380 (LWP 2666218)):
#0  0x000014f595d8fe65 in uct_rc_mlx5_iface_progress_cyclic (arg=<optimized out>) at /usr/include/bits/rc_ep.h:114
#1  0x000014f5960464aa in ucs_callbackq_dispatch (cbq=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at core/ptr_map.inl:211
#2  uct_worker_progress (worker=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at /dev/shm/UCX/1.11.2/GCCcore-11.2.0/ucx-1.11.2/src/ucs/type/thread.h:2592
#3  ucp_worker_progress (worker=0x1927f1b0) at core/ucp_worker.c:2455
#4  0x000014f59fd512f4 in opal_progress () from /net/software/testing/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libopen-pal.so.40
#5  0x000014f5a079506f in ompi_request_default_wait () from /net/software/testing/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libmpi.so.40
#6  0x000014f5a07ef76f in ompi_coll_base_sendrecv_actual () from /net/software/testing/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libmpi.so.40
#7  0x000014f5a07f2d03 in ompi_coll_base_allreduce_intra_recursivedoubling () from /net/software/testing/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libmpi.so.40
#8  0x000014f595a9db02 in ompi_coll_tuned_allreduce_intra_dec_fixed () from /net/software/testing/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/openmpi/mca_coll_tuned.so
#9  0x000014f5a07a95ca in PMPI_Allreduce () from /net/software/testing/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libmpi.so.40
#10 0x000014f5a7ed8369 in pmpi_allreduce__ () from /net/software/testing/software/OpenMPI/4.1.2-intel-compilers-2021.4.0/lib/libmpi_mpifh.so.40
#11 0x000000000068f21b in reduceen (epot=4783.4852974129653) at mpistuff.f:1498
#12 0x00000000009e5710 in verlet (istep=5305, dt=0.001) at verlet.f:129
#13 0x00000000004175c6 in dynamic_bis () at dynamic.f:342
#14 0x0000000000417963 in dynamic () at dynamic.f:23
#15 0x00000000004135e2 in main ()
#16 0x000014f5a0018493 in __libc_start_main () from /lib64/libc.so.6
#17 0x00000000004134ee in _start ()
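
Most ranks show a trace like the one above, ending in PMPI_Allreduce. Since MPI_Allreduce only returns once every rank in the communicator has entered the collective, the 22 waiting ranks look like a symptom rather than the cause: any single rank that never reaches reduceen() stalls all the others there. A minimal, hypothetical sketch of that effect (not Tinker-HP code; program and variable names are made up):

program allreduce_hang_sketch
   use mpi
   implicit none
   integer :: ierr, rank
   double precision :: epot_local, epot_total
   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   epot_local = 1.0d0
   ! stand-in for a rank that never leaves image(): it spins forever
   if (rank == 0) then
      do
      end do
   end if
   ! every other rank blocks here until rank 0 also calls the collective,
   ! which is why most backtraces end in PMPI_Allreduce inside reduceen()
   call MPI_Allreduce(epot_local, epot_total, 1, MPI_DOUBLE_PRECISION, &
                      MPI_SUM, MPI_COMM_WORLD, ierr)
   call MPI_Finalize(ierr)
end program allreduce_hang_sketch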

The problematic tasks:

Thread 1 (Thread 0x153763cf2380 (LWP 2666236)):
#0  0x00000000004d109e in image (xr=-20843875562979260, yr=-2109035950655911, zr=2599181350530971) at image.f:30
#1  0x0000000000a9754a in ecreal1d () at echarge1.f:484
#2  0x0000000000abd285 in echarge1c () at echarge1.f:144
#3  0x0000000000ac3386 in echarge1 () at echarge1.f:26
#4  0x00000000004cd7ca in gradient (energy=3.1647808065992202e+24, derivs=<error reading variable: value requires 100272 bytes, which is more than max-value-size>) at gradient.f:150
#5  0x00000000009e569f in verlet (istep=5305, dt=0.001) at verlet.f:122
#6  0x00000000004175c6 in dynamic_bis () at dynamic.f:342
#7  0x0000000000417963 in dynamic () at dynamic.f:23
#8  0x00000000004135e2 in main ()
#9  0x0000153765154493 in __libc_start_main () from /lib64/libc.so.6
#10 0x00000000004134ee in _start ()

From the debug session we know that the program is looping in image(), lines 29-30:

(gdb) p xr
$1 = -20826228978081308
(gdb) p xcell
$2 = 142.57447518725951
(gdb) p xcell2
$3 = 71.287237593629754

The xr value does not seem to be valid.
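
Assuming image() wraps coordinates into the box with the usual while loop for an orthogonal cell (sketched below from memory, not copied from the Tinker-HP 1.2 sources), a coordinate difference of this magnitude would need on the order of 1e14 iterations to wrap, so the rank effectively never returns:

program image_wrap_sketch
   implicit none
   double precision :: xr, xcell, xcell2
   xcell  = 142.57447518725951d0     ! box length from the gdb session
   xcell2 = 0.5d0 * xcell
   xr     = -20826228978081308.0d0   ! corrupted coordinate from the gdb session
   ! each pass removes one box length; with |xr| ~ 2e16 and xcell ~ 1.4e2
   ! this takes on the order of 1e14 iterations, so the loop never finishes
   ! on any practical time scale
   do while (abs(xr) > xcell2)
      xr = xr - sign(xcell, xr)
   end do
   print *, 'wrapped xr =', xr
end program image_wrap_sketch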

Do you have any recommendation on where to look for the problem?

Version: 1.2; compilers: intel-compilers/2021.4.0; MKL 2021.4.0

-- Lukasz Flis

louislagardere commented 2 years ago

Hi, as you pointed out, there seems to be something going wrong with the periodic boundary conditions. Can you give me access to the associated input files so that I can run tests on my side?

Thanks, Louis