ROCm / ROCm-OpenCL-Runtime

ROCm OpenCL Runtime

possible deadlock in clFinish #58

Open pszi1ard opened 6 years ago

pszi1ard commented 6 years ago

GROMACS runs that previously completed without issue now stall and fail to complete since the last ROCm update. Symptoms: with small inputs that run for ~100s of microseconds per iteration (one clFinish per iteration), the run stalls after a few thousand to tens of thousands of iterations. Backtrace:

(gdb) bt 
#0  0x00007f0fffd6b827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x2f8f318) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x2f8f318, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007f0fffd6b8d4 in __new_sem_wait_slow (sem=0x2f8f318, abstime=0x0) at sem_waitcommon.c:181
#3  0x00007f0fffd6b97a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
#4  0x00007f0ffb122180 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#5  0x00007f0ffb121fa6 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#6  0x00007f0ffb131f98 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#7  0x00007f0ffb12f8a4 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#8  0x00007f0ffb1129c3 in clFinish () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#9  0x000000000097e3a4 in nbnxn_gpu_try_finish_task ()
#10 0x000000000097f79b in nbnxn_gpu_wait_finish_task ()
#11 0x0000000000947d78 in do_force_cutsVERLET(_IO_FILE*, t_commrec*, t_inputrec*, long, t_nrnb*, gmx_wallcycle*, gmx_localtop_t*, gmx_groups_t*, float (*) [3], gmx::ArrayRef<gmx::BasicVector<float> >, history_t*, gmx::ArrayRef<gmx::BasicVector<float> >, float (*) [3], t_mdatoms*, gmx_enerdata_t*, t_fcdata*, float*, t_graph*, t_forcerec*, interaction_const_t*, gmx_vsite_t*, float*, double, gmx_edsam*, int, int, DdOpenBalanceRegionBeforeForceComputation, DdCloseBalanceRegionAfterForceComputation) [clone .isra.39] ()
#12 0x000000000094a872 in do_force(_IO_FILE*, t_commrec*, t_inputrec*, long, t_nrnb*, gmx_wallcycle*, gmx_localtop_t*, gmx_groups_t*, float (*) [3], gmx::ArrayRef<gmx::BasicVector<float> >, history_t*, gmx::ArrayRef<gmx::BasicVector<float> >, float (*) [3], t_mdatoms*, gmx_enerdata_t*, t_fcdata*, gmx::ArrayRef<float>, t_graph*, t_forcerec*, gmx_vsite_t*, float*, double, gmx_edsam*, int, int, DdOpenBalanceRegionBeforeForceComputation, DdCloseBalanceRegionAfterForceComputation) ()
#13 0x0000000000414832 in gmx::do_md(_IO_FILE*, t_commrec*, gmx::MDLogger const&, int, t_filenm const*, gmx_output_env_t const*, MdrunOptions const&, gmx_vsite_t*, gmx_constr*, gmx::IMDOutputProvider*, t_inputrec*, gmx_mtop_t*, t_fcdata*, t_state*, ObservablesHistory*, gmx::MDAtoms*, t_nrnb*, gmx_wallcycle*, t_forcerec*, ReplicaExchangeParameters const&, gmx_membed_t*, gmx_walltime_accounting*) ()
#14 0x0000000000431b58 in gmx::Mdrunner::mdrunner() ()
#15 0x000000000041aac3 in gmx::Mdrunner::mainFunction(int, char**) ()
#16 0x000000000041b433 in gmx_mdrun(int, char**) ()
#17 0x000000000043ecd3 in gmx::CommandLineModuleManager::run(int, char**) ()
#18 0x000000000040dcfc in main ()

Reproduce with the following command: gmx mdrun -ntmpi 1 -nb gpu -pme cpu -notunepme -nsteps -1 -s water-0000.96.tpr -v (this runs an infinite number of iterations; terminate with SIGTERM if needed). The input file can be obtained from here: https://www.dropbox.com/s/627cyb2nzmwrqi0/water-0000.96.tpr
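
In case a minimal standalone reproducer is useful: the sketch below mimics the call pattern described above (a short kernel launch followed by one clFinish per iteration, repeated indefinitely). It is a hypothetical illustration, not GROMACS code; the kernel, buffer size, and queue settings are placeholders.

/* Hypothetical minimal sketch of the failing call pattern: launch a short
 * kernel, then clFinish, tens of thousands of times. Not GROMACS code. */
#include <stdio.h>
#include <CL/cl.h>

#define CHECK(e) do { if ((e) != CL_SUCCESS) { fprintf(stderr, "OpenCL error %d\n", (int)(e)); return 1; } } while (0)

static const char *src =
    "__kernel void noop(__global float *x) { x[get_global_id(0)] += 1.0f; }";

int main(void)
{
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    CHECK(clGetPlatformIDs(1, &platform, NULL));
    CHECK(clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL));

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    CHECK(err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);
    CHECK(err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    CHECK(err);
    CHECK(clBuildProgram(prog, 1, &device, "", NULL, NULL));
    cl_kernel k = clCreateKernel(prog, "noop", &err);
    CHECK(err);

    size_t n = 1024;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, &err);
    CHECK(err);
    CHECK(clSetKernelArg(k, 0, sizeof(cl_mem), &buf));

    /* One short kernel + one clFinish per iteration, as in the GROMACS runs. */
    for (long i = 0; ; i++) {
        CHECK(clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL));
        CHECK(clFinish(q));  /* the call that never returns once the run stalls */
        if (i % 10000 == 0)
            printf("iteration %ld\n", i);
    }
}

If a loop like this stalls the same way, that would point at the runtime rather than anything GROMACS-specific.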

pszi1ard commented 6 years ago

Not sure if related, but in the meantime I see the following messages in dmesg:

[209359.980927] perf: interrupt took too long (2510 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[258359.848121] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
[258359.848125] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
[258359.848154] pcieport 0000:00:02.0:   device [8086:2f04] error status/mask=00004000/00000000
[258359.848174] pcieport 0000:00:02.0:    [14] Completion Timeout     (First)
[258359.848190] pcieport 0000:00:02.0: broadcast error_detected message
[258359.848193] pcieport 0000:00:02.0: AER: Device recovery failed
[258359.865616] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
[258359.865619] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
[258359.865665] pcieport 0000:00:02.0:   device [8086:2f04] error status/mask=00004000/00000000
[258359.865698] pcieport 0000:00:02.0:    [14] Completion Timeout     (First)
[258359.865726] pcieport 0000:00:02.0: broadcast error_detected message
[258359.865728] pcieport 0000:00:02.0: AER: Device recovery failed
gstoner commented 6 years ago

I have the team looking into this.