CompFUSE / DCA


G4 ring test failure on local machine. #205

Closed. gbalduzz closed this issue 4 years ago.

gbalduzz commented 4 years ago

All tests on the current master branch pass on my local machine except ringG_tp_accumulator_gpu_test. There is a limit on how many processes can share the GPU, but with only 3 processes that should not be the issue, and this is not the only test that uses both MPI and the GPU. I attach the output: ring_test_output.txt

gbalduzz commented 4 years ago

Also, the input parameter check does not work as intended: the error code returned at mci_parameters.hpp:288 is ignored. The ring test should instead fail, since the input file does not satisfy the constraints on the input parameters.

weilewei commented 4 years ago

You might need CUDA-aware MPI and its corresponding way of launching the app. Also, depending on the machine, cvdlauncher.sh might need slight modification. I don't have a laptop with a GPU in it, so I will check other available machines and report back on how we can fix it.

gbalduzz commented 4 years ago

Right, it uses CUDA-aware MPI. Then I suppose we don't need to fix it, as it is a limitation of my system.

gbalduzz commented 4 years ago

It seems that once the parameter reading is fixed, the issue is also present on Daint:

H_0 and H_int initialization start:    29-07-2020 22:55:45
H_0 and H_int initialization end:      29-07-2020 22:55:45
H_0 and H_int initialization duration: 6.616100e-05 s

G_0 initialization start:    29-07-2020 22:55:45
G_0 initialization end:      29-07-2020 22:55:45
G_0 initialization duration: 2.143000e-05 s

xpmem_attach error: : No such file or directory
Rank 2 [Wed Jul 29 22:55:47 2020] [c5-1c1s11n1] Fatal error in PMPI_Wait: Other MPI error, error stack:
PMPI_Wait(207)....................: MPI_Wait(request=0x7ffffffef888, status=0x7ffffffef860) failed
MPIR_Wait_impl(81)................:
MPIDI_CH3I_Progress(568)..........:
pkt_RTS_handler(306)..............:
do_cts(637).......................:
MPID_nem_lmt_xpmem_start_recv(983):
MPID_nem_lmt_send_COOKIE(462).....:
MPID_nem_lmt_send_COOKIE(403).....: xpmem_attach failed on rank 2 (src_rank 1, vaddr 0x31043bce00, len 100352)
/scratch/snx3000/jenks299/workspace-pr-gpu/DCA/test/cvdlauncher.sh: line 23: 24883 Segmentation fault      $cmd $*
/scratch/snx3000/jenks299/workspace-pr-gpu/DCA/test/cvdlauncher.sh: line 23: 24881 Aborted                 $cmd $*
/scratch/snx3000/jenks299/workspace-pr-gpu/DCA/test/cvdlauncher.sh: line 23: 24882 Segmentation fault      $cmd $*
srun: error: nid02989: tasks 0-1: Exited with exit code 139
srun: Terminating job step 24650219.138
srun: error: nid02989: task 2: Exited with exit code 134
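For reference, the `xpmem_attach` failure in the stack above comes from Cray MPICH's XPMEM-based single-copy transfer path. A possible workaround (an assumption, not verified on Daint: the variable names and accepted values depend on the Cray MPICH/MPT version installed, so check `man intro_mpi` on the system first) is to disable that path before launching:

```shell
# Hypothetical workaround sketch for the xpmem_attach failure above.
# Which variable applies depends on the Cray MPICH version in use;
# consult `man intro_mpi` on the target system before relying on it.
export MPICH_SMP_SINGLE_COPY_MODE=NONE    # newer Cray MPICH releases
# export MPICH_SMP_INGLE_COPY_OFF=1 is the analogous switch on some
# older Cray MPT releases (spelling per the installed man page).
```

Disabling single-copy transfers falls back to a slower on-node path, so this is a diagnostic step rather than a fix.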
weilewei commented 4 years ago

I am also unable to run CUDA-aware MPI successfully on Daint; I experienced the same crashes. But if you find a solution, please let me know. Thanks.

gbalduzz commented 4 years ago

Replaced by #212.