Occasionally, running test_cuda_mpi (at commit 669daf1258fb4362d0892cd93098add13c7423cf) on a single-GPU system deadlocks with the following output:
```
DistributedDomain::ctor(): shmcomm rank 0/1 rank 0/1 local=0 using gpu 0
DistributedDomain::ctor(): rank 0 colocated with 0 ranks
gpu distance matrix: 0
10x10x10 of 1x1x1x (gpus 1x1x1)
rank,gpu=0,0(gpu actual idx=0) => idx 0 0 0
DistributedDomain.realize(): finished creating LocalDomain
[0,0,0] -> [0,0,0] dim=0 dir=-1 dirVec=[-1,0,0] r0,g0 -> r0,g0
DistributedDomain.realize(): dim=0 dir=-1 send same rank
DistributedDomain.realize(): dim=0 dir=-1 recv same rank
[0,0,0] -> [0,0,0] dim=0 dir=1 dirVec=[1,0,0] r0,g0 -> r0,g0
DistributedDomain.realize(): dim=0 dir=1 send same rank
DistributedDomain.realize(): dim=0 dir=1 recv same rank
[0,0,0] -> [0,0,0] dim=1 dir=-1 dirVec=[0,-1,0] r0,g0 -> r0,g0
DistributedDomain.realize(): dim=1 dir=-1 send same rank
DistributedDomain.realize(): dim=1 dir=-1 recv same rank
[0,0,0] -> [0,0,0] dim=1 dir=1 dirVec=[0,1,0] r0,g0 -> r0,g0
DistributedDomain.realize(): dim=1 dir=1 send same rank
DistributedDomain.realize(): dim=1 dir=1 recv same rank
[0,0,0] -> [0,0,0] dim=2 dir=-1 dirVec=[0,0,-1] r0,g0 -> r0,g0
DistributedDomain.realize(): dim=2 dir=-1 send same rank
DistributedDomain.realize(): dim=2 dir=-1 recv same rank
[0,0,0] -> [0,0,0] dim=2 dir=1 dirVec=[0,0,1] r0,g0 -> r0,g0
DistributedDomain.realize(): dim=2 dir=1 send same rank
DistributedDomain.realize(): dim=2 dir=1 recv same rank
FaceSender::send_impl(): send data 0
FaceSender::send_impl(): send data 0
FaceSender::send_impl(): send data 0
FaceSender::send_impl(): send data 0
AnySender::sender(): r0,g0: cudaMemcpy
FaceSender::send_impl(): send data 0
AnySender::sender(): r0,g0: cudaMemcpy
FaceSender::send_impl(): send data 0
AnySender::sender(): r0,g0: cudaMemcpy
AnySender::sender(): r0,g0,d0: Isend 400B -> r0,g0,d0 (tag=20000000)
AnySender::sender(): r0,g0: cudaMemcpy
AnySender::sender(): r0,g0,d0: Isend 400B -> r0,g0,d0 (tag=02000000)
AnySender::wait(): r0,g0: wait on Isend
AnySender::sender(): r0,g0,d0: Isend 400B -> r0,g0,d0 (tag=01000000)
AnySender::sender(): r0,g0: cudaMemcpy
AnySender::wait(): r0,g0: finished Isend
AnySender::wait(): r0,g0: wait on Isend
AnySender::wait(): r0,g0: wait on Isend
AnySender::wait(): r0,g0: finished Isend
AnySender::wait(): r0,g0: finished Isend
AnySender::sender(): r0,g0: cudaMemcpy
AnySender::sender(): r0,g0,d0: Isend 400B -> r0,g0,d0 (tag=08000000)
AnyRecver::recver(): r0,g0,d0 Irecv 400B from r0,g0,d0 (tag=20000000)
AnyRecver::recver(): r0,g0,d0 Irecv 400B from r0,g0,d0 (tag=02000000)
AnyRecver::recver(): r0,g0: wait on Irecv
AnyRecver::recver(): r0,g0: got Irecv. cudaMemcpyAsync
AnyRecver::recver(): r0,g0,d0 Irecv 400B from r0,g0,d0 (tag=08000000)
AnyRecver::recver(): r0,g0: wait on Irecv
AnyRecver::recver(): wait for cuda sync
AnySender::sender(): r0,g0,d0: Isend 400B -> r0,g0,d0 (tag=10000000)
AnySender::wait(): r0,g0: wait on Isend
AnySender::wait(): r0,g0: finished Isend
AnySender::sender(): r0,g0,d0: Isend 400B -> r0,g0,d0 (tag=04000000)
AnyRecver::recver(): r0,g0: wait on Irecv
AnyRecver::recver(): r0,g0: got Irecv. cudaMemcpyAsync
AnyRecver::recver(): r0,g0,d0 Irecv 400B from r0,g0,d0 (tag=01000000)
AnyRecver::recver(): r0,g0: wait on Irecv
AnySender::wait(): r0,g0: wait on Isend
AnyRecver::recver(): r0,g0,d0 Irecv 400B from r0,g0,d0 (tag=04000000)
FaceSender::wait() for fut_
FaceSender::wait() for senders
AnyRecver::recver(): wait for cuda sync
AnySender::wait(): r0,g0: wait on Isend
AnySender::wait(): r0,g0: finished Isend
AnyRecver::recver(): r0,g0: wait on Irecv
FaceSender::wait() done
AnyRecver::recver(): r0,g0,d0 Irecv 400B from r0,g0,d0 (tag=10000000)
AnyRecver::recver(): r0,g0: wait on Irecv
AnySender::wait(): r0,g0: finished Isend
```
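For context, the log shows each AnySender thread staging 400B of halo data to the host with cudaMemcpy and posting an MPI_Isend to its own rank, while an AnyRecver thread posts the matching MPI_Irecv, waits, and copies the payload back to the device with cudaMemcpyAsync. Below is a minimal sketch of that same-rank, two-thread exchange; the variable names and sizes are hypothetical, not the library's actual code. It assumes the MPI runtime provides MPI_THREAD_MULTIPLE, since concurrent MPI calls from both threads can otherwise hang, which is one plausible source of this kind of deadlock.

```cpp
// Sketch (hypothetical names) of the same-rank exchange pattern in the log:
// a sender thread stages device data to the host and Isends it to its own
// rank; a receiver thread posts the matching Irecv and copies the payload
// back to the device.
#include <mpi.h>
#include <cuda_runtime.h>
#include <thread>
#include <vector>

int main(int argc, char **argv) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const size_t n = 100; // 100 floats = 400B, matching the log
  float *devSrc, *devDst;
  cudaMalloc(&devSrc, n * sizeof(float));
  cudaMalloc(&devDst, n * sizeof(float));
  std::vector<float> hostSend(n), hostRecv(n);

  const int tag = 0; // the log uses a distinct tag per face; collapsed here

  std::thread sender([&] {
    // Stage device data through the host, then send to our own rank.
    cudaMemcpy(hostSend.data(), devSrc, n * sizeof(float),
               cudaMemcpyDeviceToHost);
    MPI_Request req;
    MPI_Isend(hostSend.data(), n, MPI_FLOAT, rank, tag, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
  });

  std::thread recver([&] {
    // Post the matching receive, then move the payload back to the device.
    MPI_Request req;
    MPI_Irecv(hostRecv.data(), n, MPI_FLOAT, rank, tag, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    cudaMemcpyAsync(devDst, hostRecv.data(), n * sizeof(float),
                    cudaMemcpyHostToDevice);
    cudaDeviceSynchronize(); // the log's "wait for cuda sync"
  });

  sender.join();
  recver.join();

  cudaFree(devSrc);
  cudaFree(devDst);
  MPI_Finalize();
  return 0;
}
```

In the real run each face gets its own tag (01000000, 02000000, 04000000, ...), so the same-rank messages can only match their intended partners; the sketch collapses that to a single tag for brevity.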
This appears to be fixed by commit b7b1a40747ebd386bbed8ca1dd81bf7c4250cf51.