andrevitorelli opened this issue 3 years ago
Thanks @andrevitorelli for reporting this!
This was missing from the error message above:
Missing ranks:
0: [iFFT3D_2/HorovodAlltoall_iFFT3D_2_stack_0]
@andrevitorelli @EiffL I've just tested fft_benchmark (2 nodes / 8 GPUs):
With everything related to nsys removed (module load, TMPDIR and the nsys profile command), running srun python fft_benchmark.py directly: not much trace, but it looks OK.
With nvprof (again removing everything related to nsys, including the TMPDIR link): looks OK.
With nsys, the results look random... I wonder whether nsys and fft_benchmark.py run into race conditions when accessing NCCL.
Can you tell me if you get the same results as me?
@andrevitorelli @EiffL I can randomly reproduce the problem on 1 node / 4 GPUs: "/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds."... It looks like a deadlock, but when I look at the /gpfsssd/jobscratch directory, there is some activity.
At 11:32 this morning:
total 65538
drwxrwxrwx 7 shpe033 genhpe 4096 May 20 11:31 nsight_systems
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-86e1-748d-0fb1-3554.qdstrm
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-9b20-6dde-455a-f0ee.qdstrm
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-d92d-3d2f-7d85-d28e.qdstrm
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-fd08-6971-007c-fcba.qdstrm
-rw-r----- 1 shpe033 genhpe 0 May 20 11:31 nsys_termination_status_7838
-rw-r----- 1 shpe033 genhpe 0 May 20 11:31 nsys_termination_status_7839
-rw-r----- 1 shpe033 genhpe 0 May 20 11:31 nsys_termination_status_7840
-rw-r----- 1 shpe033 genhpe 0 May 20 11:31 nsys_termination_status_7841
drwx------ 3 shpe033 genhpe 4096 May 20 11:31 ompi.r14i1n1.300328
At 11:40
total 65538
drwxrwxrwx 7 shpe033 genhpe 4096 May 20 11:31 nsight_systems
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-86e1-748d-0fb1-3554.qdstrm
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-9b20-6dde-455a-f0ee.qdstrm
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-d92d-3d2f-7d85-d28e.qdstrm
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-fd08-6971-007c-fcba.qdstrm
-rw-r----- 1 shpe033 genhpe 0 May 20 11:31 nsys_termination_status_7838
-rw-r----- 1 shpe033 genhpe 0 May 20 11:31 nsys_termination_status_7839
-rw-r----- 1 shpe033 genhpe 0 May 20 11:31 nsys_termination_status_7840
-rw-r----- 1 shpe033 genhpe 0 May 20 11:31 nsys_termination_status_7841
drwx------ 3 shpe033 genhpe 4096 May 20 11:31 ompi.r14i1n1.300328
At 11:50
total 65538
drwxrwxrwx 7 shpe033 genhpe 4096 May 20 11:31 nsight_systems
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-86e1-748d-0fb1-3554.qdstrm
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-9b20-6dde-455a-f0ee.qdstrm
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-d92d-3d2f-7d85-d28e.qdstrm
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-fd08-6971-007c-fcba.qdstrm
-rw-r----- 1 shpe033 genhpe 0 May 20 11:31 nsys_termination_status_7838
-rw-r----- 1 shpe033 genhpe 0 May 20 11:31 nsys_termination_status_7839
-rw-r----- 1 shpe033 genhpe 0 May 20 11:31 nsys_termination_status_7840
-rw-r----- 1 shpe033 genhpe 0 May 20 11:31 nsys_termination_status_7841
drwx------ 3 shpe033 genhpe 4096 May 20 11:31 ompi.r14i1n1.300328
I wonder whether fft_benchmark has actually finished and, for some reason, the step that writes the nsys report to the SCRATCH directory is just very slow.
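To check that hypothesis, a small monitoring sketch like the one below (my own, not part of the benchmark; the scratch path is the one from the listings above and may need adjusting) can poll the .qdstrm reports and show whether nsys is still flushing the trace:

# Poll the nsys .qdstrm reports in the job scratch directory and print their
# sizes, to see whether nsys is still (slowly) writing the trace.
import glob
import os
import time

SCRATCH = "/gpfsssd/jobscratch"  # directory taken from the ls output above

for _ in range(10):  # take 10 snapshots, one per minute
    for path in sorted(glob.glob(os.path.join(SCRATCH, "nsys-report-*.qdstrm"))):
        print(f"{path}: {os.path.getsize(path)} bytes")
    print("---")
    time.sleep(60)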
@kimchitsigai @EiffL Some info: comparing a successful run (first) with a deadlocked one, I see that the layout rules appear in a different order. I don't know whether this means anything, but I'm working on understanding it better.
Successful run, quoting from fft_0.log:
INFO:tensorflow:HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('ny', 'b2'), ('tny', 'b1'), ('tnz', 'b2')}
I0520 15:14:39.514306 22949178982912 hvd_simd_mesh_impl.py:47] HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('ny', 'b2'), ('tny', 'b1'), ('tnz', 'b2')}
Unsuccessful run:
INFO:tensorflow:HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('tnz', 'b2'), ('ny', 'b2'), ('tny', 'b1')}
I0520 15:26:35.604007 22840491303424 hvd_simd_mesh_impl.py:47] HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('tnz', 'b2'), ('ny', 'b2'), ('tny', 'b1')}
The layout rules are in a different order.
Also, the unsuccessful run was on a Tesla V100-SXM2-16GB, while the successful one was on a Tesla V100-SXM2-32GB (different memory sizes).
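If the iteration order of that LayoutRules set is what determines the order in which each rank builds its Alltoall collectives, one cheap experiment would be to canonicalize the rules before they reach HvdSimdMeshImpl. A minimal sketch of the idea, where layout_rules merely stands in for whatever iterable of (tensor_dim, mesh_dim) pairs the mesh implementation receives (hypothetical, not the exact mesh-tensorflow API):

# Force a deterministic order on the layout rules so every rank enumerates them
# in the same sequence, independent of set/hash iteration order.
layout_rules = {("nx", "b1"), ("tnz", "b2"), ("ny", "b2"), ("tny", "b1")}

canonical_rules = sorted(layout_rules)  # lexicographic, identical on every rank
print(canonical_rules)
# [('nx', 'b1'), ('ny', 'b2'), ('tny', 'b1'), ('tnz', 'b2')]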
I was able to run on a 16GB Tesla V100. I am now testing different/specific nodes.
@andrevitorelli @EiffL Just in case you didn't know, you can set the Horovod layer's verbosity level with
export HOROVOD_LOG_LEVEL=TRACE
in the sbatch file to get messages like:
[2021-05-21 09:27:04.153366: T /tmp/pip-req-build-rxees3rj/horovod/common/controller.cc:913] Created response of size 0
[2021-05-21 09:27:04.153418: T /tmp/pip-req-build-rxees3rj/horovod/common/controller.cc:428] Sending ready responses as FFT3D/HorovodAlltoall_FFT3D_stack_0;
[2021-05-21 09:27:04.153459: T /tmp/pip-req-build-rxees3rj/horovod/common/operations.cc:654] [0]: Performing FFT3D/HorovodAlltoall_FFT3D_stack_0
[2021-05-21 09:27:04.153474: T /tmp/pip-req-build-rxees3rj/horovod/common/operations.cc:655] [0]: Processing 1 tensors
[2021-05-21 09:27:04.188005: T /tmp/pip-req-build-rxees3rj/horovod/common/ops/nccl_operations.cc:593] [NCCLAlltoall::Execute start, communicator_id=]: 2
(I've added the NCCL messages myself.)
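With TRACE logging on, a quick way to spot which rank never submits the Alltoall that the stall inspector is waiting for is to grep each rank's log for the "Performing ... HorovodAlltoall ..." line. A small sketch of mine (the fft_*.log naming follows the fft_0.log file quoted earlier and may need adjusting):

# Count the Alltoall submissions reported in each rank's Horovod TRACE log.
import glob
import re

pattern = re.compile(r"Performing \S*HorovodAlltoall\S*")

for log in sorted(glob.glob("fft_*.log")):
    with open(log) as f:
        ops = pattern.findall(f.read())
    last = ops[-1] if ops else "(none)"
    print(f"{log}: {len(ops)} Alltoall submissions, last: {last}")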
Thanks! I'll keep looking into it. I've eliminated the possibility of it being a node-specific issue. For me, about 70% of the runs fail. A few never return anything, but most get into that deadlock.
Further possibly related insights: https://github.com/horovod/horovod/issues/403
What does not work to solve this bug:
What seems to work:
Current hypotheses:
@EiffL @andrevitorelli I've tried the 2 modifications below.
1. In all the EnqueueTensorAllxxx() functions in horovod/common/operations.cc, message.set_device(device) is replaced by message.set_device(horovod_global.controller[0]->GetRank()). When I printed it before this modification, the value of device was 0 for all the devices of a communicator. Code is here: https://github.com/DifferentiableUniverseInitiative/horovod/blob/nccl_fixes/horovod/common/operations.cc
2. In InitNCCLComm() in horovod/common/ops/nccl_operations.cc, the line nccl_context_->nccl_comms[communicator_id][nccl_device_map] = nccl_comm is added so that the {device_map, communicator_id} entry is inserted in the dictionary. I have also replaced the use of global_state_->current_nccl_stream by communicator_id, since the value of global_state_->current_nccl_stream does not seem consistent; it should be equal to communicator_id. Code is here: https://github.com/DifferentiableUniverseInitiative/horovod/blob/nccl_fixes/horovod/common/ops/nccl_operations.cc
Test results (with module load tensorflow-gpu/py3/2.4.1+cuda-11.2 nvidia-nsight-systems/2021.1.1):
- 2 nodes / 8 GPUs: the "Max absolute FFT error... with wall time..." message is present in the 8 log files and the qdrep files are OK (they can be opened with nsys-ui), but the job does not end and must be scanceled. There are 4 [nsys-launcher] <defunct> processes on the second machine.
- 4 nodes / 16 GPUs: the "Max absolute FFT error... with wall time..." message is present in the 16 log files. The qdrep files are sometimes produced, sometimes not. When the qdrep files are produced, the 16 processes have the "Shutting down background thread" message in their log file. When they are not produced, some of the 16 processes have no "Shutting down background thread" message in their log file. The job does not end and must be scanceled. There are 4 [nsys-launcher] <defunct> processes on each of the 3 last machines.
- 8 nodes / 32 GPUs: the "Max absolute FFT error... with wall time..." message is present in the 32 log files and the qdrep files are OK (they can be opened with nsys-ui), but the job does not end and must be scanceled. There are 4 [nsys-launcher] <defunct> processes on each of the 7 last machines.
Attempt to explain the 4 node / 16 GPU problem. This is a representation of the 16 processes and the 2 sub-communicators, as seen in the logs:
process/global_rank   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
devices[comm_id=1]    0  0  0  0  4  4  4  4  8  8  8  8 12 12 12 12
                      1  1  1  1  5  5  5  5  9  9  9  9 13 13 13 13
                      2  2  2  2  6  6  6  6 10 10 10 10 14 14 14 14
                      3  3  3  3  7  7  7  7 11 11 11 11 15 15 15 15
devices[comm_id=2]    0  1  2  3  0  1  2  3  0  1  2  3  0  1  2  3
                      4  5  6  7  4  5  6  7  4  5  6  7  4  5  6  7
                      8  9 10 11  8  9 10 11  8  9 10 11  8  9 10 11
                     12 13 14 15 12 13 14 15 12 13 14 15 12 13 14 15
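To double-check the two layouts above, here is a small reconstruction of mine (not code from Horovod or the benchmark) that derives the same device groups from the global rank, assuming 16 ranks split into sub-communicators of size 4:

# Reconstruct the device groups shown in the table: comm_id=1 groups contiguous
# blocks of 4 ranks, comm_id=2 groups ranks with the same position inside a block.
NRANKS, GROUP = 16, 4

for rank in range(NRANKS):
    comm1 = [GROUP * (rank // GROUP) + i for i in range(GROUP)]  # e.g. rank 5 -> [4, 5, 6, 7]
    comm2 = [rank % GROUP + GROUP * i for i in range(GROUP)]     # e.g. rank 5 -> [1, 5, 9, 13]
    print(f"rank {rank:2d}  comm_id=1 {comm1}  comm_id=2 {comm2}")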
Let's suppose that:
So, we get the "Shutting down Background thread loop" message only for processes 12,13,14,15,3,7,11. The other processes are stuck and the qdrep files are not generated.
At the end of fft_benchmark.py, replacing:
exit(0)
by
import time
import horovod.tensorflow as hvd

hvd.shutdown()  # shut down the Horovod background thread explicitly
time.sleep(5)   # leave nsys a moment to flush its report
exit(0)
seems to give better results. The qdrep files are generated in the 4 above test cases. The jobs must still be scanceled.
When running the IDRIS-Hackathon fft_benchmark.job, I get the following message:
W <...>/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
The job never finishes, but doesn't crash.
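One thing that might at least make the hang fail fast instead of sitting until the Slurm time limit: if I'm not mistaken, the stall inspector quoted above can be tuned through environment variables (worth double-checking against the installed Horovod version). A hedged sketch, with the variables set before Horovod is initialized:

# Assumed Horovod env vars (verify against your Horovod version's docs):
# HOROVOD_STALL_CHECK_TIME_SECONDS    - interval for the stall warning above
# HOROVOD_STALL_SHUTDOWN_TIME_SECONDS - abort the job after this much stalling
import os

os.environ["HOROVOD_STALL_CHECK_TIME_SECONDS"] = "60"
os.environ["HOROVOD_STALL_SHUTDOWN_TIME_SECONDS"] = "300"

import horovod.tensorflow as hvd
hvd.init()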