DifferentiableUniverseInitiative / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
https://eng.uber.com/horovod/

multiple_communicators branch gets deadlock on Alltoall #5

Open andrevitorelli opened 3 years ago

andrevitorelli commented 3 years ago

When running the IDRIS-Hackathon fft_benchmark.job, I get the following message:

W <...>/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.

The job never finishes, but doesn't crash.
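For context, a toy sketch of the generic failure mode that warning text describes (illustration only, not the fft_benchmark / Alltoall code): if only a subset of ranks ever submits a collective, the remaining ranks never join it, the stall inspector fires after 60 seconds, and the job hangs without crashing.

# Toy sketch, not the actual benchmark: only a subset of ranks submits
# a collective, so Horovod's stall inspector reports missing ranks and
# the job hangs instead of crashing.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

x = tf.ones([4])
if hvd.rank() % 2 == 0:
    # Odd ranks skip the collective entirely; after ~60 s the warning
    # above is printed and the job stalls.
    y = hvd.allreduce(x)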

EiffL commented 3 years ago

Thanks @andrevitorelli for reporting this!

andrevitorelli commented 3 years ago

This was missing from the error message above:

Missing ranks:
0: [iFFT3D_2/HorovodAlltoall_iFFT3D_2_stack_0]
kimchitsigai commented 3 years ago

@andrevitorelli @EiffL I've just tested fft_benchmark (2 nodes / 8 GPUs):

With nsys, the results look random... I wonder whether nsys and fft_benchmark.py run into race conditions when accessing NCCL.

Can you tell me if you get the same results as I do?

kimchitsigai commented 3 years ago

@andrevitorelli @EiffL I can randomly reproduce the problem on 1 node / 4 GPUs: "/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds."... It looks like a deadlock, but when I look at the /gpfsssd/jobscratch directory, there is some activity.

At 11:32 this morning:

total 65538
drwxrwxrwx 7 shpe033 genhpe   4096 May 20 11:31 nsight_systems
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-86e1-748d-0fb1-3554.qdstrm
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-9b20-6dde-455a-f0ee.qdstrm
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-d92d-3d2f-7d85-d28e.qdstrm
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-fd08-6971-007c-fcba.qdstrm
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7838
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7839
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7840
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7841
drwx------ 3 shpe033 genhpe   4096 May 20 11:31 ompi.r14i1n1.300328

At 11:40

total 65538
drwxrwxrwx 7 shpe033 genhpe   4096 May 20 11:31 nsight_systems
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-86e1-748d-0fb1-3554.qdstrm
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-9b20-6dde-455a-f0ee.qdstrm
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-d92d-3d2f-7d85-d28e.qdstrm
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-fd08-6971-007c-fcba.qdstrm
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7838
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7839
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7840
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7841
drwx------ 3 shpe033 genhpe   4096 May 20 11:31 ompi.r14i1n1.300328

At 11:50

total 65538
drwxrwxrwx 7 shpe033 genhpe    4096 May 20 11:31 nsight_systems
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-86e1-748d-0fb1-3554.qdstrm
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-9b20-6dde-455a-f0ee.qdstrm
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-d92d-3d2f-7d85-d28e.qdstrm
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-fd08-6971-007c-fcba.qdstrm
-rw-r----- 1 shpe033 genhpe       0 May 20 11:31 nsys_termination_status_7838
-rw-r----- 1 shpe033 genhpe       0 May 20 11:31 nsys_termination_status_7839
-rw-r----- 1 shpe033 genhpe       0 May 20 11:31 nsys_termination_status_7840
-rw-r----- 1 shpe033 genhpe       0 May 20 11:31 nsys_termination_status_7841
drwx------ 3 shpe033 genhpe    4096 May 20 11:31 ompi.r14i1n1.300328

I wonder whether fft_benchmark has actually finished and, for some reason, the step that writes the nsys report to the SCRATCH directory is just very slow.
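One thing that could help distinguish a slow report write from a real hang (a rough sketch; the path and polling interval are only placeholders): poll the sizes of the .qdstrm files and check whether they keep growing, as they did between 11:32 and 11:50 above.

# Rough monitoring sketch; watch_dir is a placeholder for the job's
# scratch directory. If the .qdstrm sizes keep increasing, nsys is still
# writing its report; if they stop changing, the processes are likely stuck.
import glob
import os
import time

watch_dir = "/gpfsssd/jobscratch"

while True:
    sizes = {os.path.basename(f): os.path.getsize(f)
             for f in glob.glob(os.path.join(watch_dir, "*.qdstrm"))}
    print(time.strftime("%H:%M:%S"), sizes)
    time.sleep(60)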

andrevitorelli commented 3 years ago

@kimchitsigai @EiffL Some info: comparing a successful run (first) with a deadlocked one, I see that the layout rules appear in a different order. I don't know whether this means anything, but I'm working on understanding it better.

Successful run, quoting from fft_0.log:

INFO:tensorflow:HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('ny', 'b2'), ('tny', 'b1'), ('tnz', 'b2')}
I0520 15:14:39.514306 22949178982912 hvd_simd_mesh_impl.py:47] HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('ny', 'b2'), ('tny', 'b1'), ('tnz', 'b2')}

Unsuccessful run:

INFO:tensorflow:HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('tnz', 'b2'), ('ny', 'b2'), ('tny', 'b1')}
I0520 15:26:35.604007 22840491303424 hvd_simd_mesh_impl.py:47] HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('tnz', 'b2'), ('ny', 'b2'), ('tny', 'b1')}

The layout rules are in a different order.

Also, the unsuccessful run was on a Tesla V100-SXM2-16GB, while the successful one was on a Tesla V100-SXM2-32GB (different memory sizes). However, I was later able to complete a run on a 16GB Tesla V100, so I am now testing different/specific nodes.
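One hypothesis about why the ordering could matter at all (a guess on my part, not something I have verified in the branch): if the layout rules are kept in a plain Python set, as the {...} printout suggests, their iteration order is not guaranteed to match across processes, so different ranks could end up submitting their Alltoall ops in different orders. A minimal sketch of the idea, with layout_rules standing in for whatever hvd_simd_mesh_impl.py actually iterates over:

# Hypothetical sketch; `layout_rules` stands in for the structure printed
# above, the real code in hvd_simd_mesh_impl.py may look different.
layout_rules = {("nx", "b1"), ("tnz", "b2"), ("ny", "b2"), ("tny", "b1")}

# Iterating the set directly can yield a different order in each process.
# Sorting first makes every rank walk the rules, and therefore enqueue its
# collectives, in the same deterministic order.
for tensor_dim, mesh_dim in sorted(layout_rules):
    print(tensor_dim, mesh_dim)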

kimchitsigai commented 3 years ago

@andrevitorelli @EiffL Just in case you didn't know, you can set the Horovod layer's verbosity level with export HOROVOD_LOG_LEVEL=TRACE in the sbatch file and get messages like:

[2021-05-21 09:27:04.153366: T /tmp/pip-req-build-rxees3rj/horovod/common/controller.cc:913] Created response of size 0
[2021-05-21 09:27:04.153418: T /tmp/pip-req-build-rxees3rj/horovod/common/controller.cc:428] Sending ready responses as FFT3D/HorovodAlltoall_FFT3D_stack_0;
[2021-05-21 09:27:04.153459: T /tmp/pip-req-build-rxees3rj/horovod/common/operations.cc:654] [0]: Performing FFT3D/HorovodAlltoall_FFT3D_stack_0
[2021-05-21 09:27:04.153474: T /tmp/pip-req-build-rxees3rj/horovod/common/operations.cc:655] [0]: Processing 1 tensors
[2021-05-21 09:27:04.188005: T /tmp/pip-req-build-rxees3rj/horovod/common/ops/nccl_operations.cc:593] [NCCLAlltoall::Execute start, communicator_id=]: 2

(I've added the NCCL messages myself.)
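If you would rather not edit the sbatch file, the same thing can presumably be done from Python, assuming the variable is read when hvd.init() starts the background thread (an assumption on my part):

# Assumed alternative to the sbatch export: set the variable before
# Horovod initializes so the background thread picks it up.
import os
os.environ["HOROVOD_LOG_LEVEL"] = "TRACE"

import horovod.tensorflow as hvd
hvd.init()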

andrevitorelli commented 3 years ago

Thanks! I'll keep looking into it. I've eliminated the possibility of it being a node-specific issue. For me, about 70% of the runs fail. A few never return anything, but most get into that deadlock.

andrevitorelli commented 3 years ago

Further possibly related insights: https://github.com/horovod/horovod/issues/403

andrevitorelli commented 3 years ago

What does not work to solve this bug:

What seems to:

Current hypotheses:

kimchitsigai commented 3 years ago

@EiffL @andrevitorelli I've tried the 2 modifications below.

Test results (with module load tensorflow-gpu/py3/2.4.1+cuda-11.2 nvidia-nsight-systems/2021.1.1):

An attempt to explain the 4-node / 16-GPU problem. This is a representation of the 16 processes and the 2 sub-communicators, as seen in the logs:

process/global_rank   0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15

devices[comm_id=1]    0  0  0  0  4  4  4  4  8  8  8  8  12 12 12 12
                      1  1  1  1  5  5  5  5  9  9  9  9  13 13 13 13
                      2  2  2  2  6  6  6  6  10 10 10 10 14 14 14 14
                      3  3  3  3  7  7  7  7  11 11 11 11 15 15 15 15

devices[comm_id=2]    0  1  2  3  0  1  2  3  0  1  2  3  0  1  2  3
                      4  5  6  7  4  5  6  7  4  5  6  7  4  5  6  7
                      8  9  10 11 8  9  10 11 8  9  10 11 8  9  10 11
                      12 13 14 15 12 13 14 15 12 13 14 15 12 13 14 15

Let's suppose that:

So we get the "Shutting down Background thread loop" message only for processes 12, 13, 14, 15, 3, 7, and 11. The other processes are stuck and the qdrep files are not generated.
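For reference, this is how I read the table above; a small sketch (my own reconstruction, not code from the multiple_communicators branch) that reproduces the two groupings from the global rank:

# Reconstruction of the table above for 16 ranks in blocks of 4; not code
# taken from the branch itself.
def comm_groups(global_rank, block=4, world=16):
    # comm_id=1: contiguous blocks of `block` consecutive ranks
    comm1 = [(global_rank // block) * block + i for i in range(block)]
    # comm_id=2: strided groups, one rank taken from each block
    comm2 = [(global_rank % block) + block * i for i in range(world // block)]
    return comm1, comm2

for r in range(16):
    print(r, comm_groups(r))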

kimchitsigai commented 3 years ago

At the end of fft_benchmark.py, replacing:

exit(0)

with:

import time
import horovod.tensorflow as hvd
hvd.shutdown()
time.sleep(5)  # wait a few seconds before exiting
exit(0)

seems to give better results. The qdrep files are generated in the 4 test cases above. The jobs must still be cancelled manually with scancel.
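A possible variant of the same workaround (my own suggestion, assuming hvd.init() was already called earlier in fft_benchmark.py): register the shutdown with atexit so it also runs if the script exits early or raises.

# Variant of the workaround above; assumes Horovod is already imported
# and initialized earlier in fft_benchmark.py.
import atexit
import time

import horovod.tensorflow as hvd

def _drain_horovod():
    hvd.shutdown()   # stop the Horovod background thread loop
    time.sleep(5)    # give nsys a few seconds to flush its report

atexit.register(_drain_horovod)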