facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License

distributed_launch doesn't work with Terabyte dataset. #292

Closed: gakolhe closed this issue 1 year ago

gakolhe commented 1 year ago

I hit this issue with the Terabyte dataset, while it does not occur with the Kaggle dataset. Rank 7 always gets stuck during testing at the 372000th training iteration and cannot pass the barrier. I am running this experiment on 8x A100s.

To debug, I have used the following options: NCCL_DEBUG=INFO, NCCL_DEBUG_SUBSYS=ALL, TORCH_DISTRIBUTED_DEBUG=DETAIL, TORCH_CPP_LOG_LEVEL=INFO, CUDA_LAUNCH_BLOCKING=1.

However, I haven't found anything suspicious in the resulting output.
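For reference, a minimal sketch (not taken from run_and_time.sh) of how these same debug switches can be set programmatically rather than exported in the shell, assuming the assignments run before torch and NCCL initialize:

```python
# Sketch only: set the debug environment variables before torch is imported,
# so the C++ logger and NCCL pick them up at initialization time.
import os

os.environ["NCCL_DEBUG"] = "INFO"                 # NCCL log verbosity
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"           # log all NCCL subsystems
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra collective consistency checks
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"        # surface c10d C++ log messages
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"          # make kernel launches synchronous

import torch  # imported after the env vars so the settings take effect
```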

Below is a snapshot of the error message; I have attached the full error message as a log file. This happens while executing the script from run_and_time.sh with PyTorch distributed.

[E ProcessGroupNCCL.cpp:821] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=512012, OpType=ALLTOALL_BASE, Timeout(ms)=1800000) ran for 1803059 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=512012, OpType=ALLTOALL_BASE, Timeout(ms)=1800000) ran for 1803212 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=512012, OpType=ALLTOALL_BASE, Timeout(ms)=1800000) ran for 1803234 milliseconds before timing out.
667b0d9aff664a0e90a388e56e93a685000000:112:276 [2] NCCL INFO [Service thread] Connection closed by localRank 2
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=512012, OpType=ALLTOALL_BASE, Timeout(ms)=1800000) ran for 1803291 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=512012, OpType=ALLTOALL_BASE, Timeout(ms)=1800000) ran for 1803322 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=512012, OpType=ALLTOALL_BASE, Timeout(ms)=1800000) ran for 1803338 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=512012, OpType=ALLTOALL_BASE, Timeout(ms)=1800000) ran for 1803346 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=512012, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803353 milliseconds before timing out.
667b0d9aff664a0e90a388e56e93a685000000:110:275 [0] NCCL INFO [Service thread] Connection closed by localRank 0
667b0d9aff664a0e90a388e56e93a685000000:115:274 [5] NCCL INFO [Service thread] Connection closed by localRank 5
667b0d9aff664a0e90a388e56e93a685000000:114:273 [4] NCCL INFO [Service thread] Connection closed by localRank 4
667b0d9aff664a0e90a388e56e93a685000000:112:168 [0] NCCL INFO comm 0x5642a7f6d3d0 rank 2 nranks 8 cudaDev 2 busId 300000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
667b0d9aff664a0e90a388e56e93a685000000:110:177 [0] NCCL INFO comm 0x5555a2fb7970 rank 0 nranks 8 cudaDev 0 busId 100000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
667b0d9aff664a0e90a388e56e93a685000000:115:171 [0] NCCL INFO comm 0x55b3b87c05b0 rank 5 nranks 8 cudaDev 5 busId c00000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
667b0d9aff664a0e90a388e56e93a685000000:114:183 [0] NCCL INFO comm 0x556914c016b0 rank 4 nranks 8 cudaDev 4 busId b00000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
667b0d9aff664a0e90a388e56e93a685000000:116:174 [0] NCCL INFO comm 0x5567fcc80eb0 rank 6 nranks 8 cudaDev 6 busId d00000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
667b0d9aff664a0e90a388e56e93a685000000:111:189 [0] NCCL INFO comm 0x55f9f9511cd0 rank 1 nranks 8 cudaDev 1 busId 200000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
667b0d9aff664a0e90a388e56e93a685000000:113:186 [0] NCCL INFO comm 0x56060c435100 rank 3 nranks 8 cudaDev 3 busId 400000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
667b0d9aff664a0e90a388e56e93a685000000:117:180 [0] NCCL INFO comm 0x55a67e095e10 rank 7 nranks 8 cudaDev 7 busId e00000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 113 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 115 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 110) of binary: /opt/conda/envs/ptca/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/ptca/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.0', 'console_scripts', 'torchrun')())
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
dlrm_s_pytorch.py FAILED
------------------------------------------------------------
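For reference, the log above shows the default 30-minute watchdog window (Timeout(ms)=1800000). A minimal sketch of widening that window when the process group is created, purely as a diagnostic aid to distinguish a slow collective from a genuine desync; it is not a fix for the underlying hang:

```python
# Sketch only: raise the NCCL watchdog timeout from its 30-minute default.
from datetime import timedelta

import torch.distributed as dist

# Under torchrun, rank/world_size/master address come from the environment.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=2),  # default is timedelta(minutes=30)
)
```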
mnaumovfb commented 1 year ago

What is the exact command you are running?

Also, can you check whether this issue answers your question: https://github.com/facebookresearch/dlrm/issues/231?

mnaumovfb commented 1 year ago

There has been no response for several months, so I'm assuming this is resolved. Closing.