Closed: gakolhe closed this issue 1 year ago
What is the exact command you are running?
Also, can you check whether the following issue answers your question: https://github.com/facebookresearch/dlrm/issues/231?
There has been no response for several months. I'm assuming this is resolved. Closing.
I hit this issue with the Terabyte dataset, while it does not occur with the Kaggle dataset. The problem is that rank 7 always gets stuck during testing at the 372000th training iteration: it is never able to pass the barrier. I am running this experiment on 8x A100s.
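One way to make a hang like this fail loudly is to initialize the process group with a shorter timeout. Below is a minimal sketch (not from the actual run script): it assumes the usual env:// rendezvous set up by the launcher, and on some PyTorch versions NCCL async error handling must be enabled for the timeout to be enforced.

```python
import datetime
import os

import torch.distributed as dist

# Assumption: on some PyTorch versions the NCCL timeout is only enforced
# when async error handling is enabled; set it before process group init.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"

# Assumes env:// rendezvous (MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE
# provided by the launcher). A shorter timeout turns a silent hang into an
# explicit error instead of blocking indefinitely.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=10),
)

# ... training / evaluation loop ...

# This is the kind of collective where the hang is observed: if rank 7
# never reaches the barrier, the other ranks now error out after the
# timeout rather than waiting forever.
dist.barrier()
```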
To debug, I have used the following options: NCCL_DEBUG=INFO, NCCL_DEBUG_SUBSYS=ALL, TORCH_DISTRIBUTED_DEBUG=DETAIL, TORCH_CPP_LOG_LEVEL=INFO, CUDA_LAUNCH_BLOCKING=1.
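For reference, a sketch of setting these variables programmatically (equivalently, export them in the shell before invoking run_and_time.sh; they must be in the environment before torch is imported and the worker processes are launched):

```python
import os

# Debug settings listed above; set them before importing torch so the C++
# log level is picked up at import time and before workers are spawned.
os.environ["NCCL_DEBUG"] = "INFO"                 # verbose NCCL logging
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"           # log all NCCL subsystems
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra collective consistency checks
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"        # surface C++-level torch logs
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"          # synchronous kernel launches for exact stack traces

import torch  # noqa: E402  (imported after the environment is configured)
```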
However, I haven't been able to find anything suspicious in the resulting output.
Below is a snapshot of the error message; I have attached the complete error message as a log file. This occurs while executing the script from run_and_time.sh with PyTorch distributed.