Open RaymondLi0 opened 1 year ago
When launching very long training runs, building the index mappings can take more than 1 minute, and as a consequence the other ranks time out at this broadcast: https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/megatron/training.py#L962. However, the timeout passed to torch.distributed.initialize is 10 minutes. Why isn't this value used in torch.distributed.broadcast?
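For context, here is a minimal sketch of the pattern that triggers the timeout (assumptions: plain PyTorch distributed rather than Megatron's actual code, the gloo backend for portability, and a hypothetical slow build step standing in for the index-mapping construction). The timeout passed to init_process_group is the 10-minute value mentioned above; whether a collective such as broadcast actually honors it depends on the backend, which is exactly the question raised here.

```python
import time
from datetime import timedelta

import torch
import torch.distributed as dist


def slow_build_index_mappings() -> None:
    # Hypothetical stand-in for the >1 minute index-mapping build on rank 0.
    time.sleep(90)


def main() -> None:
    # The 10-minute timeout the issue refers to. Launch with e.g.
    # `torchrun --nproc_per_node=2 repro.py`.
    dist.init_process_group(backend="gloo", timeout=timedelta(minutes=10))

    flag = torch.zeros(1)
    if dist.get_rank() == 0:
        slow_build_index_mappings()
        flag += 1

    # Non-zero ranks block here while rank 0 builds; if the wait exceeds the
    # effective collective timeout, they abort with a timeout error.
    dist.broadcast(flag, src=0)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```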
The workaround for now is to first create the index mappings on a single worker in a preliminary run.
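A sketch of that preliminary run follows (the cache layout, function, and argument names are hypothetical stand-ins for Megatron's dataset-building code, not its actual API). The idea is to run it as a plain single-process job so the cached index files already exist, and every rank takes the fast cache-hit path, when the real multi-rank job starts.

```python
from pathlib import Path

import numpy as np


def build_index_mappings(cache_dir: Path, num_samples: int, seed: int) -> None:
    # Stand-in for the index-mapping build: compute the shuffled sample
    # index once, save it to disk, and skip the work on later runs.
    cache_dir.mkdir(parents=True, exist_ok=True)
    idx_path = cache_dir / f"sample_idx_{num_samples}_{seed}.npy"
    if idx_path.exists():
        return  # cache hit: the real training run takes this fast path

    rng = np.random.default_rng(seed)
    shuffle_idx = rng.permutation(num_samples)
    np.save(idx_path, shuffle_idx, allow_pickle=False)


if __name__ == "__main__":
    build_index_mappings(Path("index-cache"), num_samples=1_000_000, seed=1234)
```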