Open RaymondLi0 opened 1 year ago
When launching very long training runs, building the index mappings can take more than 1 minute, and as a consequence the other ranks time out at this broadcast: https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/megatron/training.py#L962. However, the timeout passed to torch.distributed.initialize is 10 minutes. Why isn't this value used in torch.distributed.broadcast?
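For context, here is a minimal sketch of the pattern that triggers the timeout (assumptions: plain PyTorch distributed rather than Megatron's actual code, the gloo backend for portability, and a hypothetical slow build step standing in for the index-mapping construction). The timeout passed to init_process_group is the 10-minute value mentioned above; whether a collective such as broadcast actually honors it depends on the backend, which is exactly the question raised here.

```python
import time
from datetime import timedelta

import torch
import torch.distributed as dist


def slow_build_index_mappings() -> None:
    # Hypothetical stand-in for the >1 minute index-mapping build on rank 0.
    time.sleep(90)


def main() -> None:
    # The 10-minute timeout the issue refers to. Launch with e.g.
    # `torchrun --nproc_per_node=2 repro.py`.
    dist.init_process_group(backend="gloo", timeout=timedelta(minutes=10))

    flag = torch.zeros(1)
    if dist.get_rank() == 0:
        slow_build_index_mappings()
        flag += 1

    # Non-zero ranks block here while rank 0 builds; if the wait exceeds the
    # effective collective timeout, they abort with a timeout error.
    dist.broadcast(flag, src=0)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```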
The workaround for now is to first create the index mappings on a single worker in a preliminary run.
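A sketch of that preliminary run follows (the cache layout, function, and argument names are hypothetical stand-ins for Megatron's dataset-building code, not its actual API). The idea is to run it as a plain single-process job so the cached index files already exist, and every rank takes the fast cache-hit path, when the real multi-rank job starts.

```python
from pathlib import Path

import numpy as np


def build_index_mappings(cache_dir: Path, num_samples: int, seed: int) -> None:
    # Stand-in for the index-mapping build: compute the shuffled sample
    # index once, save it to disk, and skip the work on later runs.
    cache_dir.mkdir(parents=True, exist_ok=True)
    idx_path = cache_dir / f"sample_idx_{num_samples}_{seed}.npy"
    if idx_path.exists():
        return  # cache hit: the real training run takes this fast path

    rng = np.random.default_rng(seed)
    shuffle_idx = rng.permutation(num_samples)
    np.save(idx_path, shuffle_idx, allow_pickle=False)


if __name__ == "__main__":
    build_index_mappings(Path("index-cache"), num_samples=1_000_000, seed=1234)
```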