bigcode-project / Megatron-LM

Ongoing research training transformer models at scale

Timeout on creating the index mappings #15

Open RaymondLi0 opened 1 year ago

RaymondLi0 commented 1 year ago

When launching very long training runs, building the index mappings can take more than 1 minute. As a result, the other ranks time out: https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/megatron/training.py#L962 However, the timeout passed to torch.distributed.init_process_group is 10 minutes. Why isn't this value used in torch.distributed.broadcast?

The workaround for now is to first create the index mappings on a single worker, as a preliminary run.
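
A rough sketch of why the preliminary run helps, assuming the mappings are cached on disk and reused when the file already exists. The function name `build_or_load_index`, the JSON cache file, and the plain shuffle are hypothetical stand-ins for illustration, not Megatron-LM's actual code, and the distributed part is left out entirely:

```python
import json
import os
import random
import tempfile


def build_or_load_index(cache_path, num_samples, seed=1234):
    """Return (index, built) for a shuffled sample index cached at cache_path.

    Hypothetical helper: `built` is True when the index had to be computed,
    False when it was loaded from the existing cache file.
    """
    if os.path.exists(cache_path):
        # Fast path: a previous (preliminary) run already built the mapping.
        with open(cache_path, "r") as f:
            return json.load(f), False

    # Slow path: this stands in for the expensive index-building step.
    rng = random.Random(seed)
    index = list(range(num_samples))
    rng.shuffle(index)

    # Write to a temporary file, then rename, so that a reader never
    # sees a partially written cache file.
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(index, f)
    os.replace(tmp_path, cache_path)
    return index, True


if __name__ == "__main__":
    cache_path = os.path.join(tempfile.mkdtemp(), "shuffle_idx.json")

    # Preliminary single-worker run: pays the build cost once.
    _, built = build_or_load_index(cache_path, num_samples=1000)
    print("preliminary run built the index:", built)

    # Actual training launch: only loads the cached file.
    _, built = build_or_load_index(cache_path, num_samples=1000)
    print("training run built the index:", built)
```

With the cache in place, the real launch only has to load the file, so the other ranks are not left waiting on the slow build.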