foundation-model-stack / fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
https://pytorch.org/docs/stable/fsdp.html
Apache License 2.0

Nccl timeout #19

Closed lchu-ibm closed 4 months ago

lchu-ibm commented 4 months ago

When initializing a large model (e.g. 70B), model creation + init + post-init (reset_parameters()) can take more than 30 minutes. In low_cpu_mode this leads to an NCCL timeout: the non-zero-rank GPUs have to wait for rank 0 for more than 30 minutes, which exceeds the default NCCL timeout.

We should increase the timeout from 30 minutes to 60 minutes, as we did before.
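
For illustration, here is a minimal sketch of how the longer timeout could be passed when the process group is created. It is not the repo's actual change: the helper names `process_group_kwargs` and `init_distributed` are hypothetical, and the 30-minute default is the value cited above (it may differ between PyTorch versions). The only library call relied on is `torch.distributed.init_process_group` with its `backend` and `timeout` arguments.

```python
from datetime import timedelta

# Default timeout as cited in this issue, and the proposed replacement.
DEFAULT_TIMEOUT = timedelta(minutes=30)
PROPOSED_TIMEOUT = timedelta(minutes=60)


def process_group_kwargs(timeout_minutes: int = 60) -> dict:
    """Build keyword arguments for torch.distributed.init_process_group.

    Hypothetical helper: kept separate so the timeout can be checked
    without starting a distributed job.
    """
    if timeout_minutes <= 0:
        raise ValueError("timeout_minutes must be positive")
    return {"backend": "nccl", "timeout": timedelta(minutes=timeout_minutes)}


def init_distributed(timeout_minutes: int = 60) -> None:
    """Create the default process group with a longer collective timeout.

    Assumes the launcher (e.g. torchrun) has already set the usual
    rank / world-size environment variables.
    """
    # Imported lazily so process_group_kwargs() works without torch installed.
    import torch.distributed as dist

    dist.init_process_group(**process_group_kwargs(timeout_minutes))
```

With this in place, each rank would call `init_distributed()` once at startup, before the model is built, so that non-zero ranks waiting on rank 0 get the full 60 minutes.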