🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
When initializing a large model (e.g. 70B), model creation, init, and post-init (reset_parameters()) together can take more than 30 minutes. In low_cpu_mode this leads to an NCCL timeout, because the non-zero-rank GPUs have to wait for rank 0 for longer than the default NCCL timeout of 30 minutes.
We should increase the timeout from 30 minutes to 60 minutes, as we did before.
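
As a rough illustration of the change described above, the timeout can be passed when the process group is created. This is a minimal sketch, not necessarily the diff in this PR: the names `INIT_TIMEOUT` and `setup_distributed` are hypothetical, and it assumes the job is launched with the usual environment variables (for example via torchrun).

```python
from datetime import timedelta

# 60 minutes instead of 30, per the description above.
# INIT_TIMEOUT is a hypothetical name used only for this sketch.
INIT_TIMEOUT = timedelta(minutes=60)


def setup_distributed(backend: str = "nccl", timeout: timedelta = INIT_TIMEOUT) -> None:
    """Create the default process group with an extended collective timeout."""
    # Imported lazily so the constant above is usable without torch installed.
    import torch.distributed as dist

    # Assumes RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are already set
    # in the environment (e.g. by torchrun).
    dist.init_process_group(backend=backend, timeout=timeout)
```

With a longer timeout, the non-zero ranks can keep waiting while rank 0 finishes building and initializing the model, at the cost of real hangs taking longer to surface.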