foundation-model-stack / fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
https://pytorch.org/docs/stable/fsdp.html
Apache License 2.0
162 stars 27 forks

increase nccl timeout #20

Closed lchu-ibm closed 7 months ago

lchu-ibm commented 7 months ago

When initializing a large model (e.g. 70b), model creation + init + post-init (`reset_parameters()`) can take more than 30 minutes. Under low_cpu_mode this leads to an NCCL timeout: the non-rank-0 GPUs must wait for rank 0 for more than 30 minutes, which exceeds the default NCCL timeout.

We should increase this from 30 minutes to 60 minutes, as we did before.
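For reference, a minimal sketch of how the longer timeout could be passed to `torch.distributed.init_process_group` (the exact call site in fms-fsdp may differ; the 60-minute value mirrors the proposal above):

```python
from datetime import timedelta

# Raised from PyTorch's default of ~30 minutes for the NCCL backend,
# so slow rank-0 model construction does not trip the watchdog.
NCCL_TIMEOUT = timedelta(minutes=60)


def setup_distributed() -> None:
    """Initialize the process group with the longer NCCL timeout.

    Assumes torchrun-style env vars (RANK, WORLD_SIZE, MASTER_ADDR, ...)
    are already set; `torch` is imported lazily since this sketch only
    runs in a distributed job.
    """
    import torch.distributed as dist

    dist.init_process_group(backend="nccl", timeout=NCCL_TIMEOUT)
```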

lchu-ibm commented 7 months ago

@nairbv

`model = Llama(70b)` alone takes more than 10 minutes. It is indeed slow for large models, which is part of the reason I went back and forth between implementation 1 and 2 as discussed in https://github.com/foundation-model-stack/fms-fsdp/issues/6.

Davis and I will revisit a "post-FSDP way of doing post-init/reset_parameters" in the future, which would boost model performance by letting us use implementation 2.

nairbv commented 7 months ago

> `model = Llama(70b)` alone takes more than 10 minutes

Is the timeout for the time to sync parameters? I thought it was just a timeout for NCCL initialization/coordination.