foundation-model-stack / fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
https://pytorch.org/docs/stable/fsdp.html
Apache License 2.0

optimize profiler trace generation #37

Closed lchu-ibm closed 4 months ago

lchu-ibm commented 4 months ago

By default, a profiler trace is generated for every GPU. In a multi-GPU setup, this means many extra threads run concurrently on the GPUs during the steps being traced. This can make those steps slower than expected, so the timings reported for them become unreliable.

In most cases this is fine, but in some scenarios the captured steps are significantly slower than normal due to the overhead of the extra threads added during those steps.

We should revisit the trace-generation code and generate the trace only on rank 0 to avoid this (see the sketch below).
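
For illustration, here is a minimal sketch (not the repo's actual code) of how trace generation could be gated on rank 0, so only one process pays the trace-export overhead. The helper name, trace directory, and schedule values are assumptions, not values from fms-fsdp:

```python
# Minimal sketch: construct a torch.profiler context only on rank 0 so the
# other ranks do not add profiling threads during the captured steps.
import torch
import torch.distributed as dist
from torch.profiler import profile, schedule, ProfilerActivity


def maybe_get_profiler(trace_dir: str = "./profiler_traces"):
    """Return a profiler context on rank 0, or None on all other ranks.

    `trace_dir` and the schedule numbers below are illustrative only.
    """
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank != 0:
        return None
    return profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler(trace_dir),
    )


# Usage inside a training loop: only rank 0 steps and exports the trace.
# prof = maybe_get_profiler()
# if prof is not None:
#     prof.start()
# for step, batch in enumerate(dataloader):
#     ...  # forward / backward / optimizer step
#     if prof is not None:
#         prof.step()
# if prof is not None:
#     prof.stop()
```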