🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
With the current profiler, each GPU writes its own trace. This can sometimes be unnecessary or unwanted: on a large job, one might want to avoid writing 1024 traces (each hundreds of MB) to the same shared location at the same time.
We should provide a new flag controlling whether to write the profiler trace from the rank-0 GPU only. This should cover most use cases.
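A minimal sketch of what this could look like, assuming a hypothetical `rank0_only` flag (the name and wiring are illustrative, not the project's actual config): all ranks still run the profiler so the training loop stays uniform, but only rank 0 attaches a trace handler, so only one trace is written.

```python
import torch.distributed as dist
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler


def make_profiler(trace_dir: str, rank0_only: bool = True):
    # `rank0_only` is the hypothetical flag proposed above: when True,
    # only global rank 0 writes a trace; other ranks profile but discard it.
    rank = dist.get_rank() if dist.is_initialized() else 0
    write_trace = (rank == 0) or not rank0_only
    return profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        # Attach a trace handler only where a trace should be written.
        on_trace_ready=tensorboard_trace_handler(trace_dir) if write_trace else None,
    )


# Usage inside a training loop:
# with make_profiler("./traces", rank0_only=True) as prof:
#     for step, batch in enumerate(loader):
#         train_step(batch)
#         prof.step()
```

One could go further and skip profiling entirely on non-zero ranks to avoid the profiling overhead, but keeping the profiler active on every rank keeps per-rank behavior identical, which avoids introducing rank-dependent timing skew.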