🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
With the current profiler, each GPU writes its own trace. This can sometimes be unnecessary or unwanted: on a large job, one might want to avoid writing 1024 traces (each hundreds of MB) to the same shared location at the same time.
We should provide a new flag controlling whether to write the profiler trace from the rank-0 GPU only. This should cover most use cases.
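A minimal sketch of what this could look like, assuming a hypothetical `rank0_only` flag (the name and wiring are illustrative, not the project's actual config): all ranks still run the profiler so the training loop stays uniform, but only rank 0 attaches a trace handler, so only one trace is written.

```python
import torch.distributed as dist
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler


def make_profiler(trace_dir: str, rank0_only: bool = True):
    # `rank0_only` is the hypothetical flag proposed above: when True,
    # only global rank 0 writes a trace; other ranks profile but discard it.
    rank = dist.get_rank() if dist.is_initialized() else 0
    write_trace = (rank == 0) or not rank0_only
    return profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        # Attach a trace handler only where a trace should be written.
        on_trace_ready=tensorboard_trace_handler(trace_dir) if write_trace else None,
    )


# Usage inside a training loop:
# with make_profiler("./traces", rank0_only=True) as prof:
#     for step, batch in enumerate(loader):
#         train_step(batch)
#         prof.step()
```

One could go further and skip profiling entirely on non-zero ranks to avoid the profiling overhead, but keeping the profiler active on every rank keeps per-rank behavior identical, which avoids introducing rank-dependent timing skew.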