foundation-model-stack / fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
https://pytorch.org/docs/stable/fsdp.html
Apache License 2.0

optimize profiler trace generation #37

Closed lchu-ibm closed 4 months ago

lchu-ibm commented 4 months ago

By default, a profiler trace is generated for every GPU. In a multi-GPU setup, this means many extra threads run concurrently on the GPUs during the steps being traced. This can make those steps slower than expected, so the timings reported for them become unreliable.

In most cases this is fine, but in some scenarios the captured steps are significantly slower than normal due to the overhead of the extra threads added during those steps.

We should revisit the trace-generation code and generate the trace only on rank 0 to avoid this (see the sketch below).
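
For illustration, here is a minimal sketch (not the repo's actual code) of how trace generation could be gated on rank 0, so only one process pays the trace-export overhead. The helper name, trace directory, and schedule values are assumptions, not values from fms-fsdp:

```python
# Minimal sketch: construct a torch.profiler context only on rank 0 so the
# other ranks do not add profiling threads during the captured steps.
import torch
import torch.distributed as dist
from torch.profiler import profile, schedule, ProfilerActivity


def maybe_get_profiler(trace_dir: str = "./profiler_traces"):
    """Return a profiler context on rank 0, or None on all other ranks.

    `trace_dir` and the schedule numbers below are illustrative only.
    """
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank != 0:
        return None
    return profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler(trace_dir),
    )


# Usage inside a training loop: only rank 0 steps and exports the trace.
# prof = maybe_get_profiler()
# if prof is not None:
#     prof.start()
# for step, batch in enumerate(dataloader):
#     ...  # forward / backward / optimizer step
#     if prof is not None:
#         prof.step()
# if prof is not None:
#     prof.stop()
```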