NVIDIA / Megatron-LM

Ongoing research training transformer models at scale

[BUG] Rank/world-size mismatch prevents the tensorboard writer from being set #872

Closed zainsarwar865 closed 6 days ago

zainsarwar865 commented 1 week ago

In `global_vars.py`, the `_set_tensorboard_writer` function only creates the TensorBoard writer when `rank == world_size - 1`. This is a problem when the rank passed in is a local rank and the world size is larger than the number of GPUs on a single node: no process ever satisfies the condition, so the writer is never set.
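For reference, a simplified sketch of the check described above (paraphrased, not verbatim from `global_vars.py`; the exact guard, argument names, and writer options may differ between versions):

```python
from torch.utils.tensorboard import SummaryWriter

_GLOBAL_TENSORBOARD_WRITER = None

def _set_tensorboard_writer(args):
    """Simplified sketch of the current behavior."""
    global _GLOBAL_TENSORBOARD_WRITER
    # The writer is only created on the last rank. If args.rank holds a
    # *local* rank, no process ever satisfies this condition when
    # world_size exceeds the number of GPUs on a single node.
    if getattr(args, 'tensorboard_dir', None) and \
            args.rank == (args.world_size - 1):
        _GLOBAL_TENSORBOARD_WRITER = SummaryWriter(
            log_dir=args.tensorboard_dir)
```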

Thus, I propose creating the writer on the master node (rank 0) instead.
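A possible change along those lines (hypothetical sketch only, with rank 0 standing in for the master node):

```python
from torch.utils.tensorboard import SummaryWriter

_GLOBAL_TENSORBOARD_WRITER = None

def _set_tensorboard_writer(args):
    """Proposed variant: create the writer on the master node (rank 0)."""
    global _GLOBAL_TENSORBOARD_WRITER
    # Rank 0 always exists regardless of how many GPUs each node has,
    # so the writer is created exactly once even when the ranks passed
    # in are local and world_size spans multiple nodes.
    if getattr(args, 'tensorboard_dir', None) and args.rank == 0:
        _GLOBAL_TENSORBOARD_WRITER = SummaryWriter(
            log_dir=args.tensorboard_dir)
```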