FlagOpen / FlagScale

FlagScale is a large model toolkit based on open-sourced projects.

[AutoTuner] Update multi-node scene #136

Closed Caozhou1995 closed 3 months ago

Caozhou1995 commented 3 months ago

In a multi-node scenario, if some nodes hit an error, torchrun does not exit immediately; the task has to wait until the timeout before it is forcibly killed. This PR kills the task automatically by querying the task status and detecting when it changes from running to transitional.
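
The polling idea described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `watch_and_kill`, `query_status`, `kill_task`, the status strings, and the `done_statuses` parameter are all hypothetical names chosen for the example, and the real AutoTuner status values may differ.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical status labels; the real AutoTuner values may differ.
RUNNING = "running"
TRANSITIONAL = "transitional"


def watch_and_kill(
    query_status: Callable[[], str],
    kill_task: Callable[[], None],
    poll_interval: float = 1.0,
    max_polls: Optional[int] = None,
    done_statuses: Tuple[str, ...] = ("completed",),
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the task status and kill the task on a running -> transitional change.

    Returns True if the task was killed, False if it finished on its own
    or the polling budget ran out.
    """
    previous = None
    polls = 0
    while max_polls is None or polls < max_polls:
        current = query_status()
        polls += 1
        # Some node left the running state: stop the whole task now
        # instead of waiting for the torchrun timeout.
        if previous == RUNNING and current == TRANSITIONAL:
            kill_task()
            return True
        # The task ended normally, so there is nothing to kill.
        if current in done_statuses:
            return False
        previous = current
        sleep(poll_interval)
    return False
```

In a real setup, `query_status` would wrap whatever the launcher uses to report the job state across nodes, and `kill_task` would stop the torchrun processes on every node. Injecting both as callables keeps the sketch independent of any particular launcher.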