Closed Caozhou1995 closed 3 months ago
In multi-node scenarios, if some nodes fail, torchrun does not exit immediately; the job has to wait for the timeout to expire before it is forcibly killed. This PR kills the task automatically by polling the task status and acting when it changes from running to transitional.
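The polling approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `watch_and_kill`, `get_status`, and the status strings `"running"` and `"transitional"` are hypothetical names, and the real status source (for example a scheduler API) is assumed to be wrapped in a plain callable.

```python
import subprocess
import sys
import time
from typing import Callable

RUNNING = "running"            # hypothetical status value
TRANSITIONAL = "transitional"  # hypothetical status value


def watch_and_kill(
    proc: subprocess.Popen,
    get_status: Callable[[], str],
    poll_interval: float = 1.0,
    grace: float = 5.0,
) -> bool:
    """Poll get_status() while proc is alive.

    If the status changes from RUNNING to TRANSITIONAL, terminate proc
    (escalating to kill after `grace` seconds) and return True.
    Return False if proc exits on its own first.
    """
    previous = None
    while proc.poll() is None:
        current = get_status()
        if previous == RUNNING and current == TRANSITIONAL:
            proc.terminate()  # polite stop request first
            try:
                proc.wait(timeout=grace)
            except subprocess.TimeoutExpired:
                proc.kill()   # force kill if it ignores the request
                proc.wait()
            return True
        previous = current
        time.sleep(poll_interval)
    return False


if __name__ == "__main__":
    # Stand-in for a long-running launcher process (assumption: any
    # child process works the same way for this sketch).
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    fake_statuses = iter([RUNNING, RUNNING, TRANSITIONAL])
    killed = watch_and_kill(child, lambda: next(fake_statuses), poll_interval=0.05)
    print("killed by watcher:", killed)
```

Comparing against the previous status, rather than only checking the current one, means the watcher reacts to the running-to-transitional change specifically and does not fire on a task that has not yet reached the running state.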