bytedance / byteps

A high performance and generic framework for distributed DNN training

launcher: join workers as they exit #429

Closed pleasantrabbit closed 2 years ago

pleasantrabbit commented 2 years ago

Check each worker's exit status in the order the workers exit. This way, a failed worker is discovered early and the entire job can be terminated as soon as possible.
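The idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes workers are plain `subprocess.Popen` children, and the helper name `wait_in_exit_order` and its `poll_interval` parameter are hypothetical. It polls all workers, handles each one as soon as it exits, and terminates the rest on the first non-zero exit code.

```python
import subprocess
import sys
import time


def wait_in_exit_order(procs, poll_interval=0.1):
    """Reap workers as they exit (hypothetical helper, not the BytePS API).

    Returns 0 if every worker exits cleanly. On the first non-zero exit
    code, terminates the remaining workers and returns that code.
    """
    pending = list(procs)
    while pending:
        for proc in list(pending):
            code = proc.poll()  # None while the worker is still running
            if code is None:
                continue
            pending.remove(proc)
            if code != 0:
                # A worker failed: stop the others instead of waiting on them.
                for other in pending:
                    other.terminate()
                for other in pending:
                    other.wait()
                return code
        time.sleep(poll_interval)
    return 0


if __name__ == "__main__":
    # Worker 0 would run for 30 s; worker 1 fails immediately.
    cmds = [
        [sys.executable, "-c", "import time; time.sleep(30)"],
        [sys.executable, "-c", "import sys; sys.exit(3)"],
    ]
    workers = [subprocess.Popen(cmd) for cmd in cmds]
    sys.exit(wait_in_exit_order(workers))
```

Joining the workers one by one in launch order would instead block on worker 0 for the full 30 seconds before the failure of worker 1 was noticed.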

Signed-off-by: yulu.jia <yulu.jia@bytedance.com>