Closed: ajayvohra2005 closed this issue 2 years ago
The Worker restartPolicy in the MPIJob specification should be set to Never so that a Worker replica does not restart on error. The training job must fail if a Worker replica encounters an error.
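As a rough illustration of the requested setting, the sketch below models an MPIJob manifest as a Python dict and checks that the Worker replica spec uses restartPolicy Never. The apiVersion, job name, replica count, and both helper functions are assumptions made for this example and are not taken from the repository. Only the spec.mpiReplicaSpecs.Worker.restartPolicy path reflects the setting discussed in this issue.

```python
# Hypothetical MPIJob manifest as a Python dict. The apiVersion, name, and
# replica count are illustrative assumptions, not values from the repository.
mpijob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "MPIJob",
    "metadata": {"name": "example-training-job"},
    "spec": {
        "mpiReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                # The setting this issue asks for: do not restart on error.
                "restartPolicy": "Never",
            },
        },
    },
}


def worker_restart_policy(manifest):
    """Return the Worker restartPolicy, or None if it is not set."""
    return (
        manifest.get("spec", {})
        .get("mpiReplicaSpecs", {})
        .get("Worker", {})
        .get("restartPolicy")
    )


def check_worker_never_restarts(manifest):
    """Raise ValueError unless the Worker restartPolicy is 'Never'."""
    policy = worker_restart_policy(manifest)
    if policy != "Never":
        raise ValueError(
            f"Worker restartPolicy is {policy!r}; expected 'Never'"
        )


check_worker_never_restarts(mpijob)
```

A check like this could run before the manifest is submitted, so a misconfigured Worker policy is caught early instead of showing up as a replica that silently restarts mid-training.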
Commit 16563f62ae0b19831563cd0253a6d723170d12ba resolves this issue.