aws-samples / amazon-eks-machine-learning-with-terraform-and-kubeflow

Distributed training using Kubeflow on Amazon EKS
Apache License 2.0
82 stars, 42 forks

Worker restart policy should be Never #26

Closed: ajayvohra2005 closed this issue 2 years ago

ajayvohra2005 commented 2 years ago

The Worker restartPolicy in the MPIJob specification should be set to Never so that a worker replica does not restart on error. The training job must fail if a Worker replica encounters an error.
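
For illustration only, here is a minimal sketch of what such a specification might look like, written as a Python dict so it can be checked before being submitted. It assumes the usual Kubeflow MPIJob layout (`spec.mpiReplicaSpecs.Worker.restartPolicy`). The job name, image, and the helper names `build_mpijob` and `worker_restart_policy` are hypothetical and not taken from this repository.

```python
def build_mpijob(name, image, workers):
    """Return a minimal MPIJob manifest as a dict (illustrative sketch only)."""
    def pod_template(container_name):
        # Hypothetical single-container pod template.
        return {"spec": {"containers": [{"name": container_name, "image": image}]}}

    return {
        "apiVersion": "kubeflow.org/v1",  # assumed API version
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "mpiReplicaSpecs": {
                # The Launcher restart policy is not discussed in this issue,
                # so it is left unset here.
                "Launcher": {"replicas": 1, "template": pod_template("launcher")},
                "Worker": {
                    "replicas": workers,
                    # The point of this issue: a failed worker must not be
                    # restarted, so that the training job fails.
                    "restartPolicy": "Never",
                    "template": pod_template("worker"),
                },
            }
        },
    }


def worker_restart_policy(job):
    """Return the Worker restartPolicy of an MPIJob dict, or None if unset."""
    return job["spec"]["mpiReplicaSpecs"]["Worker"].get("restartPolicy")


if __name__ == "__main__":
    job = build_mpijob("demo-train", "example.com/train:latest", workers=2)
    if worker_restart_policy(job) != "Never":
        raise SystemExit("Worker restartPolicy must be Never")
    print("Worker restartPolicy:", worker_restart_policy(job))
```

A check like `worker_restart_policy` could run in CI against rendered manifests to catch a regression of this setting.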

ajayvohra2005 commented 2 years ago

Commit 16563f62ae0b19831563cd0253a6d723170d12ba resolves this issue.