Closed: ahmet-uyar closed this 4 years ago
I implemented automatic resubmission of MPI jobs in Kubernetes, standalone, and Slurm clusters. I also improved the handling of JobMaster failures in Kubernetes clusters.