We should add the following snippet to all Slurm examples so that, when a job runs on a HyperPod cluster, the --auto-resume=1 flag is passed to srun automatically. This needs to be tested for every example; see https://github.com/aws-samples/awsome-distributed-training/pull/231 for an example.
AUTO_RESUME=""
if [ -d "/opt/sagemaker_cluster" ]; then
    echo "Detected Hyperpod cluster.. enabling --auto-resume=1"
    AUTO_RESUME="--auto-resume=1"
fi
srun ${AUTO_RESUME}
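
To make the check easier to test off-cluster, the same logic could be wrapped in a small helper that takes the marker directory as an argument. This is only a sketch: the function name auto_resume_flag and the trailing hostname command are placeholders that do not appear in any existing example, and echo is used so the resulting command is printed rather than launched.

```shell
#!/bin/sh
# Sketch only: auto_resume_flag is a hypothetical helper name.
# The marker path and the flag are the ones used in the snippet above.
auto_resume_flag() {
    # $1: directory whose presence marks a HyperPod cluster
    marker_dir="${1:-/opt/sagemaker_cluster}"
    if [ -d "$marker_dir" ]; then
        printf '%s' "--auto-resume=1"
    fi
}

AUTO_RESUME="$(auto_resume_flag)"
# Left unquoted on purpose: an empty value must expand to no argument at all.
# echo prints the command instead of running it, so this works without Slurm;
# "hostname" stands in for the real workload.
echo srun ${AUTO_RESUME} hostname
```

Passing a temporary directory to the helper lets each example be checked on a non-HyperPod machine before it is tested on a real cluster.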