aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution

Enable autoresume for all Slurm examples #232

Closed sean-smith closed 1 month ago

sean-smith commented 6 months ago

We should add the following snippet to all Slurm examples so that, when a job runs on a HyperPod cluster, the --auto-resume=1 flag is passed to srun automatically. This needs to be tested for every example; see https://github.com/aws-samples/awsome-distributed-training/pull/231 for an example.

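# Empty by default, so srun is unchanged on non-HyperPod clusters.
AUTO_RESUME=""
# The /opt/sagemaker_cluster directory marks a HyperPod cluster.
if [ -d "/opt/sagemaker_cluster" ]; then
    echo "Detected HyperPod cluster, enabling --auto-resume=1"
    AUTO_RESUME="--auto-resume=1"
fi

# Left unquoted on purpose: an empty value expands to no argument at all.
srun ${AUTO_RESUME}
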
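Since the same check would be repeated in every example, one option is to wrap it in a small function. The following is only a sketch: the function name `auto_resume_flag` and its optional directory argument are hypothetical (added so the check can be exercised off-cluster), and `hostname` is a placeholder for an example's real command. The marker path and the flag are the ones from the snippet above.

```shell
#!/bin/sh

# Hypothetical helper: print the srun flag when the marker directory exists,
# and print nothing otherwise. The optional first argument overrides the
# directory to check; it defaults to the path used in the snippet above.
auto_resume_flag() {
    marker_dir="${1:-/opt/sagemaker_cluster}"
    if [ -d "$marker_dir" ]; then
        printf '%s\n' "--auto-resume=1"
    fi
}

AUTO_RESUME="$(auto_resume_flag)"

# Dry run: print the command instead of launching it.
# ${AUTO_RESUME} is unquoted on purpose so an empty value adds no argument.
echo srun ${AUTO_RESUME} hostname
```

Replacing the final `echo srun` with `srun` would launch the job for real; printing first makes it easy to confirm what each example would run on both kinds of cluster.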
github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 14 days since being marked as stale.