aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution
177 stars 74 forks

FSDP Training Job failing on Validation Step (Batch 500) #340

Open nghtm opened 4 months ago

nghtm commented 4 months ago

I am running the script 3.test_cases/10.FSDP/1.distributed-training.sbatch on 2 P5 nodes, and the job fails at the validation step after 500 batches.

slurm-47.log

0: OSError: [Errno 12] Cannot allocate memory

Configuration: SageMaker HyperPod - 2x P5 nodes

Ubuntu 20.04 DLAMI, NCCL version 2.19.4+cuda12.1

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 30 days with no activity.