nghtm opened this issue 4 months ago (status: Open)
Running the script 3.test_cases/10.FSDP/1.distributed-training.sbatch on 2 P5 nodes fails at the validation step after 500 batches.
slurm-47.log
0: OSError: [Errno 12] Cannot allocate memory
Configuration: SageMaker HyperPod - 2x P5 nodes
Ubuntu 20.04 DLAMI, NCCL version 2.19.4+cuda12.1
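For anyone reading the log, the errno in the message above can be decoded with a short Python check. This is only a sketch: `describe_errno` is a hypothetical helper, not part of the training script, and it does not identify which allocation failed.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return the symbolic name and OS message for an errno value."""
    # errno.errorcode maps numeric codes to their symbolic names.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's message for the code.
    return f"{name}: {os.strerror(code)}"


# Errno 12 is the code reported in slurm-47.log.
print(describe_errno(12))
```

On Linux, code 12 corresponds to ENOMEM, meaning the operating system refused a memory request from the process on rank 0.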
This issue is stale because it has been open for 30 days with no activity.