Hey, thanks for your awesome project! I want to run some multi-node training with the following setup:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
# Get the list of node names
nodes_array=($(scontrol show hostnames $SLURM_JOB_NODELIST))
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# Set environment variables for distributed training
MASTER_ADDR=$head_node
MASTER_PORT=29501
WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
RANK=$SLURM_PROCID
LOCAL_RANK=$SLURM_LOCALID
export MASTER_ADDR
export MASTER_PORT
export WORLD_SIZE
export RANK
export LOCAL_RANK
echo "Node IP: $head_node_ip"
export LOGLEVEL=INFO
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=$SLURM_NTASKS_PER_NODE \
    --rdzv_id=$RANDOM \
    --rdzv_backend=c10d \
    --rdzv_conf=timeout=9000 \
    --rdzv_endpoint=$head_node_ip:$MASTER_PORT \
    scripts/pretrain.py
....
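For context, this is how I assume each worker process is supposed to pick its GPU once torchrun has spawned it (a minimal sketch of the relevant part only, not the actual contents of scripts/pretrain.py):

import os
import torch
import torch.distributed as dist

def setup_distributed():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns,
    # so the script should not need the SLURM_* values exported above.
    local_rank = int(os.environ["LOCAL_RANK"])
    # Pin this worker to its own GPU before creating the process group;
    # if this is skipped, several ranks can end up on the same CUDA device.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    return local_rank

local_rank = setup_distributed()
print(f"rank {dist.get_rank()}/{dist.get_world_size()} -> cuda:{local_rank}")
dist.destroy_process_group()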
I'm running into issues like:
Duplicate GPU detected : rank 2 and rank 10 both on CUDA device 50000
Could you share the setup for multi-node training that works for you?