Hey, thanks for your awesome project! I want to run some multi-node training with the following setup:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
# Get the list of node names
nodes_array=($(scontrol show hostnames $SLURM_JOB_NODELIST))
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# Set environment variables for distributed training
MASTER_ADDR=$head_node
MASTER_PORT=29501
WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
RANK=$SLURM_PROCID
LOCAL_RANK=$SLURM_LOCALID
export MASTER_ADDR
export MASTER_PORT
export WORLD_SIZE
export RANK
export LOCAL_RANK
echo "Node IP: $head_node_ip"
export LOGLEVEL=INFO
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=$SLURM_NTASKS_PER_NODE \
    --rdzv_id=$RANDOM \
    --rdzv_backend=c10d \
    --rdzv_conf=timeout=9000 \
    --rdzv_endpoint=$head_node_ip:$MASTER_PORT \
    scripts/pretrain.py
....
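For context, this is how I assume each worker process is supposed to pick its GPU once torchrun has spawned it (a minimal sketch of the relevant part only, not the actual contents of scripts/pretrain.py):

import os
import torch
import torch.distributed as dist

def setup_distributed():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns,
    # so the script should not need the SLURM_* values exported above.
    local_rank = int(os.environ["LOCAL_RANK"])
    # Pin this worker to its own GPU before creating the process group;
    # if this is skipped, several ranks can end up on the same CUDA device.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    return local_rank

local_rank = setup_distributed()
print(f"rank {dist.get_rank()}/{dist.get_world_size()} -> cuda:{local_rank}")
dist.destroy_process_group()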
I'm running into issues like:
Duplicate GPU detected : rank 2 and rank 10 both on CUDA device 50000
Could you share the setup for multi-node training that works for you?