Open lq-blackcat opened 5 months ago
Very quick. If you get stuck at this step, there is usually a mistake in your script.
#!/bin/bash
#SBATCH --job-name=long-clip
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --gres=gpu:1
#SBATCH --time=96:00:00
#SBATCH --comment pris718bobo

source ~/.bashrc
export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 train.py
What needs to be modified? Could you please provide some help? @beichenzbc
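One common cause of a hang with this kind of launch is a mismatch between the env vars `torchrun` exports and what `train.py` passes to `dist.init_process_group`. Below is a minimal sketch (not Long-CLIP's actual code; the function name `init_distributed`, the gloo fallback for CPU-only machines, and the 5-minute timeout are all assumptions) of an init that works under `torchrun`:

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist


def init_distributed():
    """Read the env vars torchrun exports and join the process group."""
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, so the
    # default init_method="env://" picks them up automatically.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(
        backend=backend,
        world_size=world_size,
        rank=rank,
        timeout=timedelta(minutes=5),  # fail with an error instead of hanging forever
    )
    return rank, world_size


if __name__ == "__main__":
    # Single-process defaults so the script also runs without torchrun.
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank, world_size = init_distributed()
    print(f"rank {rank} of {world_size} initialized")
    dist.destroy_process_group()
```

If the script hardcodes `world_size` or `rank` to values that disagree with what `torchrun` launched, `init_process_group` will wait for peers that never arrive, which looks exactly like a hang.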
Did you resolve this problem? I ran into the same issue.
How long does distributed training initialization take?

dist.init_process_group(
    backend=backend,
    world_size=world_size,
    rank=rank,
)
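With a correct setup this call returns almost immediately, even for multi-node jobs. A quick way to see that locally is a single-process timing check (the gloo backend, loopback master address, and port are assumptions for a CPU-only sketch; `torchrun` would normally set these env vars itself):

```python
import os
import time
from datetime import timedelta

import torch.distributed as dist

# Stand-ins for the env vars torchrun would normally export.
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")

start = time.perf_counter()
dist.init_process_group(
    backend="gloo",                 # CPU-only backend, fine for this check
    world_size=1,
    rank=0,
    timeout=timedelta(seconds=60),  # surface hangs as errors, not freezes
)
elapsed = time.perf_counter() - start

print(f"init_process_group returned in {elapsed:.3f}s")
dist.destroy_process_group()
```

If this returns quickly but the real job hangs, the problem is in the rendezvous (wrong `world_size`/`rank`, unreachable master address, or a launch script that never actually runs `torchrun`), not in PyTorch itself.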