VITA-Group / GNT

[ICLR 2023] "Is Attention All NeRF Needs?" by Mukund Varma T*, Peihao Wang*, Xuxi Chen, Tianlong Chen, Subhashini Venugopalan, Zhangyang Wang
https://vita-group.github.io/GNT
MIT License

How to apply multi-gpu training? #13

Closed lifuguan closed 1 year ago

lifuguan commented 1 year ago

Hello, thanks for the great work! I'm wondering how we can apply multi-GPU training.

I use the following command:

python train.py --config configs/gnt_ft_rffr.txt --distributed --local_rank 2

but it raises the following error:

Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

The distributed training code in train.py is shown below:

    if args.distributed:
        # env:// rendezvous: RANK / WORLD_SIZE must already be set in the environment
        torch.distributed.init_process_group(backend="nccl", init_method="env://localhost:50000")
        # LOCAL_RANK is likewise read from the environment, not from the CLI flag
        args.local_rank = int(os.environ.get("LOCAL_RANK"))
        torch.cuda.set_device(args.local_rank)
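For reference, the env:// rendezvous expects a launcher (not the script's own flags) to export the rank variables for each worker. A quick illustrative snippet (not part of the repo) to see which variables are missing when the error above appears:

    import os

    # env:// rendezvous reads these variables; torch.distributed.launch / torchrun
    # export them for every worker, which a plain `python train.py` does not.
    for var in ("RANK", "WORLD_SIZE", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT"):
        print(var, os.environ.get(var, "<not set>"))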
MukundVarmaT commented 1 year ago

Hi,

Thank you for your interest in our work! To train on multiple GPUs,

python3 -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=<num-gpus> --use_env --master_port=21221 train.py ... (remaining args)
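For example, with the config from the question and 8 GPUs, this could look like the following (the trailing arguments are only an illustration taken from the original command; keeping --distributed is assumed to be needed so the script enters the distributed branch):

    python3 -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=8 --use_env --master_port=21221 train.py --config configs/gnt_ft_rffr.txt --distributed

Note that --local_rank is no longer passed on the command line; with --use_env the launcher sets the LOCAL_RANK environment variable that train.py reads.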

lifuguan commented 1 year ago

Thanks! One more question following up on this issue: if I train the model with 8 GPUs, should I change N_rand from 4096 to 512?

MukundVarmaT commented 1 year ago

Yes, that's correct!
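For reference, the arithmetic behind that suggestion (a small sketch, assuming each distributed process samples its own N_rand rays, so the effective batch is N_rand times the number of GPUs):

    # Sketch: keep the effective ray batch constant when scaling to more GPUs.
    # Assumes every distributed process samples its own N_rand rays per step.
    num_gpus = 8
    effective_rays = 4096                        # single-GPU N_rand from the configs
    n_rand_per_gpu = effective_rays // num_gpus
    print(n_rand_per_gpu)                        # 512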