VITA-Group / GNT

[ICLR 2023] "Is Attention All NeRF Needs?" by Mukund Varma T*, Peihao Wang*, Xuxi Chen, Tianlong Chen, Subhashini Venugopalan, Zhangyang Wang
https://vita-group.github.io/GNT
MIT License

How to apply multi-gpu training? #13

Closed lifuguan closed 1 year ago

lifuguan commented 1 year ago

Hello, thanks for the great work! I'm wondering how we can apply multi-GPU training.

I use the following command:

python train.py --config configs/gnt_ft_rffr.txt --distributed --local_rank 2

but it raises the following error:

Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

The distributed training code in train.py is shown below:

    if args.distributed:
        # env:// rendezvous: RANK / WORLD_SIZE must already be set in the environment
        torch.distributed.init_process_group(backend="nccl", init_method="env://localhost:50000")
        # LOCAL_RANK is likewise read from the environment, not from the CLI flag
        args.local_rank = int(os.environ.get("LOCAL_RANK"))
        torch.cuda.set_device(args.local_rank)
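For reference, the env:// rendezvous expects a launcher (not the script's own flags) to export the rank variables for each worker. A quick illustrative snippet (not part of the repo) to see which variables are missing when the error above appears:

    import os

    # env:// rendezvous reads these variables; torch.distributed.launch / torchrun
    # export them for every worker, which a plain `python train.py` does not.
    for var in ("RANK", "WORLD_SIZE", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT"):
        print(var, os.environ.get(var, "<not set>"))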
MukundVarmaT commented 1 year ago

Hi,

Thank you for your interest in our work! To train on multiple GPUs,

python3 -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=<num-gpus> --use_env --master_port=21221 train.py ... (remaining args)
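For example, with the config from the question and 8 GPUs, this could look like the following (the trailing arguments are only an illustration taken from the original command; keeping --distributed is assumed to be needed so the script enters the distributed branch):

    python3 -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=8 --use_env --master_port=21221 train.py --config configs/gnt_ft_rffr.txt --distributed

Note that --local_rank is no longer passed on the command line; with --use_env the launcher sets the LOCAL_RANK environment variable that train.py reads.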

lifuguan commented 1 year ago

Thanks! One more question following up on this issue: if I train the model with 8 GPUs, should I change N_rand from 4096 to 512?

MukundVarmaT commented 1 year ago

Yes, that's correct!
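For reference, the arithmetic behind that suggestion (a small sketch, assuming each distributed process samples its own N_rand rays, so the effective batch is N_rand times the number of GPUs):

    # Sketch: keep the effective ray batch constant when scaling to more GPUs.
    # Assumes every distributed process samples its own N_rand rays per step.
    num_gpus = 8
    effective_rays = 4096                        # single-GPU N_rand from the configs
    n_rand_per_gpu = effective_rays // num_gpus
    print(n_rand_per_gpu)                        # 512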