erensezener opened this issue 3 years ago
@erensezener Did you figure this out? :)
@gizacard Would you mind providing some instructions on this? Which options should be set? Thanks
@gizacard I wanted to train with multiple GPUs (4 GPUs), and for that I used local_rank=0
and set the following env variables:
RANK=0
NGPU=4
WORLD_SIZE=4
Although I am not aiming for a Slurm job, the code here requires me to set MASTER_ADDR and MASTER_PORT as well. Why? Anyway, I set them to my server's IP and a port (see the note on these variables below).
After setting these parameters, when I run the code, the training never starts, though without distributed training (single GPU) it works fine.
Can you tell me whether I am doing this correctly? Thanks
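A note on MASTER_ADDR and MASTER_PORT: they are not Slurm-specific. PyTorch's default env:// rendezvous reads them (together with RANK and WORLD_SIZE) when init_process_group is called, even for a single-node job, so 127.0.0.1 and any free port are enough. Below is a minimal sketch of that initialization with illustrative values; it is not the exact code in train_reader.py.

```python
# Minimal sketch of env:// process-group setup on a single node (illustrative, not FiD's exact code).
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # localhost is fine for a single node
os.environ.setdefault("MASTER_PORT", "29500")      # any free port

# torch.distributed.launch sets RANK and WORLD_SIZE per process; newer versions also export LOCAL_RANK
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", init_method="env://")  # reads the variables above
print(f"initialized rank {dist.get_rank()} of {dist.get_world_size()}")
```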
NGPU=<num of gpus in one node> python -m torch.distributed.launch --nproc_per_node=<num of gpus in one node> train_reader.py \
--use_checkpoint \
--lr 0.00005 \
--optim adamw \
--scheduler linear \
--weight_decay 0.01 \
--text_maxlength 250 \
--per_gpu_batch_size <bs> \
--n_context 100 \
--total_step 15000 \
--warmup_step 1000 \
--train_data open_domain_data/NQ/train.json \
--eval_data open_domain_data/NQ/dev.json \
--model_size base \
--name testing_base_model_nq \
--checkpoint_dir pretrained_models \
--accumulation_steps <steps>
Something like this worked for me
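For choosing <bs> and <steps>: assuming accumulation_steps behaves like standard gradient accumulation, the effective batch size is per_gpu_batch_size × number of GPUs × accumulation_steps. For example, per_gpu_batch_size=1 on 4 GPUs with accumulation_steps=16 gives an effective batch size of 1 × 4 × 16 = 64.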
@fabrahman Could you provide an update on this? I have exactly the same issue, and I found that the code freezes without any error message after executing line 194 of train_reader.py.
After setting these parameters, when I run the code, the training never starts. Though without distributed training (single GPU) it works fine.
@Duemoo I also encountered this problem. Using multiple GPUs, I found that the code freezes without any error message after executing line 194 of train_reader.py. How did you solve it? Thanks!
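A general way to narrow down this kind of silent hang (not specific to this repo) is to rerun with NCCL_DEBUG=INFO to get NCCL's own logging, and to check whether a process group can be formed at all with a tiny standalone script launched the same way as train_reader.py. The script below is a hypothetical smoke test, not part of FiD:

```python
# smoke_test.py - hypothetical NCCL/all_reduce check, not part of the FiD repo.
# Launch: NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node=4 smoke_test.py
import argparse
import os
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each process it spawns
parser.add_argument("--local_rank", type=int, default=int(os.environ.get("LOCAL_RANK", 0)))
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

t = torch.ones(1, device="cuda")
dist.all_reduce(t)  # if this hangs as well, the problem is the NCCL/GPU setup, not FiD
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {t.item()}")
```

If the smoke test hangs too, the usual suspects are a wrong or unreachable MASTER_ADDR/MASTER_PORT, or GPUs that cannot communicate peer-to-peer (setting NCCL_P2P_DISABLE=1 is sometimes worth a try).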
I see that there is some code supporting multiple GPUs, e.g. here and here.
However, I don't see an option/flag to actually enable distributed training. Could you clarify?
Thank you.