facebookresearch / FiD

Fusion-in-Decoder

Distributed computing (eg, multi-GPU) support #13

Open erensezener opened 3 years ago

erensezener commented 3 years ago

I see that there is some code supporting multiple GPUs, e.g. here and here.

However, I don't see an option/flag to actually utilize distributed computing. Could you clarify?

Thank you.

fabrahman commented 2 years ago

@erensezener Did you figure this out? :)

fabrahman commented 2 years ago

@gizacard Would you mind providing some instruction on this? Which options should be set? Thanks

fabrahman commented 2 years ago

@gizacard I wanted to train on multiple GPUs (4 GPUs), and for that I set local_rank=0 and the following environment variables:

RANK=0
NGPU=4
WORLD_SIZE=4

Although I am not aiming for a Slurm job, the code here requires me to set MASTER_ADDR and MASTER_PORT as well. Why? Anyway, I set them to my server's IP and a port.

After setting these parameters, when I run the code, the training never starts, though without distributed training (single GPU) it works fine.

Can you tell me whether I am doing this correctly? Thanks
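For reference, my (possibly wrong) understanding of what the distributed init needs is roughly the following. This is a generic torch.distributed env:// sketch, not the repo's actual code, and the address/port values are placeholders:

import os
import torch.distributed as dist

# Placeholder values mirroring the environment variables above.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # rendezvous address
os.environ.setdefault("MASTER_PORT", "29500")      # rendezvous port
os.environ.setdefault("RANK", "0")                 # global rank of this process
os.environ.setdefault("WORLD_SIZE", "4")           # total number of processes expected

# With init_method="env://", PyTorch reads the four variables above to
# rendezvous. This call blocks until WORLD_SIZE processes have joined, so
# starting a single process with WORLD_SIZE=4 hangs with no error message.
dist.init_process_group(backend="nccl", init_method="env://")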

gowtham1997 commented 2 years ago

NGPU=<num of gpus in one node> python -m torch.distributed.launch --nproc_per_node=<num of gpus in one node> train_reader.py \
        --use_checkpoint \
        --lr 0.00005 \
        --optim adamw \
        --scheduler linear \
        --weight_decay 0.01 \
        --text_maxlength 250 \
        --per_gpu_batch_size <bs> \
        --n_context 100 \
        --total_step 15000 \
        --warmup_step 1000 \
        --train_data open_domain_data/NQ/train.json \
        --eval_data open_domain_data/NQ/dev.json \
        --model_size base \
        --name testing_base_model_nq \
        --checkpoint_dir pretrained_models \
        --accumulation_steps <steps>

Something like this worked for me
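One thing to keep in mind: as far as I know, torch.distributed.launch spawns one process per GPU and fills in RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for each of them, so no manual environment variables are needed with this launch command. The effective batch size is then per_gpu_batch_size x number of GPUs x accumulation_steps; illustrative numbers only (the real values are whatever you pass on the command line):

# Illustrative numbers only, not the repo's defaults.
n_gpus = 4                # --nproc_per_node / NGPU
per_gpu_batch_size = 1    # --per_gpu_batch_size
accumulation_steps = 16   # --accumulation_steps

effective_batch_size = n_gpus * per_gpu_batch_size * accumulation_steps
print(effective_batch_size)  # 64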

Duemoo commented 1 year ago

@fabrahman Could you provide an update on this issue? I have exactly the same issue, and I found that the code freezes without any error message after executing line 194 of train_reader.py.

"After setting these parameters, when I run the code, the training never starts, though without distributed training (single GPU) it works fine."

szh-max commented 1 year ago

@Duemoo I also encountered this problem. Using multiple GPUs, I found that the code freezes without any error message after executing line 194 of train_reader.py. How did you solve it? Thanks!
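In case it helps while we wait for an answer: one way to turn the silent freeze into an actual error is to pass a shorter timeout to init_process_group and enable NCCL logging. This is a generic PyTorch debugging sketch, not specific to this repo:

import datetime
import os
import torch.distributed as dist

# Make NCCL log what it is doing, so a stuck rendezvous shows up in the output.
os.environ["NCCL_DEBUG"] = "INFO"

# With a short timeout, a rank that never sees its peers raises an exception
# after 5 minutes instead of hanging forever with no message.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(minutes=5),
)

If the resulting error points at the rendezvous, it is worth double-checking that WORLD_SIZE matches the number of processes actually launched.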