THUNLP-MT / THUMT

An open-source neural machine translation toolkit developed by Tsinghua Natural Language Processing Group
BSD 3-Clause "New" or "Revised" License

Multi-GPU training with PyTorch #86

Closed jennifer1995 closed 4 years ago

jennifer1995 commented 4 years ago

Hello. I want to use two GPUs to train the model (PyTorch 1.3.1 and CUDA 9.2) with the command below:

python trainer.py --input corpus.32k.en.train corpus.32k.zh.train --vocabulary vocab.32k.en.txt vocab.32k.zh.txt --model transformer --validation corpus.32k.en.validation --references corpus.zh.validation --parameters=batch_size=6250,device_list=[5,6],update_cycle=2,train_steps=200000

but I get the following error:

Traceback (most recent call last):
  File "/home2/yy/port22/anaconda2/envs/py36tfmpt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home2/yy/port22/THUMT_en_zh/thumt/bin/trainer.py", line 456, in process_fn
    main(local_args)
  File "/home2/yy/port22/THUMT_en_zh/thumt/bin/trainer.py", line 410, in main
    loss = train_fn(features)
  File "/home2/yy/port22/THUMT_en_zh/thumt/bin/trainer.py", line 392, in train_fn
    loss = model(features, labels)
  File "/home2/yy/port22/anaconda2/envs/py36tfmpt/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "../../thumt/models/transformer.py", line 304, in forward
    state = self.encode(features, state)
  File "../../thumt/models/transformer.py", line 253, in encode
    inputs = torch.nn.functional.embedding(src_seq, self.src_embedding)
  File "/home2/yy/port22/anaconda2/envs/py36tfmpt/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1573049305765/work/aten/src/THC/generic/THCTensorIndex.cu:400

I'm sure that GPU 5 and GPU 6 are available. What's the problem? Thank you very much.
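
For reference, a minimal standalone sketch (not THUMT code, and assuming a machine with at least two GPUs) that reproduces this class of error: the embedding weight and the index tensor end up on different devices.

    import torch

    # Hypothetical repro, not taken from THUMT: the embedding table lives on
    # cuda:0 while the token indices live on cuda:1, so the lookup raises
    # "arguments are located on different GPUs" on PyTorch 1.3.x.
    weight = torch.randn(100, 16, device="cuda:0")        # embedding table on GPU 0
    indices = torch.tensor([1, 2, 3], device="cuda:1")    # token ids on GPU 1
    out = torch.nn.functional.embedding(indices, weight)  # raises RuntimeError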

GrittyChen commented 4 years ago

@jennifer1995 You can run the program as below:

export CUDA_VISIBLE_DEVICES=5,6
python trainer.py --input corpus.32k.en.train corpus.32k.zh.train --vocabulary vocab.32k.en.txt vocab.32k.zh.txt --model transformer --validation corpus.32k.en.validation --references corpus.zh.validation --parameters=batch_size=6250,device_list=[0,1],update_cycle=2,train_steps=200000
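
As a small sanity check (not part of THUMT), one can confirm that with CUDA_VISIBLE_DEVICES=5,6 exported, PyTorch only sees two devices and the logical indices 0 and 1 map to physical GPUs 5 and 6:

    import torch

    # Assumes CUDA_VISIBLE_DEVICES=5,6 was exported before starting Python.
    print(torch.cuda.device_count())  # expected: 2 (only GPUs 5 and 6 are visible)
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))  # logical index -> device name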

jennifer1995 commented 4 years ago

@jennifer1995 You can run the program as below:

export CUDA_VISIBLE_DEVICES=5,6
python trainer.py --input corpus.32k.en.train corpus.32k.zh.train --vocabulary vocab.32k.en.txt vocab.32k.zh.txt --model transformer --validation corpus.32k.en.validation --references corpus.zh.validation --parameters=batch_size=6250,device_list=[0,1],update_cycle=2,train_steps=200000

OK, I will try it. Thank you.

Felixgithub2017 commented 4 years ago

@jennifer1995 You can run the program as below:

export CUDA_VISIBLE_DEVICES=5,6
python trainer.py --input corpus.32k.en.train corpus.32k.zh.train --vocabulary vocab.32k.en.txt vocab.32k.zh.txt --model transformer --validation corpus.32k.en.validation --references corpus.zh.validation --parameters=batch_size=6250,device_list=[0,1],update_cycle=2,train_steps=200000

Hi GrittyChen, I got the same error: 'arguments are located on different GPUs'.

I have two GPUs, and the device parameter was set as 'device_list=[0,1]'. I tried export CUDA_VISIBLE_DEVICES=0,1, but the error was still there. Any other suggestions?

GrittyChen commented 4 years ago

@jennifer1995 I am sorry, but I could not locate the cause of this error and could not reproduce it in my own environment. The best I can suggest is to change the PyTorch and CUDA versions: create a new virtual environment and try again. My virtual environment is configured with Python 3.6, PyTorch 1.3.0, and CUDA 10.0. Best wishes!
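
When comparing environments, a quick check (plain PyTorch, not THUMT code) of the build actually in use can help rule out mixed or mismatched installs:

    import torch

    # Prints the PyTorch version, the CUDA version it was built against,
    # and whether CUDA is usable at all in this environment.
    print(torch.__version__)          # e.g. 1.3.0
    print(torch.version.cuda)         # e.g. 10.0
    print(torch.cuda.is_available())  # should be True on a GPU machine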

Wenda302 commented 4 years ago

Hi GrittyChen, I am having the same issue:

File "/home/wenda/THUMT_py/THUMT/thumt/models/transformer.py", line 253, in encode inputs = torch.nn.functional.embedding(src_seq, self.src_embedding) File "/home/wenda/anaconda2/envs/python36/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/THC/generic/THCTensorIndex.cu:400

My environment is configured with PyTorch 1.3.1/1.4.0, CUDA 10.2, and Python 3.6.

Glaceon31 commented 4 years ago

@liwenda1990 Did you install PyTorch with conda? Make sure you use the command "conda install pytorch=1.3.1 -c pytorch" when installing PyTorch with conda.

Wenda302 commented 4 years ago

@Glaceon31 I am afraid the error persists. Here is the relevant package info from 'conda list':

cudatoolkit               10.0.130                      0
pip                       20.0.2                   pypi_0    pypi
python                    3.6.10               h0371630_0  
pytorch                   1.3.1           py3.6_cuda10.0.130_cudnn7.6.3_0    pytorch
torch                     1.3.1                    pypi_0    pypi
torchvision               0.4.2                py36_cu100    pytorch

Wenda302 commented 4 years ago

I dug into the problem a little bit. The issue seems to be caused by the default device being (mysteriously) forgotten when executing dataset = data.get_dataset(params.input, "train", params) (around line 362 of trainer.py). The default device for each process is initially set by torch.cuda.set_device(params.device_list[args.local_rank]) (around line 317 of trainer.py), so a quick workaround is to reset the default device after 'data.get_dataset'. I inserted

    # Re-pin the default CUDA device (and default tensor type) for this
    # process, since it appears to be reset after data.get_dataset(...).
    if args.distributed:
        torch.cuda.set_device(args.local_rank)
        torch.set_default_tensor_type(torch.cuda.FloatTensor)
    else:
        torch.cuda.set_device(params.device_list[args.local_rank])
        torch.set_default_tensor_type(torch.cuda.FloatTensor)

at about line 402 in trainer.py (before 'while True').

The code seems to work now, but why 'data.get_dataset' causes the problem may need further investigation.
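
One way to investigate further would be to instrument the call site; a rough sketch (the print lines are hypothetical additions, while the get_dataset line is the original call around line 362 of trainer.py):

    import torch  # already imported in trainer.py; shown here for completeness

    # Hypothetical instrumentation purely for debugging the device reset.
    print("device before get_dataset:", torch.cuda.current_device())
    dataset = data.get_dataset(params.input, "train", params)  # original call
    print("device after get_dataset:", torch.cuda.current_device())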