Closed jennifer1995 closed 4 years ago
@jennifer1995 You can run the program as below:
export CUDA_VISIBLE_DEVICES=5,6
python trainer.py --input corpus.32k.en.train corpus.32k.zh.train --vocabulary vocab.32k.en.txt vocab.32k.zh.txt --model transformer --validation corpus.32k.en.validation --references corpus.zh.validation --parameters=batch_size=6250,device_list=[0,1],update_cycle=2,train_steps=200000
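A note on why device_list is [0,1] here rather than [5,6]: CUDA renumbers the devices listed in CUDA_VISIBLE_DEVICES starting from 0, so physical GPUs 5 and 6 appear to the process as devices 0 and 1. A minimal sketch of that mapping is below. The helper name logical_to_physical is hypothetical and not part of THUMT, and it only handles plain integer ids, not GPU UUIDs.

```python
import os


def logical_to_physical(visible=None):
    """Map logical CUDA indices (as seen by the process) to physical GPU ids.

    `visible` is the value of CUDA_VISIBLE_DEVICES; if None, it is read from
    the environment. Only comma-separated integer ids are handled here.
    An empty result means the variable is unset, i.e. no renumbering applies.
    """
    if visible is None:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    ids = [int(tok) for tok in visible.split(",") if tok.strip()]
    return {logical: physical for logical, physical in enumerate(ids)}


mapping = logical_to_physical("5,6")
print(mapping)  # {0: 5, 1: 6}
```

So with the export above, device_list=[0,1] selects physical GPUs 5 and 6, while device_list=[5,6] would refer to logical devices that the process cannot see.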
OK, I will try it. Thank you.
Hi GrittyChen, I got the same error: 'arguments are located on different GPUs'.
I have two GPUs, and the device parameter was set as 'device_list=[0,1]'. I tried export CUDA_VISIBLE_DEVICES=0,1, but the error was still there. Any other suggestions?
@jennifer1995 I am sorry, but I could not locate the cause of this error or reproduce it in my own environment. My best suggestion is to try a different version of pytorch and cuda: create a new virtual environment and try again. My virtual environment is configured with python 3.6, pytorch 1.3.0, and cuda 10.0. Best wishes!
Hi GrittyChen, I am having the same issue:
  File "/home/wenda/THUMT_py/THUMT/thumt/models/transformer.py", line 253, in encode
    inputs = torch.nn.functional.embedding(src_seq, self.src_embedding)
  File "/home/wenda/anaconda2/envs/python36/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/THC/generic/THCTensorIndex.cu:400
My environment is configured with pytorch 1.3.1/1.4.0, cuda 10.2 and python 3.6.
@liwenda1990 Did you install pytorch with conda? If so, make sure you used the command "conda install pytorch=1.3.1 -c pytorch".
@Glaceon31 I am afraid that the error persists. Here is the relevant package info from 'conda list':
cudatoolkit 10.0.130 0
pip 20.0.2 pypi_0 pypi
python 3.6.10 h0371630_0
pytorch 1.3.1 py3.6_cuda10.0.130_cudnn7.6.3_0 pytorch
torch 1.3.1 pypi_0 pypi
torchvision 0.4.2 py36_cu100 pytorch
I dug into the problem a little bit. The issue seems to be caused by the default device being forgotten (mysteriously) when executing
dataset = data.get_dataset(params.input, "train", params)
(about line 362, trainer.py).
The default device for each process was initially set by
torch.cuda.set_device(params.device_list[args.local_rank])
(about line 317, trainer.py),
so a quick workaround is to reset the default device after 'data.get_dataset': I inserted
if args.distributed:
    torch.cuda.set_device(args.local_rank)
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
else:
    torch.cuda.set_device(params.device_list[args.local_rank])
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
at about line 402 in trainer.py (before 'while True').
The code seems to be working now, but why 'data.get_dataset' causes the problem may need further investigation.
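The reset could also be wrapped in a small helper so that it is easy to call right after 'data.get_dataset'. This is only a sketch: the names pick_device_index and reset_default_device are hypothetical and not part of THUMT, and it assumes args.distributed, args.local_rank and params.device_list behave as in trainer.py.

```python
def pick_device_index(distributed, local_rank, device_list):
    """Return the CUDA index the current process should use."""
    if distributed:
        # In distributed mode the local rank is used directly as the device index.
        return local_rank
    # Otherwise the local rank selects an entry from the configured device list.
    return device_list[local_rank]


def reset_default_device(args, params):
    """Re-apply the per-process default device (requires PyTorch with CUDA)."""
    import torch  # imported lazily so pick_device_index can be used without a GPU

    index = pick_device_index(args.distributed, args.local_rank, params.device_list)
    torch.cuda.set_device(index)
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
    return index


# Second process of a non-distributed run with device_list=[0,1]
print(pick_device_index(False, 1, [0, 1]))  # 1
```

Calling reset_default_device(args, params) before the 'while True' loop would have the same effect as the lines I inserted.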
Hello. I want to use two GPUs to train the model (pytorch 1.3.1 and cuda 9.2) with a command like the one below:
python trainer.py --input corpus.32k.en.train corpus.32k.zh.train --vocabulary vocab.32k.en.txt vocab.32k.zh.txt --model transformer --validation corpus.32k.en.validation --references corpus.zh.validation --parameters=batch_size=6250,device_list=[5,6],update_cycle=2,train_steps=200000
but I get the following error:
Traceback (most recent call last):
  File "/home2/yy/port22/anaconda2/envs/py36tfmpt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home2/yy/port22/THUMT_en_zh/thumt/bin/trainer.py", line 456, in process_fn
    main(local_args)
  File "/home2/yy/port22/THUMT_en_zh/thumt/bin/trainer.py", line 410, in main
    loss = train_fn(features)
  File "/home2/yy/port22/THUMT_en_zh/thumt/bin/trainer.py", line 392, in train_fn
    loss = model(features, labels)
  File "/home2/yy/port22/anaconda2/envs/py36tfmpt/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "../../thumt/models/transformer.py", line 304, in forward
    state = self.encode(features, state)
  File "../../thumt/models/transformer.py", line 253, in encode
    inputs = torch.nn.functional.embedding(src_seq, self.src_embedding)
  File "/home2/yy/port22/anaconda2/envs/py36tfmpt/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1573049305765/work/aten/src/THC/generic/THCTensorIndex.cu:400
I'm sure that gpu5 and gpu6 are available. What's the problem? Thank you very much.