I am running into the same problem as https://github.com/pytorch/fairseq/issues/3308 when I try to run on a single machine with multiple GPUs. The project runs successfully with a single GPU, but fails with multiple GPUs.
```
-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/data/szc/anaconda3/envs/prompt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/data/szc/nmt/bert-nmt/train.py", line 282, in distributed_main
    main(args, init_distributed=True)
  File "/data/szc/nmt/bert-nmt/train.py", line 95, in main
    train(args, trainer, task, epoch_itr)
  File "/data/szc/nmt/bert-nmt/train.py", line 138, in train
    log_output = trainer.train_step(samples)
  File "/data/szc/nmt/bert-nmt/fairseq/trainer.py", line 308, in train_step
    all(norm == prev_norms[0] for norm in prev_norms)
  File "/data/szc/nmt/bert-nmt/fairseq/trainer.py", line 308, in <genexpr>
    all(norm == prev_norms[0] for norm in prev_norms)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
```
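For reference, the underlying failure is PyTorch refusing to compare tensors that live on different GPUs. A minimal sketch of the same error, assuming a machine with at least two visible CUDA devices:

```python
import torch

# Two scalar tensors on different GPUs, standing in for the per-worker
# gradient norms that trainer.py gathers into prev_norms.
a = torch.tensor(1.0, device="cuda:0")
b = torch.tensor(1.0, device="cuda:1")

try:
    a == b  # cross-device comparison is not allowed
except RuntimeError as e:
    # "Expected all tensors to be on the same device, but found at
    # least two devices, cuda:0 and cuda:1!"
    print(e)

# Moving one operand to the other's device first makes the comparison legal.
print(a == b.to(a.device))  # tensor(True, device='cuda:0')
```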
I changed the following code in fairseq/trainer.py:
```python
logging_outputs = list(chain.from_iterable(logging_outputs))
sample_sizes = list(chain.from_iterable(sample_sizes))
ooms = sum(ooms)
assert (
    all(norm == prev_norms[0] for norm in prev_norms)
    or all(math.isnan(norm) or math.isinf(norm) for norm in prev_norms)
), 'Fatal error: gradients are inconsistent between workers'
```
into:
```python
logging_outputs = list(chain.from_iterable(logging_outputs))
sample_sizes = list(chain.from_iterable(sample_sizes))
ooms = sum(ooms)
# assert (
#     all(norm == prev_norms[0] for norm in prev_norms)
#     or all(math.isnan(norm) or math.isinf(norm) for norm in prev_norms)
# ), 'Fatal error: gradients are inconsistent between workers'
```
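Instead of deleting the check, it might be safer to move each norm off its GPU before comparing, so the cross-worker sanity check is preserved. A minimal, untested sketch; the helper name `norms_consistent` is mine, not from fairseq:

```python
import math

import torch


def norms_consistent(prev_norms):
    """Same consistency check as the assert in trainer.py, but with every
    norm converted to a plain Python float first, so values coming from
    cuda:0 and cuda:1 can be compared without a device mismatch."""
    norms = [n.item() if torch.is_tensor(n) else float(n) for n in prev_norms]
    return (
        all(n == norms[0] for n in norms)
        or all(math.isnan(n) or math.isinf(n) for n in norms)
    )
```

The assert would then become `assert norms_consistent(prev_norms), 'Fatal error: gradients are inconsistent between workers'`, keeping the guard that the commented-out version drops.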
After commenting out the assert, the code runs successfully. However, the following warning appears:
`UserWarning: The check_reduction argument in DistributedDataParallel module is deprecated. Please avoid using it.`
I don't know whether this change will affect the final training and results. Can anyone confirm?
Python version: 3.7.6
fairseq version: 0.9.0
CUDA version: 11.0
PyTorch version: 1.7.1