facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Running with a previous version raises a "Found at least two devices" error #4360

Open zhaochen0110 opened 2 years ago

zhaochen0110 commented 2 years ago

I run into the same problem as https://github.com/pytorch/fairseq/issues/3308 when I try to run on a single machine with multiple GPUs. I can run the project successfully with a single GPU, but it fails with multiple GPUs.

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/data/szc/anaconda3/envs/prompt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/data/szc/nmt/bert-nmt/train.py", line 282, in distributed_main
    main(args, init_distributed=True)
  File "/data/szc/nmt/bert-nmt/train.py", line 95, in main
    train(args, trainer, task, epoch_itr)
  File "/data/szc/nmt/bert-nmt/train.py", line 138, in train
    log_output = trainer.train_step(samples)
  File "/data/szc/nmt/bert-nmt/fairseq/trainer.py", line 308, in train_step
    all(norm == prev_norms[0] for norm in prev_norms)
  File "/data/szc/nmt/bert-nmt/fairseq/trainer.py", line 308, in <genexpr>
    all(norm == prev_norms[0] for norm in prev_norms)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

I changed the following in fairseq/trainer.py:

logging_outputs = list(chain.from_iterable(logging_outputs))
sample_sizes = list(chain.from_iterable(sample_sizes))
ooms = sum(ooms)
assert (
    all(norm == prev_norms[0] for norm in prev_norms)
    or all(math.isnan(norm) or math.isinf(norm) for norm in prev_norms)
), 'Fatal error: gradients are inconsistent between workers'

into

logging_outputs = list(chain.from_iterable(logging_outputs))
sample_sizes = list(chain.from_iterable(sample_sizes))
ooms = sum(ooms)
# assert (
#     all(norm == prev_norms[0] for norm in prev_norms)
#     or all(math.isnan(norm) or math.isinf(norm) for norm in prev_norms)
# ), 'Fatal error: gradients are inconsistent between workers'

With the assert commented out, the code runs successfully. However, the following warning shows up:

UserWarning: The check_reduction argument in DistributedDataParallel module is deprecated. Please avoid using it.

I don't know whether this change will influence the final training and results.
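A less invasive alternative (just a sketch, not fairseq's own fix; it assumes prev_norms holds scalar per-worker gradient norms that may be tensors on different GPUs, and that torch and math are already imported in trainer.py) would be to convert the norms to plain Python floats before the comparison, so the consistency check is kept without comparing tensors across devices:

logging_outputs = list(chain.from_iterable(logging_outputs))
sample_sizes = list(chain.from_iterable(sample_sizes))
ooms = sum(ooms)
# Move each per-worker norm off its device before comparing, instead of
# deleting the consistency check entirely.
prev_norms = [
    norm.item() if torch.is_tensor(norm) else float(norm)
    for norm in prev_norms
]
assert (
    all(norm == prev_norms[0] for norm in prev_norms)
    or all(math.isnan(norm) or math.isinf(norm) for norm in prev_norms)
), 'Fatal error: gradients are inconsistent between workers'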

Python 3.7.6, fairseq 0.9.0, CUDA 11.0, PyTorch 1.7.1
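As for the UserWarning: check_reduction is deprecated and ignored by recent PyTorch, so it is unrelated to the assert change; it is emitted wherever the model is wrapped in DistributedDataParallel with check_reduction=True (likely fairseq's DDP wrapper, e.g. fairseq/models/distributed_fairseq_model.py in 0.9.0-era code). A minimal sketch of wrapping without that argument (the function name and kwargs below are illustrative, not fairseq's actual API):

from torch.nn.parallel import DistributedDataParallel

def wrap_without_check_reduction(model, device_id):
    # Wrap a model in DDP without the deprecated check_reduction argument;
    # recent PyTorch ignores it anyway, so dropping it only silences the warning.
    return DistributedDataParallel(
        module=model,
        device_ids=[device_id],
        output_device=device_id,
        broadcast_buffers=False,
        # check_reduction=True,  # deprecated; omit to avoid the UserWarning
    )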

dsj96 commented 2 years ago

I met the same problem as you (#73). Would this change influence the final training and results?