OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

copy_attn causes RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select) #2596

Closed: aaaallleen closed this issue 1 week ago

aaaallleen commented 1 week ago

I tried training OpenNMT-py with the following config. With copy_attn set to False, everything trains fine, but when I set copy_attn to True, it produces the error log below.

The config:

src_vocab: data/transform.vocab
tgt_vocab: data/transform.vocab
share_vocab: True
log_file: train.log
data:
    corpus1:
        path_src: data/corpus.zh
        path_tgt: data/corpus.aus
        lambda_align: data/corpus.align
    valid:
        path_src: data/valid.zh
        path_tgt: data/valid.aus
        transforms: [sentencepiece]

src_subword_model: with_bible/large.model
tgt_subword_model: with_bible/large.model
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0

copy_attn: True

save_data: ./checkpoints
save_model: transformer
keep_checkpoint: 100
save_checkpoint_steps: 5000
valid_steps: 5000
train_steps: 200000
average_decay: 0.0005
seed: 1234
report_every: 1000
valid_metrics:
 - BLEU
 - TER
early_stopping: 5
early_stopping_criteria: BLEU
scoring_debug: True
dump_preds: press

# Batching
bucket_size: 144
world_size: 1
gpu_ranks: [0]
num_workers: 2
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 1024
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
warmup_steps: 16000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
position_encoding: true
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]

tensorboard: True
tensorboard_log_dir: runs

This is the error log it produces:

Traceback (most recent call last):
  File "/volume/training-data-aus-zh/OpenNMT-py/onmt/trainer.py", line 508, in _gradient_accumulation
    loss, batch_stats = self.train_loss(
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/volume/training-data-aus-zh/OpenNMT-py/onmt/utils/loss.py", line 327, in forward
    scores_data = collapse_copy_scores(
  File "/volume/training-data-aus-zh/OpenNMT-py/onmt/modules/copy_generator.py", line 28, in collapse_copy_scores
    score.index_add_(1, fill, score.index_select(1, blank))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
Traceback (most recent call last):
  File "/volume/training-data-aus-zh/OpenNMT-py/train.py", line 6, in <module>
    main()
  File "/volume/training-data-aus-zh/OpenNMT-py/onmt/bin/train.py", line 67, in main
    train(opt)
  File "/volume/training-data-aus-zh/OpenNMT-py/onmt/bin/train.py", line 52, in train
    train_process(opt, device_id=0)
  File "/volume/training-data-aus-zh/OpenNMT-py/onmt/train_single.py", line 238, in main
    trainer.train(
  File "/volume/training-data-aus-zh/OpenNMT-py/onmt/trainer.py", line 319, in train
    self._gradient_accumulation(
  File "/volume/training-data-aus-zh/OpenNMT-py/onmt/trainer.py", line 535, in _gradient_accumulation
    raise exc
  File "/volume/training-data-aus-zh/OpenNMT-py/onmt/trainer.py", line 508, in _gradient_accumulation
    loss, batch_stats = self.train_loss(
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/volume/training-data-aus-zh/OpenNMT-py/onmt/utils/loss.py", line 327, in forward
    scores_data = collapse_copy_scores(
  File "/volume/training-data-aus-zh/OpenNMT-py/onmt/modules/copy_generator.py", line 28, in collapse_copy_scores
    score.index_add_(1, fill, score.index_select(1, blank))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

I tried looking into the code, and model.to(device) is indeed called. Any idea why this could be happening? Thank you.
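
For context, the failure comes from the index tensors (blank and fill) living on the CPU while the scores live on cuda:0. Below is a minimal standalone sketch (an assumed reproduction, not taken from the original report; requires a CUDA device) of the same class of error raised by copy_generator.py line 28:

import torch

# Reproduce the device mismatch seen in collapse_copy_scores.
device = torch.device("cuda:0")
scores = torch.rand(3, 2, 8, device=device)   # copy scores on the GPU (shape is illustrative)
score = scores[:, 0]                          # one batch column, still on cuda:0

# Index tensors built from Python lists default to the CPU:
blank = torch.Tensor([4, 5]).to(torch.int64)  # cpu
fill = torch.Tensor([1, 2]).to(torch.int64)   # cpu

# Same pattern as copy_generator.py line 28; raises
# "Expected all tensors to be on the same device, ... cuda:0 and cpu!"
score.index_add_(1, fill, score.index_select(1, blank))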

aaaallleen commented 1 week ago

I found a fix for this: moving the blank, fill, and score tensors to the scores device resolves the error. Change OpenNMT-py/onmt/modules/copy_generator.py lines 24 to 30 to:

if blank:
    blank = torch.Tensor(blank).to(torch.int64).to(scores.device)
    fill = torch.Tensor(fill).to(torch.int64).to(scores.device)
    score = scores[:, b] if batch_dim == 1 else scores[b]
    score = score.to(score.device)
    score.index_add_(1, fill, score.index_select(1, blank))
    score.index_fill_(1, blank, 1e-10)

This should fix the issue, though I feel it is a temporary workaround rather than the best solution. Should I open a PR for this?
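
For comparison, a slightly tighter variant of the same patch (a sketch only, not committed code) builds the index tensors directly on the scores device with torch.tensor and drops the no-op score = score.to(score.device) line; scores, b, and batch_dim are the variables from the snippet above:

if blank:
    # Create the index tensors on the same device as the scores.
    blank = torch.tensor(blank, dtype=torch.int64, device=scores.device)
    fill = torch.tensor(fill, dtype=torch.int64, device=scores.device)
    score = scores[:, b] if batch_dim == 1 else scores[b]
    score.index_add_(1, fill, score.index_select(1, blank))
    score.index_fill_(1, blank, 1e-10)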

vince62s commented 1 week ago

Please read the README of the project: we are no longer supporting OpenNMT-py and are switching to https://github.com/eole-nlp/eole. However, bear in mind that we dropped copy attention in EOLE; it does not bring improvements, especially with transformers. I suggest you switch to EOLE if you intend to get support in the future.

aaaallleen commented 1 week ago

Oh, thank you. I noticed that performance didn't improve after fixing the issue. Thank you for your work!