QwenLM / Qwen2.5

Qwen2.5 is the large language model series developed by the Qwen team at Alibaba Cloud.

[Bug]: transformers[4.46.2] error in multi-GPU training. #1070


tanypullin commented 1 week ago

Model Series

Qwen2.5

What are the models used?

Qwen2.5-7B

What is the scenario where the problem happened?

transformers

Is this a known issue?

Information about environment

OS: Ubuntu
Python: 3.10
GPUs: 2x NVIDIA RTX 4090

Log output

File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 820, in forward
    return model_forward(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 808, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/peft/peft_model.py", line 1644, in forward
    return self.base_model(
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 197, in forward
    return self.model.forward(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1183, in forward
    loss = self.loss_function(logits, labels, self.vocab_size, **loss_kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 46, in ForCausalLMLoss
    loss = fixed_cross_entropy(shift_logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 28, in fixed_cross_entropy
    loss = loss / num_items_in_batch
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

Description

The traceback shows the failure at the line "loss = loss / num_items_in_batch" in transformers/loss/loss_utils.py: with the model split across the two GPUs, the loss and num_items_in_batch end up on different devices (cuda:1 and cuda:0).

Steps to reproduce

This happens with Qwen2.5-7B-Instruct. The problem can be reproduced with the following steps:

  1. Run PEFT training with the model sharded across both GPUs (a minimal sketch follows below).
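
The exact training script isn't included in the report; the sketch below is one way to set up that kind of run (the dataset, LoRA settings, and Trainer arguments are assumptions, not the reporter's actual configuration):

    import torch
    from datasets import Dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                              TrainingArguments)

    model_name = "Qwen/Qwen2.5-7B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # device_map="auto" shards the 7B model across both 4090s, which is what
    # leaves the loss and the batch-item count on different devices.
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto")
    model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM",
                                             target_modules=["q_proj", "v_proj"]))

    def tokenize(example):
        enc = tokenizer(example["text"], truncation=True, max_length=256)
        enc["labels"] = enc["input_ids"].copy()
        return enc

    dataset = Dataset.from_dict({"text": ["Hello, Qwen2.5!"] * 32})
    dataset = dataset.map(tokenize, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", per_device_train_batch_size=1,
                               num_train_epochs=1, bf16=True, report_to="none"),
        train_dataset=dataset,
    )
    trainer.train()  # raises the device-mismatch RuntimeError on transformers 4.46.2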

Expected results

Training is expected to proceed without errors.

Attempts to fix

Anything else helpful for investigation

Downgrading transformers to 4.45.0 works as a workaround.

jklj077 commented 5 days ago

Downgrading transformers to 4.45.0 works as a workaround.

This looks like an issue with transformers after the loss functions were reworked in 4.46.

For a hot fix, could you try editing this line

  File "/root/anaconda3/envs/newpy10/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 28, in fixed_cross_entropy
    loss = loss / num_items_in_batch

to

    loss = loss / torch.tensor(num_items_in_batch, device=loss.device)

Or stay on transformers<4.46.0 until a proper fix is released.
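
If editing the installed package isn't convenient, the same idea can be applied as a runtime patch from the training script. This is only a sketch of that workaround, not an official fix; the positional parameter names (source, target) are assumptions, but the traceback shows fixed_cross_entropy being called positionally, so the wrapper simply forwards the arguments after moving num_items_in_batch onto the logits' device:

    import torch
    import transformers.loss.loss_utils as loss_utils

    _orig_fixed_cross_entropy = loss_utils.fixed_cross_entropy

    def _patched_fixed_cross_entropy(source, target, num_items_in_batch=None,
                                     ignore_index=-100, **kwargs):
        # If the item count is a tensor that landed on another GPU, move it to
        # the logits' device so "loss / num_items_in_batch" stays on one device.
        if isinstance(num_items_in_batch, torch.Tensor):
            num_items_in_batch = num_items_in_batch.to(source.device)
        return _orig_fixed_cross_entropy(source, target, num_items_in_batch,
                                         ignore_index, **kwargs)

    # ForCausalLMLoss looks up fixed_cross_entropy in this module at call time,
    # so replacing the module attribute is enough for the patch to take effect.
    loss_utils.fixed_cross_entropy = _patched_fixed_cross_entropy

Apply the patch before calling trainer.train().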