QwenLM / Qwen-VL

The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.

[BUG] Multi-GPU inference fails with RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:3! #137

Open iFe1er opened 9 months ago

iFe1er commented 9 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

### Current Behavior

I am running inference on four Tesla T4 GPUs following the official demo. Since each card only has 16 GB of memory, the model has to be sharded across multiple GPUs, but inference fails with the error below.

Error message:


```
RuntimeError                              Traceback (most recent call last)

~/.cache/huggingface/modules/transformers_modules/qwen-vl-chat/modeling_qwen.py in chat(self, tokenizer, query, history, system, append_history, stream, stop_words_ids, generation_config, **kwargs)
    945         ))
    946         input_ids = torch.tensor([context_tokens]).to('cuda:2')
--> 947         outputs = self.generate(
    948             input_ids,
    949             stop_words_ids=stop_words_ids,

~/.cache/huggingface/modules/transformers_modules/qwen-vl-chat/modeling_qwen.py in generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, **kwargs)
   1064             logits_processor.append(stop_words_logits_processor)
   1065
-> 1066         return super().generate(
   1067             inputs,
   1068             generation_config=generation_config,

/data/services/anaconda3/envs/franky_torchgpu_py38_clean/lib/python3.8/site-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     25     def decorate_context(*args, **kwargs):
     26         with self.clone():
---> 27             return func(*args, **kwargs)
     28     return cast(F, decorate_context)
     29

/data/services/anaconda3/envs/franky_torchgpu_py38_clean/lib/python3.8/site-packages/transformers/generation/utils.py in generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, **kwargs)
   1586
   1587         # 13. run sample
-> 1588         return self.sample(
   1589             input_ids,
   1590             logits_processor=logits_processor,

/data/services/anaconda3/envs/franky_torchgpu_py38_clean/lib/python3.8/site-packages/transformers/generation/utils.py in sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
   2640
   2641         # forward pass to get next token
-> 2642         outputs = self(
   2643             **model_inputs,
   2644             return_dict=True,

/data/services/anaconda3/envs/franky_torchgpu_py38_clean/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1128         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130             return forward_call(*input, **kwargs)
   1131         # Do not call functions when jit is used
   1132         full_backward_hooks, non_full_backward_hooks = [], []

/data/services/anaconda3/envs/franky_torchgpu_py38_clean/lib/python3.8/site-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
    163             output = old_forward(*args, **kwargs)
    164         else:
--> 165             output = old_forward(*args, **kwargs)
    166         return module._hf_hook.post_forward(module, output)
    167

~/.cache/huggingface/modules/transformers_modules/qwen-vl-chat/modeling_qwen.py in forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, labels, use_cache, output_attentions, output_hidden_states, return_dict)
    854         )
    855
--> 856         transformer_outputs = self.transformer(
    857             input_ids,
    858             past_key_values=past_key_values,

/data/services/anaconda3/envs/franky_torchgpu_py38_clean/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1128         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130             return forward_call(*input, **kwargs)
   1131         # Do not call functions when jit is used
   1132         full_backward_hooks, non_full_backward_hooks = [], []

~/.cache/huggingface/modules/transformers_modules/qwen-vl-chat/modeling_qwen.py in forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions, output_hidden_states, return_dict)
    563                     images.append(bytes(image).decode('utf-8'))
    564
--> 565             images = self.visual.encode(images)
    566             assert images.shape[0] == len(images)
    567             fake_images = None

~/.cache/huggingface/modules/transformers_modules/qwen-vl-chat/visual.py in encode(self, image_paths)
    424             images.append(self.image_transform(image))
    425         images = torch.stack(images, dim=0)
--> 426         return self(images)

/data/services/anaconda3/envs/franky_torchgpu_py38_clean/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1128         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130             return forward_call(*input, **kwargs)
   1131         # Do not call functions when jit is used
   1132         full_backward_hooks, non_full_backward_hooks = [], []

~/.cache/huggingface/modules/transformers_modules/qwen-vl-chat/visual.py in forward(self, x)
    405
    406         x = x.permute(1, 0, 2)  # NLD -> LND
--> 407         x = self.transformer(x)
    408         x = x.permute(1, 0, 2)  # LND -> NLD
    409

/data/services/anaconda3/envs/franky_torchgpu_py38_clean/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1128         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130             return forward_call(*input, **kwargs)
   1131         # Do not call functions when jit is used
   1132         full_backward_hooks, non_full_backward_hooks = [], []

~/.cache/huggingface/modules/transformers_modules/qwen-vl-chat/visual.py in forward(self, x, attn_mask)
    326     def forward(self, x: torch.Tensor, attn_mask: Optional[torch.Tensor] = None):
    327         for r in self.resblocks:
--> 328             x = r(x, attn_mask=attn_mask)
    329         return x
    330

/data/services/anaconda3/envs/franky_torchgpu_py38_clean/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1128         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130             return forward_call(*input, **kwargs)
   1131         # Do not call functions when jit is used
   1132         full_backward_hooks, non_full_backward_hooks = [], []

~/.cache/huggingface/modules/transformers_modules/qwen-vl-chat/visual.py in forward(self, q_x, k_x, v_x, attn_mask)
    294
    295         x = q_x + self.attention(q_x=self.ln_1(q_x), k_x=k_x, v_x=v_x, attn_mask=attn_mask)
--> 296         x = x + self.mlp(self.ln_2(x))
    297         return x
    298

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:3!
```
### Expected Behavior

_No response_

### Steps To Reproduce

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch

torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("/data/services/mining/llm/model_bin/qwen-vl-chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/data/services/mining/llm/model_bin/qwen-vl-chat",
                                             trust_remote_code=True, device_map="auto", fp16=True).eval()
model.generation_config = GenerationConfig.from_pretrained("/data/services/mining/llm/model_bin/qwen-vl-chat", trust_remote_code=True)

# Run inference
query = tokenizer.from_list_format([
    {'image': test_url},
    {'text': '这幅图的内容是什么'},  # "What is the content of this image?"
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

### Environment

```Markdown
- OS: CentOS
- Python: 3.8.8
- Transformers: 4.13.0
- PyTorch: 1.12.0
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 10.2
```

### Anything else?

_No response_
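For debugging, here is a minimal diagnostic sketch (it reuses the checkpoint path and flags from the reproduction above) that prints the layer-to-device placement transformers records in `model.hf_device_map` when `device_map="auto"` is used. If submodules of the vision tower (the `visual` module in the traceback) end up on different GPUs, its forward pass mixes cuda:2 and cuda:3 tensors and raises exactly this error.

```python
from transformers import AutoModelForCausalLM

# Same checkpoint path and flags as in the reproduction above.
model = AutoModelForCausalLM.from_pretrained(
    "/data/services/mining/llm/model_bin/qwen-vl-chat",
    trust_remote_code=True,
    device_map="auto",
    fp16=True,
).eval()

# transformers records the auto-generated placement in hf_device_map.
# If entries belonging to the vision tower map to different GPU indices,
# the ViT forward pass will mix devices and raise the RuntimeError above.
for name, device in model.hf_device_map.items():
    print(f"{name} -> {device}")
```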
doraemon-plus commented 9 months ago

I am hitting the same bug: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:3! Was this ever resolved?

uyo9ko commented 9 months ago

change

```python
model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter,  # placeholder: path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()
```

to

```python
model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter,  # placeholder: path to the output directory
    device_map="cuda",
    trust_remote_code=True
).eval()
```
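For reference, a minimal sketch of the same idea applied to the base Qwen-VL-Chat checkpoint rather than a PEFT adapter (path and `fp16` flag copied from the reproduction above). Passing a single device as `device_map` keeps every submodule on one GPU, so no cross-device tensors can arise, but the whole fp16 model must then fit on that one card, which a 16 GB T4 likely cannot hold.

```python
from transformers import AutoModelForCausalLM

# Single-device loading: avoids the cuda:2 / cuda:3 mismatch, but requires
# one GPU with enough memory for the full fp16 model.
model = AutoModelForCausalLM.from_pretrained(
    "/data/services/mining/llm/model_bin/qwen-vl-chat",
    trust_remote_code=True,
    device_map="cuda:0",  # pin every layer to a single card
    fp16=True,
).eval()
```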
599046587lz commented 7 months ago

Has this been resolved?

cdqncn commented 7 months ago

Has this been resolved?

Etpoem commented 7 months ago

Try using only three of the cards. I also got this error when running inference on four 12 GB cards, but when I restricted it to just three of them, inference worked normally.
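A minimal sketch of that approach (the GPU indices are only an example; adjust them to your machine): hiding one of the cards via `CUDA_VISIBLE_DEVICES` before CUDA is initialized makes `device_map="auto"` spread the model over the remaining three.

```python
import os

# Must be set before torch/transformers initialize CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"  # example: expose only three of the four cards

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/data/services/mining/llm/model_bin/qwen-vl-chat",  # path from the reproduction above
    trust_remote_code=True,
    device_map="auto",
    fp16=True,
).eval()
```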

cnahmgx commented 2 months ago

@Etpoem Could you share your code? I've been at it all day and can't get it to work.

Etpoem commented 2 months ago

> @Etpoem Could you share your code? I've been at it all day and can't get it to work.

What type of cards are you using, and how many? I simply restricted the model to run on three 12 GB cards; I didn't modify any code.

1456416403 commented 1 month ago

> @Etpoem Could you share your code? I've been at it all day and can't get it to work.

> What type of cards are you using, and how many? I simply restricted the model to run on three 12 GB cards; I didn't modify any code.

Are you loading the model like this: `model = AutoPeftModelForCausalLM.from_pretrained(path to the output directory, device_map="cuda", trust_remote_code=True).eval()`?