THUDM / CogVLM2

GPT4V-level open-source multi-modal model based on Llama3-8B
Apache License 2.0

Multi-GPU inference Error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:5 and cuda:6! #32

Closed RussellEven closed 2 months ago

RussellEven commented 4 months ago

System Info / 系統信息

system version: Ubuntu 20.04 LTS
cuda version: 11.8
python version: 3.10.12
torch version: 2.3.0+cu118
xformers version: 0.0.26.post1+cu118

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

Bug Info

    File ".../huggingface/modules/transformers_modules/cogvlm2-llama3-chat-19B/visual.py", line 83, in forward
        output = mlp_input + mlp_output
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:5 and cuda:6!

The reproduction script is below:

    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from torch.nn.parallel import DistributedDataParallel as DDP
    import os

    os.environ["CUDA_VISIBLE_DEVICES"] = "0, 1, 2, 3, 4, 5, 6, 7"
    max_memory_mapping = {0: "20GB", 1: "20GB", 2: "20GB", 3: "20GB", 4: "20GB", 5: "20GB", 6: "20GB", 7: "20GB"}

    MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
    MODEL_PATH = "./cogvlm2-llama3-chat-19B"

    DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
    TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_PATH,
        trust_remote_code=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        device_map='auto',
        max_memory=max_memory_mapping,
        load_in_8bit=False,
        torch_dtype=TORCH_TYPE,
        trust_remote_code=True,
    ).to(DEVICE).eval()

    text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

    while True:
        image_path = input("image path >>>>> ")
        if image_path == '':
            print('You did not enter image path, the following will be a plain text conversation.')
            image = None
            text_only_first_query = True
        else:
            image = Image.open(image_path).convert('RGB')

        history = []

        while True:
            query = input("Human:")
            if query == "clear":
                break

            if image is None:
                if text_only_first_query:
                    query = text_only_template.format(query)
                    text_only_first_query = False
                else:
                    old_prompt = ''
                    for _, (old_query, response) in enumerate(history):
                        old_prompt += old_query + " " + response + "\n"
                    query = old_prompt + "USER: {} ASSISTANT:".format(query)
            if image is None:
                input_by_model = model.build_conversation_input_ids(
                    tokenizer,
                    query=query,
                    history=history,
                    template_version='chat'
                )
            else:
                input_by_model = model.build_conversation_input_ids(
                    tokenizer,
                    query=query,
                    history=history,
                    images=[image],
                    template_version='chat'
                )
            inputs = {
                'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
                'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
                'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
                'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
            }
            gen_kwargs = {
                "max_new_tokens": 2048,
                "pad_token_id": 128002,
            }
            print(inputs)
            with torch.no_grad():
                outputs = model.generate(**inputs, **gen_kwargs)
                outputs = outputs[:, inputs['input_ids'].shape[1]:]
                response = tokenizer.decode(outputs[0])
                response = response.split("<|end_of_text|>")[0]
                print("\nCogVLM2:", response)
            history.append((query, response))

Expected behavior / 期待表现

A working multi-GPU inference demo in a future version of the repo!

zRzRzRzRzRzRzR commented 4 months ago

use basic_demo/cli_demo_multi_gpus.py
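
For reference, the multi-GPU demo builds an explicit device map with accelerate instead of relying on `device_map='auto'` plus a trailing `.to(DEVICE)`. Below is a minimal sketch of that approach, not a verbatim copy of the script; the `no_split_module_classes` names are assumptions based on the block classes defined in the model's remote code, and MODEL_PATH must point to a local copy of the weights:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

    MODEL_PATH = "./cogvlm2-llama3-chat-19B"  # local snapshot; load_checkpoint_and_dispatch needs a directory on disk
    TORCH_TYPE = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

    # Build the module tree on the meta device so no GPU memory is allocated yet.
    with init_empty_weights():
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_PATH, torch_dtype=TORCH_TYPE, trust_remote_code=True
        )

    # Plan a placement that never splits a single transformer block across GPUs, so
    # residual additions such as `mlp_input + mlp_output` stay on one device.
    # The class names below are assumptions; check the module classes in the remote code.
    device_map = infer_auto_device_map(
        model,
        max_memory={i: "16GiB" for i in range(torch.cuda.device_count())},
        no_split_module_classes=["CogVLMDecoderLayer", "TransformerLayer"],
    )

    # Load the real weights according to the plan; do not call .to('cuda') afterwards,
    # the dispatch hooks already move inputs and activations between devices.
    model = load_checkpoint_and_dispatch(
        model, MODEL_PATH, device_map=device_map, dtype=TORCH_TYPE
    ).eval()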

yhygta commented 4 months ago

I am already using basic_demo/cli_demo_multi_gpus.py, and I still get the same error.

Jayantverma2 commented 4 months ago

How do I do multi-GPU inference with PEFT weights for CogVLM2?

yhygta commented 4 months ago

I am already using basic_demo/cli_demo_multi_gpus.py, and I still get the same error.

@zRzRzRzRzRzRzR I use basic_demo/cli_demo_multi_gpus.py and hit the same error:

    Traceback (most recent call last):
      File "/opt/bitmatrix/src/share-serv/serv_misc/src/cg2.py", line 100, in
        outputs = model.generate(**inputs, **gen_kwargs)
      File "/root/anaconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
        return func(*args, **kwargs)
      File "/root/anaconda3/lib/python3.11/site-packages/transformers/generation/utils.py", line 1758, in generate
        result = self._sample(
      File "/root/anaconda3/lib/python3.11/site-packages/transformers/generation/utils.py", line 2397, in _sample
        outputs = self(
      File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
        return forward_call(*args, **kwargs)
      File "/root/anaconda3/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
        output = module._old_forward(*args, **kwargs)
      File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/modeling_cogvlm.py", line 620, in forward
        outputs = self.model(
      File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
        return forward_call(*args, **kwargs)
      File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/modeling_cogvlm.py", line 389, in forward
        images_features = self.encode_images(images)
      File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/modeling_cogvlm.py", line 361, in encode_images
        images_features = self.vision(images)
      File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
        return forward_call(*args, **kwargs)
      File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/visual.py", line 130, in forward
        x = self.transformer(x)
      File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
        return forward_call(*args, **kwargs)
      File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/visual.py", line 94, in forward
        hidden_states = layer_module(hidden_states)
      File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/root/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
        return forward_call(*args, **kwargs)
      File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/visual.py", line 83, in forward
        output = mlp_input + mlp_output
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:2!

zRzRzRzRzRzRzR commented 4 months ago

Try setting it to 2 GPUs?

yhygta commented 4 months ago

Try setting it to 2 GPUs?

Hi, with two cards I just get an out-of-memory error instead (each card has 22 GB and is otherwise idle; no matter what max_memory_per_gpu is set to, it still reports insufficient GPU memory).

lvbinandylau commented 4 months ago

Same problem. Eight P100s, and nothing works no matter what I try.

valencebond commented 4 months ago

same problem

zRzRzRzRzRzRzR commented 4 months ago

Allocate at least 16 GB to each GPU, and use at most three cards.
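
As a concrete example of that suggestion, the snippet below restricts the run to three visible GPUs with a 16 GiB budget each; the specific indices and the "16GiB" figure are illustrative values, not taken from the repo:

    import os

    # Expose only three GPUs; this must be set before CUDA is initialized.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"

    # Budget per visible card, passed as max_memory=... when loading the model
    # or when calling infer_auto_device_map.
    max_memory = {0: "16GiB", 1: "16GiB", 2: "16GiB"}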

zRzRzRzRzRzRzR commented 4 months ago

Same problem. Eight P100s, and nothing works no matter what I try.

With P100s it is most likely a driver or operator-support issue; you would need to find an xformers build that matches that card (if the card is supported at all).

alice20212 commented 4 months ago

Allocate at least 16 GB to each GPU, and use at most three cards.

I got it working with three 4090s.

yhygta commented 4 months ago

Three 2080 Ti 22G cards, and it still runs out of GPU memory o(╥﹏╥)o

Andy-Zyu commented 4 months ago

I can also get it working with three 4090s, but inference speed cannot keep up under concurrent requests. Maintainers, is there a way to auto-map the model across 4 or 8 GPUs?

liuky74 commented 3 months ago

You can modify the device_map: the weights of a single layer were split across different GPUs. For example (the attached screenshot of the device_map is not reproduced here): under vision.transformer.layers.8 the tensor computation ends up on more than one device. In this example you could move all of layers.8 onto the same device, as sketched below.
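
A rough sketch of that manual patch, assuming you already have a device_map dict; the prefix "model.vision.transformer.layers.8" and the target GPU index are illustrative and should be matched against the keys printed in your own map:

    # Force every weight under one vision block onto a single GPU so the residual
    # add inside that block (output = mlp_input + mlp_output) stays on one device.
    def pin_module(device_map: dict, prefix: str, device: int) -> dict:
        patched = dict(device_map)
        for name in patched:
            if name == prefix or name.startswith(prefix + "."):
                patched[name] = device
        return patched

    # Hypothetical usage: pin the whole of layers.8 to GPU 1, then dispatch the model
    # with the patched map (e.g. via load_checkpoint_and_dispatch).
    device_map = pin_module(device_map, "model.vision.transformer.layers.8", 1)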

sevenclay commented 3 months ago

I can also get it working with three 4090s, but inference speed cannot keep up under concurrent requests. Maintainers, is there a way to auto-map the model across 4 or 8 GPUs?

Did you get it working?

tingxueronghua commented 2 months ago

Same problem here; it looks like the weights got split across different cards?

HJT9328 commented 2 months ago

I ran into the same problem. Looking for a fix.

WangWei990215 commented 1 month ago

I am already using basic_demo/cli_demo_multi_gpus.py, and I still get the same error.

I ran into the same error as well. Has it been resolved?

byerose commented 1 month ago

I am already using basic_demo/cli_demo_multi_gpus.py, and I still get the same error.

I ran into the same error as well. Has it been resolved?

Same here. You can refer to:

https://github.com/THUDM/CogVLM/issues/256

https://huggingface.co/THUDM/cogagent-chat-hf/tree/main