OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4V. A commercially usable, open-source multimodal dialogue model approaching GPT-4V performance.
https://internvl.github.io/
MIT License

Multi-GPU inference error #118

Closed · TerryYiDa closed this 1 month ago

TerryYiDa commented 2 months ago

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:7!

czczup commented 2 months ago

Hello, thanks for the feedback. Others have run into this issue as well; we will prioritize fixing it.

baigang666 commented 2 months ago

Looking forward to the fix. Kudos to your project!

hyhzl commented 2 months ago

@czczup I've run into the same issue. Could you fix this so we can fully experience this excellent project? Much appreciated!

NiYueLiuFeng commented 2 months ago

You can see an example of multi-GPU inference at https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5, and a similar solution in https://github.com/OpenGVLab/InternVL/issues/96. Part of the code requires modification:

# modeling_internvl_chat.py, line 353

input_embeds[selected] = vit_embeds.reshape(-1, C).cuda()  # move the ViT embeddings onto the current CUDA device
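
A slightly more device-agnostic variant (an assumption on my part, not the repository's official patch) moves the ViT embeddings to whatever device input_embeds already lives on, rather than the current default CUDA device:

# hypothetical alternative to the .cuda() patch above
input_embeds[selected] = vit_embeds.reshape(-1, C).to(input_embeds.device)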

gaodianzhuo commented 2 months ago

The same problem occurs with OpenGVLab/InternVL-Chat-V1-5-Int8 as well.

winca commented 2 months ago

Same problem here with OpenGVLab/InternVL-Chat-V1-5-Int8. Hoping for a fix.

kyleliang919 commented 2 months ago

There is a hack you can use to fix this problem: manually align all the devices in the transformers library. But that requires an ad hoc fix.
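
For reference, a minimal sketch of that idea (my assumption of what "aligning devices manually" could look like, not an official fix): forward pre-hooks that move each submodule's positional inputs onto the device where that submodule's weights live.

import torch

def align_devices(model):
    # Attach a pre-hook to every parameterized submodule so that incoming
    # tensors are moved to the device holding that submodule's weights.
    for module in model.modules():
        params = list(module.parameters(recurse=False))
        if not params:
            continue
        device = params[0].device

        def make_hook(target_device):
            def hook(mod, args):
                # Returning a tuple replaces the module's positional inputs.
                return tuple(
                    a.to(target_device) if torch.is_tensor(a) else a
                    for a in args
                )
            return hook

        module.register_forward_pre_hook(make_hook(device))

Note this only covers positional arguments; on PyTorch 2.0+, keyword arguments can be handled with register_forward_pre_hook(hook, with_kwargs=True).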

ChengLigen commented 1 month ago

Looking forward to a fix, same problem here!

czczup commented 1 month ago

Many people have encountered this bug, and we haven't yet found a good method to handle all cases. However, there is a workaround that requires manually assigning devices to the model.

For example, deploying this 26B model on two V100 GPUs:

The model has 26B parameters in total, so the ideal split is 13B per card. After setting aside the 6B ViT, card 0 can hold another 7B, which means 1/3 of the 20B LLM (16 of its 48 layers) goes on card 0 and the remaining 2/3 (32 layers) on card 1.

In code, it would look like this:

import torch
from transformers import AutoModel

device_map = {
    'vision_model': 0,
    'mlp1': 0,
    'language_model.model.tok_embeddings': 0,  # near the first layer of LLM
    'language_model.model.norm': 1,  # near the last layer of LLM
    'language_model.output.weight': 1  # near the last layer of LLM
}
# first 16 of the 48 decoder layers (1/3 of the LLM) go to card 0
for i in range(16):
    device_map[f'language_model.model.layers.{i}'] = 0
# the remaining 32 layers (2/3) go to card 1
for i in range(16, 48):
    device_map[f'language_model.model.layers.{i}'] = 1
print(device_map)
model = AutoModel.from_pretrained(
    path,  # path: local checkpoint dir or Hugging Face model ID
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map
).eval()
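
With the map in place, inference follows the chat API from the Hugging Face model card. A minimal usage sketch (load_image is the preprocessing helper from the model card; the image path and question are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
# pixel_values must start on cuda:0, where vision_model and mlp1 were placed
pixel_values = load_image('./example.jpg').to(torch.bfloat16).cuda(0)
generation_config = dict(max_new_tokens=512, do_sample=False)
question = 'Please describe the image in detail.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)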