OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4V. A commercially usable, open-source multimodal dialogue model approaching GPT-4V performance.
https://internvl.github.io/
MIT License

Multi-GPU inference error #118

Closed · TerryYiDa closed this 1 month ago

TerryYiDa commented 2 months ago

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:7!

czczup commented 2 months ago

Hello, thanks for the feedback. Others have run into this issue as well; we will prioritize fixing it.

baigang666 commented 2 months ago

Looking forward to the fix. Kudos to your project!

hyhzl commented 2 months ago

@czczup I've run into the same issue. Could you fix this so we can fully experience this excellent project? Much appreciated!

NiYueLiuFeng commented 2 months ago

You can see an example of multi-GPU inference at https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5, and a similar solution in https://github.com/OpenGVLab/InternVL/issues/96. Part of the code requires modification:

# modeling_internvl_chat.py, line 353

input_embeds[selected] = vit_embeds.reshape(-1, C).cuda()  # move the ViT embeddings onto the current CUDA device
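
A slightly more device-agnostic variant (an assumption on my part, not the repository's official patch) moves the ViT embeddings to whatever device input_embeds already lives on, rather than the current default CUDA device:

# hypothetical alternative to the .cuda() patch above
input_embeds[selected] = vit_embeds.reshape(-1, C).to(input_embeds.device)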

gaodianzhuo commented 2 months ago

The same problem occurs with OpenGVLab/InternVL-Chat-V1-5-Int8 as well.

winca commented 2 months ago

Same problem here with OpenGVLab/InternVL-Chat-V1-5-Int8. Hoping for a fix.

kyleliang919 commented 2 months ago

There is a hack you can use to fix this problem: manually align all the devices in the transformers library. But that requires an ad hoc fix.
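
For reference, a minimal sketch of that idea (my assumption of what "aligning devices manually" could look like, not an official fix): forward pre-hooks that move each submodule's positional inputs onto the device where that submodule's weights live.

import torch

def align_devices(model):
    # Attach a pre-hook to every parameterized submodule so that incoming
    # tensors are moved to the device holding that submodule's weights.
    for module in model.modules():
        params = list(module.parameters(recurse=False))
        if not params:
            continue
        device = params[0].device

        def make_hook(target_device):
            def hook(mod, args):
                # Returning a tuple replaces the module's positional inputs.
                return tuple(
                    a.to(target_device) if torch.is_tensor(a) else a
                    for a in args
                )
            return hook

        module.register_forward_pre_hook(make_hook(device))

Note this only covers positional arguments; on PyTorch 2.0+, keyword arguments can be handled with register_forward_pre_hook(hook, with_kwargs=True).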

ChengLigen commented 1 month ago

Looking forward to a fix, same problem here!

czczup commented 1 month ago

Many people have encountered this bug, and we haven't yet found a good method to handle all cases. However, there is a workaround that requires manually assigning devices to the model.

For example, deploying this 26B model on two V100 GPUs:

The model has 26B parameters in total, so the ideal split is 13B per card. After setting aside the 6B ViT, card 0 can hold another 7B, which means 1/3 of the 20B LLM (16 of its 48 layers) goes on card 0 and the remaining 2/3 (32 layers) on card 1.

In code, it would look like this:

import torch
from transformers import AutoModel

device_map = {
    'vision_model': 0,
    'mlp1': 0,
    'language_model.model.tok_embeddings': 0,  # near the first layer of LLM
    'language_model.model.norm': 1,  # near the last layer of LLM
    'language_model.output.weight': 1  # near the last layer of LLM
}
# first 16 of the 48 decoder layers (1/3 of the LLM) go to card 0
for i in range(16):
    device_map[f'language_model.model.layers.{i}'] = 0
# the remaining 32 layers (2/3) go to card 1
for i in range(16, 48):
    device_map[f'language_model.model.layers.{i}'] = 1
print(device_map)
model = AutoModel.from_pretrained(
    path,  # path: local checkpoint dir or Hugging Face model ID
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map
).eval()
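
With the map in place, inference follows the chat API from the Hugging Face model card. A minimal usage sketch (load_image is the preprocessing helper from the model card; the image path and question are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
# pixel_values must start on cuda:0, where vision_model and mlp1 were placed
pixel_values = load_image('./example.jpg').to(torch.bfloat16).cuda(0)
generation_config = dict(max_new_tokens=512, do_sample=False)
question = 'Please describe the image in detail.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)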