TerryYiDa closed this issue 1 month ago.
Hello, thanks for the feedback. Other people have run into this issue as well, and we will prioritize fixing it.
Looking forward to a fix. Kudos to your project!
@czczup I triggered and encountered the same issue. Could you guys solve this and help us experience this perfect project? Appreciated!
You can see an example of multi-GPU inference at https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5, and a similar solution in https://github.com/OpenGVLab/InternVL/issues/96. One part of the code requires modification:
# modeling_internvl_chat.py, line 353
input_embeds[selected] = vit_embeds.reshape(-1, C).cuda()
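For context, the fix in issue #96 boils down to aligning the ViT output with the device of input_embeds. A hedged variant of that one-line change (a sketch of the same idea, not the repository's official fix) moves the tensor to input_embeds' own device rather than the default CUDA device:

# Sketch: send the ViT output to whatever device input_embeds lives on,
# so the assignment also works when tok_embeddings is not on cuda:0.
input_embeds[selected] = vit_embeds.reshape(-1, C).to(input_embeds.device)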
The same problem occurs with OpenGVLab/InternVL-Chat-V1-5-Int8 on this version; please fix it.
There is a hack you can use to fix this problem: manually align all the devices in the transformers library yourself. But that fix has to be applied ad hoc.
Looking forward to a fix, same problem here!
Many people have encountered this bug, and we haven't yet found a good method to handle all cases. However, there is a workaround that requires manually assigning devices to the model.
For example, deploying this 26B model on two V100 GPUs:
The model has 26B parameters in total, so the ideal split is 13B per card. After excluding the 6B ViT, card 0 still needs to hold about 7B of the LLM, which means roughly 1/3 of the 20B language model goes on card 0 and 2/3 on card 1. The LLM has 48 transformer layers, so that works out to the first 16 layers on card 0 and the remaining 32 on card 1.
In code, it would look like this:
import torch
from transformers import AutoModel

path = 'OpenGVLab/InternVL-Chat-V1-5'

# Vision tower and the LLM's input side on GPU 0; the LLM's output side on GPU 1.
device_map = {
    'vision_model': 0,
    'mlp1': 0,
    'language_model.model.tok_embeddings': 0,  # near the first layer of the LLM
    'language_model.model.norm': 1,  # near the last layer of the LLM
    'language_model.output.weight': 1,  # near the last layer of the LLM
}
# First 16 of the 48 transformer layers (~1/3) on GPU 0, the remaining 32 on GPU 1.
for i in range(16):
    device_map[f'language_model.model.layers.{i}'] = 0
for i in range(16, 48):
    device_map[f'language_model.model.layers.{i}'] = 1
print(device_map)

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map,
).eval()
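As a quick sanity check (a sketch added here, not part of the original comment), you can confirm where each top-level module actually landed after loading by inspecting the parameters' devices:

from collections import defaultdict

# Map each top-level module to the set of devices its parameters ended up on.
devices = defaultdict(set)
for name, param in model.named_parameters():
    devices[name.split('.')[0]].add(str(param.device))
print(dict(devices))
# Expected something like:
# {'vision_model': {'cuda:0'}, 'mlp1': {'cuda:0'}, 'language_model': {'cuda:0', 'cuda:1'}}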
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:7!
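The error above mentions cuda:2 and cuda:7, which suggests an 8-GPU machine; the same manual-placement idea applies there. Below is a hedged sketch that generalizes the two-GPU recipe, assuming the same 26B model with a 48-layer LLM. The helper name split_model and the "GPU 0 counts as half a share" heuristic are illustrative assumptions, not from this thread:

def split_model(num_gpus, num_layers=48):
    # Assumption: treat GPU 0 as half a share for LLM layers,
    # since it also carries the 6B ViT and the input embeddings.
    shares = [0.5] + [1.0] * (num_gpus - 1)
    counts = [round(num_layers * s / sum(shares)) for s in shares]
    counts[-1] = num_layers - sum(counts[:-1])  # absorb rounding error

    device_map = {
        'vision_model': 0,
        'mlp1': 0,
        'language_model.model.tok_embeddings': 0,
        'language_model.model.norm': num_gpus - 1,
        'language_model.output.weight': num_gpus - 1,
    }
    layer = 0
    for gpu, n in enumerate(counts):
        for _ in range(n):
            device_map[f'language_model.model.layers.{layer}'] = gpu
            layer += 1
    return device_map

# For two GPUs this reproduces the 16/32 split above: print(split_model(2))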