OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o's performance.
https://internvl.readthedocs.io/en/latest/
MIT License

Common Issue Summary #232

Closed by czczup

czczup commented 3 months ago

Hi everyone,

This is a Common Issue Summary where I will compile the frequently encountered issues. If you notice any omissions, please feel free to help add to the list. Thank you!


czczup commented 3 months ago

I will summarize common issues here.

1. Multi-GPU Inference - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

Issues: #229, #118

Many people have encountered this bug, and we haven't yet found one method that handles all cases. There is, however, a workaround: manually assign each of the model's modules to a device.

For example, to deploy the 26B model on two V100 GPUs:

The model has 26B parameters in total, so ideally each card holds about 13B. The ViT accounts for 6B, so after placing it on card 0, that card should hold about 7B of the 20B LLM, i.e. roughly 1/3 of its 48 layers (16 layers); the other 2/3 (32 layers) go on card 1.

In code, it would look like this:

import torch
from transformers import AutoModel

path = 'OpenGVLab/InternVL-Chat-V1-5'  # the 26B model, or a local checkpoint

# ViT, projector (mlp1), and token embeddings on card 0; the final norm
# and output head on card 1, next to the last LLM layers.
device_map = {
    'vision_model': 0,
    'mlp1': 0,
    'language_model.model.tok_embeddings': 0,  # near the first layer of the LLM
    'language_model.model.norm': 1,  # near the last layer of the LLM
    'language_model.output.weight': 1  # near the last layer of the LLM
}
# The first 16 of the 48 LLM layers (1/3) go on card 0, the rest on card 1.
for i in range(16):
    device_map[f'language_model.model.layers.{i}'] = 0
for i in range(16, 48):
    device_map[f'language_model.model.layers.{i}'] = 1
print(device_map)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map
).eval()
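
The same arithmetic generalizes to other GPU counts. Below is a minimal sketch of a hypothetical split_model helper (the function name and the vit_cost_in_layers heuristic are this summary's own, not official InternVL code): it treats the 6B ViT as costing roughly 16 LLM layers' worth of memory and fills the cards in order.

# Hypothetical helper (a sketch, not official InternVL code): spread the
# 48 LLM layers over num_gpus cards, counting the ViT on card 0 as roughly
# the cost of 16 layers so that card 0 receives fewer LLM layers.
def split_model(num_layers=48, num_gpus=2, vit_cost_in_layers=16):
    # Total "units" of work: LLM layers plus the ViT's equivalent cost.
    per_gpu = (num_layers + vit_cost_in_layers) / num_gpus
    device_map = {
        'vision_model': 0,
        'mlp1': 0,
        'language_model.model.tok_embeddings': 0,
        'language_model.model.norm': num_gpus - 1,
        'language_model.output.weight': num_gpus - 1,
    }
    gpu, used = 0, vit_cost_in_layers  # card 0 already carries the ViT
    for i in range(num_layers):
        if used >= per_gpu and gpu < num_gpus - 1:
            gpu, used = gpu + 1, 0
        device_map[f'language_model.model.layers.{i}'] = gpu
        used += 1
    return device_map

With the defaults, split_model() reproduces the manual map above; pass num_gpus=4 to spread the same model over four cards.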

2. Multi-Image Inference - When more than one image is passed in, the model seems to treat all the input as one image: the code feeds all the image blocks to the model together, without distinguishing between different images. Even with lmdeploy, the problem is the same.

Issues: #223

The current V1.5 model was not trained on such interleaved data. The inference interface can be modified to support it, but the results are unstable.

The June version will include multi-image interleaved training, which should improve performance. The code will also support this feature at that time.
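
For reference, the unstable workaround amounts to concatenating the tiles of all images into a single pixel_values tensor before calling chat. A minimal sketch, continuing from the single-image quick-start example, where model, tokenizer, load_image, and generation_config are already defined (the image paths are placeholders):

import torch

# Tile each image into 448x448 blocks with the quick-start load_image helper
# (assumed in scope), then concatenate all blocks along the batch dimension.
pixel_values1 = load_image('image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

response = model.chat(tokenizer, pixel_values,
                      'Describe the two images.', generation_config)

Because all tiles arrive as one batch, the model has no signal for where one image ends and the next begins, which matches the behavior described in the issue.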

3. Prompt Format

Issues: #227

TODO

4. Quantization - AWQ / INT4 quantization; low GPU utilization during INT8 model inference

Issues: #209, #210, #193, #167

Thanks to the lmdeploy team for providing AWQ quantization support.

The 4-bit model is available at OpenGVLab/InternVL-Chat-V1-5-AWQ; give it a try.
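
A minimal sketch of loading the AWQ model with lmdeploy's pipeline API (the image path is a placeholder; check the lmdeploy docs for version requirements):

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Load the 4-bit weights with the TurboMind backend in AWQ format.
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5-AWQ',
                backend_config=TurbomindEngineConfig(model_format='awq'))
image = load_image('example.jpg')  # placeholder image path
response = pipe(('Describe this image.', image))
print(response.text)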