OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o's performance.
https://internvl.readthedocs.io/en/latest/
MIT License

Common Issue Summary #232

Closed by czczup

czczup commented 3 months ago

Hi everyone,

This is a Common Issue Summary where I will compile the frequently encountered issues. If you notice any omissions, please feel free to help add to the list. Thank you!


czczup commented 3 months ago

I will summarize common issues here.

1. Multi-GPU Inference - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

Issues: #229, #118

Many people have encountered this bug, and we haven't yet found one method that handles all cases. There is, however, a workaround: manually assign each of the model's modules to a device.

For example, to deploy the 26B model on two V100 GPUs:

The model has 26B parameters in total, so ideally each card holds about 13B. The ViT accounts for 6B, so after placing it on card 0, that card should hold about 7B of the 20B LLM, i.e. roughly 1/3 of its 48 layers (16 layers); the other 2/3 (32 layers) go on card 1.

In code, it would look like this:

import torch
from transformers import AutoModel

path = 'OpenGVLab/InternVL-Chat-V1-5'  # the 26B model, or a local checkpoint

# ViT, projector (mlp1), and token embeddings on card 0; the final norm
# and output head on card 1, next to the last LLM layers.
device_map = {
    'vision_model': 0,
    'mlp1': 0,
    'language_model.model.tok_embeddings': 0,  # near the first layer of the LLM
    'language_model.model.norm': 1,  # near the last layer of the LLM
    'language_model.output.weight': 1  # near the last layer of the LLM
}
# The first 16 of the 48 LLM layers (1/3) go on card 0, the rest on card 1.
for i in range(16):
    device_map[f'language_model.model.layers.{i}'] = 0
for i in range(16, 48):
    device_map[f'language_model.model.layers.{i}'] = 1
print(device_map)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map
).eval()
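
The same arithmetic generalizes to other GPU counts. Below is a minimal sketch of a hypothetical split_model helper (the function name and the vit_cost_in_layers heuristic are this summary's own, not official InternVL code): it treats the 6B ViT as costing roughly 16 LLM layers' worth of memory and fills the cards in order.

# Hypothetical helper (a sketch, not official InternVL code): spread the
# 48 LLM layers over num_gpus cards, counting the ViT on card 0 as roughly
# the cost of 16 layers so that card 0 receives fewer LLM layers.
def split_model(num_layers=48, num_gpus=2, vit_cost_in_layers=16):
    # Total "units" of work: LLM layers plus the ViT's equivalent cost.
    per_gpu = (num_layers + vit_cost_in_layers) / num_gpus
    device_map = {
        'vision_model': 0,
        'mlp1': 0,
        'language_model.model.tok_embeddings': 0,
        'language_model.model.norm': num_gpus - 1,
        'language_model.output.weight': num_gpus - 1,
    }
    gpu, used = 0, vit_cost_in_layers  # card 0 already carries the ViT
    for i in range(num_layers):
        if used >= per_gpu and gpu < num_gpus - 1:
            gpu, used = gpu + 1, 0
        device_map[f'language_model.model.layers.{i}'] = gpu
        used += 1
    return device_map

With the defaults, split_model() reproduces the manual map above; pass num_gpus=4 to spread the same model over four cards.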

2. Multi-Image Inference - When more than one image is passed in, the model seems to treat all the input as one image: the code feeds all the image blocks to the model together, without distinguishing between different images. Even with lmdeploy, the problem is the same.

Issues: #223

The current V1.5 model was not trained on such interleaved data. The inference interface can be modified to support it, but the results are unstable.

The June version will include multi-image interleaved training, which should improve performance. The code will also support this feature at that time.
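
For reference, the unstable workaround amounts to concatenating the tiles of all images into a single pixel_values tensor before calling chat. A minimal sketch, continuing from the single-image quick-start example, where model, tokenizer, load_image, and generation_config are already defined (the image paths are placeholders):

import torch

# Tile each image into 448x448 blocks with the quick-start load_image helper
# (assumed in scope), then concatenate all blocks along the batch dimension.
pixel_values1 = load_image('image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

response = model.chat(tokenizer, pixel_values,
                      'Describe the two images.', generation_config)

Because all tiles arrive as one batch, the model has no signal for where one image ends and the next begins, which matches the behavior described in the issue.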

3. Prompt Format

Issues: #227

TODO

4. Quantization - AWQ / INT4 quantization; low GPU utilization during INT8 model inference

Issues: #209, #210, #193, #167

Thanks to the lmdeploy team for providing AWQ quantization support.

The 4-bit model is available at OpenGVLab/InternVL-Chat-V1-5-AWQ; give it a try.
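
A minimal sketch of loading the AWQ model with lmdeploy's pipeline API (the image path is a placeholder; check the lmdeploy docs for version requirements):

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Load the 4-bit weights with the TurboMind backend in AWQ format.
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5-AWQ',
                backend_config=TurbomindEngineConfig(model_format='awq'))
image = load_image('example.jpg')  # placeholder image path
response = pipe(('Describe this image.', image))
print(response.text)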