[Bug] Poor multi-image inference results

stay-leave commented 4 days ago

Checklist

[X] 1. I have searched related issues but cannot get the expected help.
[X] 2. The bug has not been fixed in the latest version.

Describe the bug

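Hello, when I run multi-image inference with this example, the answers are very poor. I passed in three images, all of which are accessible, but the model returned descriptions of five images. Does multi-image inference work because InternVL-Chat-V1-5 itself can handle multiple images, or was something changed on the inference side?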

Reproduction

My code:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Serve the model with the TurboMind backend, split across 2 GPUs (tp=2)
pipe = pipeline('InternVL-Chat-V1-5',
                backend_config=TurbomindEngineConfig(session_len=8192, tp=2))

image_urls = [
    'https://i0.wp.com/wx3.sinaimg.cn/orj360/008yp4y7gy1hqfhh0d5eij30uq0r6tgo.jpg',
    'https://i0.wp.com/wx4.sinaimg.cn/mw2000/006tMkIxgy1hqvuzwc4bsj30u00u0wko.jpg',
    'https://i0.wp.com/wx2.sinaimg.cn/thumb180/0082PBjgly1hqvs524n96j30qo1a2qb6.jpg'
]

# Load all three images and send them together in a single prompt
images = [load_image(img_url) for img_url in image_urls]
response = pipe(('describe these images', images))
print(response)
```

Environment

```Shell
sys.platform: linux
Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: gcc (Debian 10.2.1-6) 10.2.1 20210110
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.2+cu121
LMDeploy: 0.4.2+708e919
transformers: 4.41.2
gradio: Not Found
fastapi: 0.111.0
pydantic: 2.7.4
triton: 2.2.0
```

Error traceback

```
Response(text='1. This image shows a woman standing on stone steps in the mountains. She is wearing a yellow jacket and sunglasses and holding a walking stick, and appears to be hiking. Dense trees and rocks can be seen in the background, giving a sense of tranquility and nature.\n\n2. This image is a plate of food. In the center of the plate is a piece of fried chicken, next to it is an ear of corn, and there are several vegetables, including green beans, onions, and what may be mushrooms. Next to the plate, a pair of chopsticks rests on the plate.\n\n3. This image shows a bowl of fruit. The bowl is full of various fruits, including apples, blueberries, and peaches. The fruits are brightly colored and look very fresh.\n\n4. This image shows a glass containing a liquid, possibly water or another beverage. The glass has a very distinctive design, with intricate patterns and decorations.\n\n5. This image is a landscape photo of a path in the mountains. The path winds along, flanked by lush trees and vegetation. A red building, possibly a temple or pavilion, can be seen in the distance. Overall, the image gives a sense of tranquility and nature.', generate_token_len=228, input_token_len=1835, session_id=0, finish_reason='stop', token_ids=[312, 281, 262, 72998, 68467, 69552, 89967, 72661, 70421, 73127, 60631, 71991, 60941, 61527, 60370, 60355, 60564, 71294, 70599, 80342, 74829, 60353, 84035, 61945, 61458, 60353, 81722, 73004, 60454, 63293, 60353, 69835, 60770, 68657, 68274, 87281, 70077, 60355, 68807, 60366, 69363, 62824, 83225, 77054, 60381, 75851, 60353, 80255, 79917, 60458, 68519, 69440, 60355, 402, 314, 281, 262, 72998, 68467, 68321, 60957, 68757, 60355, 60361, 60957, 68787, 68887, 68321, 61006, 61652, 82710, 60353, 70185, 68321, 60772, 69502, 61963, 60353, 68350, 71961, 69644, 60353, 68469, 76367, 60359, 70549, 60381, 69454, 73073, 60354, 68572, 60355, 60361, 60957, 68787, 70185, 60353, 68510, 60968, 71762, 68677, 60957, 71024, 60355, 402, 308, 281, 262, 72998, 68467, 69552, 68849, 80824, 69397, 60354, 61679, 60355, 74452, 60634, 75117, 68459, 69397, 60353, 68469, 68963, 60359, 81216, 60381, 86973, 60355, 69397, 71588, 79057, 60353, 69835, 68335, 70022, 60355, 402, 319, 281, 262, 72998, 68467, 69552, 68849, 80824, 61258, 69743, 69601, 61540, 60353, 69454, 60456, 76531, 71886, 60355, 69601, 61540, 71011, 68335, 70216, 60353, 69703, 75263, 70742, 60381, 69565, 60355, 402, 317, 281, 262, 72998, 68467, 68321, 60862, 70400, 60863, 60353, 69582, 68332, 60631, 60470, 68447, 60586, 60355, 60398, 60586, 91014, 83452, 60353, 60583, 61723, 60357, 62117, 62117, 61303, 61303, 60354, 77054, 60381, 88944, 60355, 81121, 69363, 73127, 74789, 68620, 60353, 69454, 82576, 60535, 62608, 60401, 60355, 69217, 60370, 60353, 72998, 68467, 80255, 79917, 60381, 68519, 69440, 60355], logprobs=None)
```
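The symptom here (five descriptions for three input images) can be flagged automatically. A minimal sketch, assuming the model enumerates its descriptions as `1. `, `2. ` and so on at the start of lines, as in the output above; `count_numbered_items` and `check_description_count` are hypothetical helper names, not part of lmdeploy.

```python
import re

# Matches an enumerator such as "1. " or "12. " at the start of a line.
# [ \t]* (rather than \s*) keeps a match from spanning blank lines.
_ITEM_RE = re.compile(r"^[ \t]*\d+\.[ \t]", re.MULTILINE)


def count_numbered_items(text: str) -> int:
    """Count the numbered items (one per described image) in a response."""
    return len(_ITEM_RE.findall(text))


def check_description_count(text: str, num_images: int) -> bool:
    """Return True if the response describes exactly num_images images."""
    return count_numbered_items(text) == num_images


# Shortened version of the response above: five items for three images.
sample = ("1. A woman hiking on stone steps.\n\n2. A plate of food.\n\n"
          "3. A bowl of fruit.\n\n4. A glass of liquid.\n\n5. A mountain path.")
print(count_numbered_items(sample))        # 5
print(check_description_count(sample, 3))  # False
```

In practice one would pass `response.text` and `len(images)`. A mismatch is only a heuristic signal, since the model may format its answer differently.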

irexyc commented 4 days ago

Take a look at the pinned issue in the InternVL repository; it should explain this. https://github.com/OpenGVLab/InternVL/issues

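As far as I know, the current InternVL-Chat-V1-5 was trained on single images only, so its multi-image ability is weak. They said a model trained on multi-image data would be released in June, but I don't see it on Hugging Face yet.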
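One possible workaround, not proposed in the thread itself, is to avoid multi-image prompts and send each image as a separate single-image request. A minimal sketch, assuming the pipeline accepts a list of `(prompt, image)` tuples and returns one response object with a `.text` field per entry; `describe_individually`, `FakeResponse` and `fake_infer` are hypothetical names, and the stub replaces the real pipeline so the snippet runs without a GPU.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

# A request is a (prompt, image) pair, the single-image form.
Request = Tuple[str, Any]


def describe_individually(
    images: Sequence[Any],
    infer: Callable[[List[Request]], List[Any]],
    prompt: str = "describe this image",
) -> List[str]:
    """Send each image as its own single-image request and collect the texts."""
    # One request per image, so the model never has to tell several
    # images apart inside one context.
    batch: List[Request] = [(prompt, img) for img in images]
    responses = infer(batch)
    # Assumes each returned object exposes the generated text as `.text`.
    return [r.text for r in responses]


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a pipeline response object."""
    text: str


def fake_infer(batch: List[Request]) -> List[FakeResponse]:
    # Stub for `pipe`: reports which image each request carried.
    return [FakeResponse(text=f"description of {img}") for _, img in batch]


print(describe_individually(["img0", "img1", "img2"], fake_infer))
# ['description of img0', 'description of img1', 'description of img2']
```

With lmdeploy installed, the reporter's `pipe` and `images` would replace the stub, e.g. `describe_individually(images, pipe)`. This matches lmdeploy's documented batched VLM usage as far as I know, but should be checked against the installed version (0.4.2 here). The trade-off is that questions comparing images cannot be answered this way.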

stay-leave commented 4 days ago

OK, thanks for the reply.

InternLM / lmdeploy

[Bug] Poor multi-image inference results #1843
