InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Docs] How are multiple images handled? #1686

Open pseudotensor opened 3 months ago

pseudotensor commented 3 months ago

📚 The doc issue

InternVL 1.5 handles multiple images, even though it was not trained for that, as the authors say. But I can't see whether or how lmdeploy handles that.

In other cases, models like cogvlm2 may not work with multiple images. How is that handled?

etc.

Suggest a potential alternative/fix

No response

irexyc commented 3 months ago

There are some docs you can refer to. For the pipeline, you can pass a list of images instead of a single image. For serving, we follow the OpenAI GPT-4V format, which supports multi-image inputs.

https://github.com/InternLM/lmdeploy/blob/main/docs/en/inference/vl_pipeline.md#multi-images-inference
https://github.com/InternLM/lmdeploy/blob/main/docs/en/serving/api_server_vl.md#integrate-with-openai
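
For reference, multi-image inference with the pipeline looks roughly like the example in the linked vl_pipeline doc; the model name and image URLs below are placeholders:

from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Build a VL pipeline (any supported VLM; InternVL-Chat-V1-5 used here as an example).
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5')

# Pass a list of images together with one prompt; the pipeline inserts the
# image placeholders into the prompt internally.
image_urls = [
    'https://example.com/image1.jpg',  # placeholder URL
    'https://example.com/image2.jpg',  # placeholder URL
]
images = [load_image(url) for url in image_urls]
response = pipe(('describe these two images', images))
print(response.text)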

The key question is how to insert the special image token into the prompt. When the input contains multiple images, we follow the InternVL-Chat-V1-5 strategy and put the image tokens before the user question.
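
As a rough illustration of that strategy, the sketch below mirrors the description above; the helper function is hypothetical, and the actual chat-template handling lives inside lmdeploy:

from lmdeploy.vl.constants import IMAGE_TOKEN

def decorate_prompt(question: str, num_images: int) -> str:
    # Hypothetical sketch: prepend one image placeholder per input image
    # before the user question, as described above.
    return ''.join(f'{IMAGE_TOKEN}\n' for _ in range(num_images)) + question

# e.g. decorate_prompt('compare these two images', 2)
# -> something like '<IMAGE_TOKEN>\n<IMAGE_TOKEN>\ncompare these two images'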

However, InternVL doesn't provide examples showing whether the user can pass images after the first round. I submitted an issue asking whether InternVL supports multi-image interleaved conversations, but I haven't received a response yet.

irexyc commented 3 months ago

According to their pinned issue https://github.com/OpenGVLab/InternVL/issues, the 'multi-image interleaved' model will be released in June.

pseudotensor commented 3 months ago

Yes, but they also say at the end of the paper (and blog) that the model can handle multiple images despite not being trained for it:

https://arxiv.org/abs/2404.16821
https://internvl.github.io/blog/2024-04-30-InternVL-1.5/

Multi-Image Dialogue. As shown in Figure 12, in this experiment, we ask InternVL 1.5 and GPT-4V to compare the similarities and differences between the two images. As can be seen, both GPT-4V and InternVL 1.5 provide detailed and accurate responses. Through this experiment, we discovered that although InternVL 1.5 was trained solely on single-image inputs, it exhibits strong zero-shot capabilities for multi-image dialogues.

irexyc commented 3 months ago

I am not sure we have the same understanding of the term "multi-image interleaved conversations". It means that the user can provide images as input in each round of the conversation.

Currently, InternVL 1.5 supports multiple images as input, but it seems that it only supports using images in the first round of the conversation; otherwise the results will not be good.

And lmdeploy follows their strategy, so the user can provide multiple images as input.

pseudotensor commented 3 months ago

Yes, you are probably right; I was just wondering what lmdeploy will do. I guess one can pass in multiple images in the first round using the normal GPT-4V chat API style (roughly as in the sketch below), I'm just not sure it is really handled. E.g. maybe it ignores the other images.
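
For concreteness, a first-round request with two images in the GPT-4V / OpenAI chat style against lmdeploy's api_server would look roughly like this; the server address and image URLs are placeholders:

from openai import OpenAI

# Placeholder server address; the api_server must already be running with a VLM.
client = OpenAI(api_key='EMPTY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text',
             'text': 'Describe the similarities and differences between these two images.'},
            {'type': 'image_url',
             'image_url': {'url': 'https://example.com/image1.jpg'}},  # placeholder
            {'type': 'image_url',
             'image_url': {'url': 'https://example.com/image2.jpg'}},  # placeholder
        ],
    }],
    temperature=0.8,
)
print(response.choices[0].message.content)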

MrD005 commented 2 months ago

Batch prompts inference: how can this be used in an OpenAI-compatible deployment?

liangofthechen commented 1 month ago

@irexyc Hello. I have a question about where to insert multiple images into the prompt. Here is my code:

from lmdeploy import pipeline
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

pipe = pipeline('internlm/internlm-xcomposer2d5-7b')

imgpaths = []  # list of image file paths
images = [load_image(imgpath) for imgpath in imgpaths]

# Prompt 1: no explicit image tokens. It asks, given the background images
# (a green crying face, a rising arrow, a green magnifying glass), what the image shows.
response1 = pipe(('请你根据背景回答以下问题 是一个绿色哭着的脸 是一个上升的箭头 是一个绿色的放大镜 请你告诉我图片是什么', images))

# Prompt 2: the same question, but with IMAGE_TOKEN inserted before each image description.
response2 = pipe((f'请你根据背景回答以下问题 {IMAGE_TOKEN}是一个绿色哭着的脸 {IMAGE_TOKEN}是一个上升的箭头 '
                  f'{IMAGE_TOKEN}是一个绿色的放大镜 请你告诉我图片{IMAGE_TOKEN}是什么', images))

I found that neither of the two prompts above gives good results. Is there a problem with how I inserted the image tokens? What is the correct way to do it?

Here is my environment information:

sys.platform: linux
Python: 3.9.19 | packaged by conda-forge | (main, Mar 20 2024, 12:50:21) [GCC 12.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: Tesla V100-PCIE-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.0, V11.0.221
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 2.2.2+cu118
PyTorch compiling details: PyTorch built with:

TorchVision: 0.17.2+cu118
LMDeploy: 0.5.1+
transformers: 4.33.2
gradio: 3.44.4
fastapi: 0.110.3
pydantic: 2.7.1
triton: 2.2.0

NVIDIA Topology:
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0    X     PIX   PIX   PIX   PHB   PHB   PHB   PHB   PHB   0-13,28-41    0             N/A
GPU1    PIX   X     PIX   PIX   PHB   PHB   PHB   PHB   PHB   0-13,28-41    0             N/A
GPU2    PIX   PIX   X     PIX   PHB   PHB   PHB   PHB   PHB   0-13,28-41    0             N/A
GPU3    PIX   PIX   PIX   X     PHB   PHB   PHB   PHB   PHB   0-13,28-41    0             N/A
GPU4    PHB   PHB   PHB   PHB   X     PIX   PIX   PIX   PIX   0-13,28-41    0             N/A
GPU5    PHB   PHB   PHB   PHB   PIX   X     PIX   PIX   PIX   0-13,28-41    0             N/A
GPU6    PHB   PHB   PHB   PHB   PIX   PIX   X     PIX   PIX   0-13,28-41    0             N/A
GPU7    PHB   PHB   PHB   PHB   PIX   PIX   PIX   X     PIX   0-13,28-41    0             N/A
NIC0    PHB   PHB   PHB   PHB   PIX   PIX   PIX   PIX   X

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_bond_0

Many thanks.