[Bug] InternVL2-Llama3-76B-AWQ infer result is wrong

AmazDeng commented 3 months ago

Checklist

[X] 1. I have searched related issues but cannot get the expected help.
[x] 2. The bug has not been fixed in the latest version.
[ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I used lmdeploy to load the InternVL2-Llama3-76B-AWQ model for inference. My inference mode is to input two images at a time and ask the model whether the two images are the same. I conducted such inferences 300 times in total and found that all the results were "Yes"(300 different pictures). However, when I tested with InternVL2-40B-AWQ, there was no such issue, with some results being "Yes" and some "No". The inference code used by the two models is exactly the same, only the model paths are different. Clearly, most of the results from InternVL2-40B-AWQ are correct, while most of the results from InternVL2-Llama3-76B-AWQ are incorrect. Why is this?

Reproduction

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy import GenerationConfig

# model_path = '/media/star/disk2/pretrained_model/InternVL/InternVL2/InternVL2-40B-AWQ'
model_path = '/media/star/disk2/pretrained_model/InternVL/InternVL2/InternVL2-Llama3-76B-AWQ'
model = pipeline(model_path, backend_config=TurbomindEngineConfig(session_len=8192))

target_images=["190.jpg","195.jpg","196.jpg","266.jpg","343.jpg","638.jpg","1109.jpg","1200.jpg","1476.jpg"]

for ele in target_images:
    ref_image_pixel = load_image(f"./ref/1.jpg")
    target_image_pixel = load_image(f"./target/{ele}")

    question="Image-1: <image>\\nImage-2: <image>\\nAre these two pieces of coats exactly the same except for the color? Answer Yes or No in one word."

    response_l = model([(question, [ref_image_pixel,target_image_pixel])],
                          gen_config=GenerationConfig(
                              max_new_tokens=1024,
                              top_p=1
                          ))
    print(response_l[0].text)

image examples image.zip

InternVL2-40B-AWQ infer result: 190:Yes 195:Yes 196:Yes 266:No 343:Yes 638:No 1109:No 1200:No 1476:No

InternVL2-Llama3-76B-AWQ infer result: 190:Yes 195:Yes 196:Yes 266:Yes 343:Yes 638:Yes 1109:Yes 1200:Yes 1476:Yes

Environment

sys.platform: linux
Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.91
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.3.1+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.18.1+cu121
LMDeploy: 0.5.3+
transformers: 4.44.2
gradio: Not Found
fastapi: 0.112.2
pydantic: 2.8.2
triton: 2.3.1
NVIDIA Topology: 
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      20-39,60-79     1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks



### Error traceback

_No response_

AmazDeng commented 3 months ago

@czczup @whai362 @ErfeiCui @hjh0119 @lvhan028 @Adushar @Weiyun1025 @cg1177 @opengvlab-admin @qishisuren123 @dlutwy Could you please take a look at this issue?

lvhan028 commented 3 months ago

Could you try to verify this case with the unquantized InternVL2-Llama3-76B model?

irexyc commented 3 months ago

@AmazDeng

Can you try if this question works?

question="Image-1: <img><IMAGE_TOKEN></img>\nImage-2: <img><IMAGE_TOKEN></img>\nAre these two pieces of coats exactly the same except for the color? Answer Yes or No in one word."

AmazDeng commented 3 months ago

I have updated the test images and the code. You can also test this case on your local machine. I only have an A100 80G graphics card, so I can only load the AWQ version, not the non-quantized version. @irexyc @lvhan028

irexyc commented 3 months ago

I have updated the test images and the code. You can also test this case on your local machine.

I will test it later today.

AmazDeng commented 3 months ago

@AmazDeng

Can you try if this question works?

question="Image-1: <img><IMAGE_TOKEN></img>\nImage-2: <img><IMAGE_TOKEN></img>\nAre these two pieces of coats exactly the same except for the color? Answer Yes or No in one word."

@irexyc I've tested it, your prompt works, and the results are normal. The prompt provided by the official website also works: Image-1: <IMAGE_TOKEN>\ Image-2: <IMAGE_TOKEN>\nAre these two pieces of coats exactly the same except for the color? Answer Yes or No in one word.The results are also normal.

It seems I misunderstood the prompt format. I directly took the prompt from PyTorch, which appears to work on lmdeploy+InternVL2-40B-AWQ, but does not function correctly on lmdeploy+InternVL2-Llama3-76B-AWQ.

However, based on my test results, InternVL2-Llama3-76B-AWQ's capabilities are not as good as InternVL2-40B-AWQ's.

AmazDeng commented 3 months ago

@irexyc I noticed that the prompt you provided contains the symbol, which is not included in the official version(f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images',https://internvl.readthedocs.io/en/latest/internvl2.0/deployment.html#multi-images-inference).

Also,I noticed that there is an inconsistency between the inference sections on the lmdeploy website for InternVL and the inference section of InternVL itself. Specifically, the prompt formats are different: the former includes <img></img>, while the latter does not.

My questions are:

Are the prompts in the two pieces of code equivalent?One contains the <img></img> symbol, the other does not contain this symbol
Is the inference code equivalent?

Here is the inference code from the lmdeploy website for InternVL（https://lmdeploy.readthedocs.io/en/latest/multi_modal/internvl.html）:

from lmdeploy import pipeline, GenerationConfig
from lmdeploy.vl.constants import IMAGE_TOKEN

pipe = pipeline('OpenGVLab/InternVL2-8B', log_level='INFO')
messages = [
    dict(role='user', content=[
        dict(type='text', text=f'Image-1: <img>{IMAGE_TOKEN}</img>\nImage-2: <img>{IMAGE_TOKEN}</img>\nDescribe the two images in detail.'),
        dict(type='image_url', image_url=dict(max_dynamic_patch=12, url='https://raw.githubusercontent.com/OpenGVLab/InternVL/main/internvl_chat/examples/image1.jpg')),
        dict(type='image_url', image_url=dict(max_dynamic_patch=12, url='https://raw.githubusercontent.com/OpenGVLab/InternVL/main/internvl_chat/examples/image2.jpg'))
    ])
]
out = pipe(messages, gen_config=GenerationConfig(top_k=1))

messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='What are the similarities and differences between these two images.'))
out = pipe(messages, gen_config=GenerationConfig(top_k=1))

Here is the inference code from InternVL2（https://internvl.readthedocs.io/en/latest/internvl2.0/deployment.html#multi-images-inference）:

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL2-Llama3-76B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

image_urls=[
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)

irexyc commented 3 months ago

@AmazDeng

Are the prompts in the two pieces of code equivalent?One contains the symbol, the other does not contain this symbol

In short, <img>{IMAGE_TOKEN}</img>\n is the right smybol.

If you don't add image token to prompt but provide image input, the official code will actually add <img>{place holder}...</img>\n before the question. The behavior of lmdeploy is same as official code except for the {place holder} token. But that doesn't matter as the {place holder} will be eventually replace by image features.

While if you wan't to customize the location of the image token in lmdeploy, currently you should use <img>{IMAGE_TOKEN}</img>\n for internvl2 model. This is indeed confusing and inconsistent with other vlm models. I think we will remove <img>/</img> and use <IMAGE_TOKEN> instead in the next release.

Is the inference code equivalent?

Compared with transformers, there are two differences. One is that the ViT in lmdeploy is inference in fp16 mode. The other is the kernel implement (gemm, attention). Besides these two differences, the inference logic is same with transformers.

AmazDeng commented 3 months ago

Understood, thank you for your reply. If it's convenient, could you please help me resolve another issue I've raised?https://github.com/OpenGVLab/InternVL/issues/549 @irexyc

OpenGVLab / InternVL