AmazDeng opened this issue 2 months ago
@czczup @whai362 @ErfeiCui @hjh0119 @lvhan028 @Adushar @Weiyun1025 @cg1177 @opengvlab-admin @qishisuren123 @dlutwy Could you please take a look at this issue?
Could you try to verify this case with the unquantized InternVL2-Llama3-76B model?
@AmazDeng
Could you try whether this question works?
question="Image-1: <img><IMAGE_TOKEN></img>\nImage-2: <img><IMAGE_TOKEN></img>\nAre these two pieces of coats exactly the same except for the color? Answer Yes or No in one word."
I have updated the test images and the code. You can also test this case on your local machine. I only have an A100 80G graphics card, so I can only load the AWQ version, not the non-quantized version. @irexyc @lvhan028
I will test it later today.
> @AmazDeng
> Could you try whether this question works?
> question="Image-1: <img><IMAGE_TOKEN></img>\nImage-2: <img><IMAGE_TOKEN></img>\nAre these two pieces of coats exactly the same except for the color? Answer Yes or No in one word."
@irexyc
I've tested it; your prompt works, and the results are normal. The prompt provided by the official website also works: `Image-1: <IMAGE_TOKEN>\nImage-2: <IMAGE_TOKEN>\nAre these two pieces of coats exactly the same except for the color? Answer Yes or No in one word.`
The results are also normal.
It seems I misunderstood the prompt format. I directly took the prompt from PyTorch, which appears to work on lmdeploy+InternVL2-40B-AWQ, but does not function correctly on lmdeploy+InternVL2-Llama3-76B-AWQ.
However, based on my test results, InternVL2-Llama3-76B-AWQ's capabilities are not as good as InternVL2-40B-AWQ's.
@irexyc I noticed that the prompt you provided contains the `<img></img>` symbol, which is not included in the official version (`f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images'`, https://internvl.readthedocs.io/en/latest/internvl2.0/deployment.html#multi-images-inference).
I also noticed an inconsistency between the InternVL inference section on the lmdeploy website and the inference section in InternVL's own documentation. Specifically, the prompt formats are different: the former includes `<img></img>`, while the latter does not.
My questions are:
1. Are the prompts in the two pieces of code equivalent? One contains the `<img></img>` symbol, the other does not.
2. Is the inference code equivalent?

Here is the inference code from the lmdeploy website for InternVL (https://lmdeploy.readthedocs.io/en/latest/multi_modal/internvl.html):
```python
from lmdeploy import pipeline, GenerationConfig
from lmdeploy.vl.constants import IMAGE_TOKEN

pipe = pipeline('OpenGVLab/InternVL2-8B', log_level='INFO')
messages = [
    dict(role='user', content=[
        dict(type='text',
             text=f'Image-1: <img>{IMAGE_TOKEN}</img>\nImage-2: <img>{IMAGE_TOKEN}</img>\nDescribe the two images in detail.'),
        dict(type='image_url',
             image_url=dict(max_dynamic_patch=12,
                            url='https://raw.githubusercontent.com/OpenGVLab/InternVL/main/internvl_chat/examples/image1.jpg')),
        dict(type='image_url',
             image_url=dict(max_dynamic_patch=12,
                            url='https://raw.githubusercontent.com/OpenGVLab/InternVL/main/internvl_chat/examples/image2.jpg'))
    ])
]
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='What are the similarities and differences between these two images.'))
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
```
Here is the inference code from InternVL2 (https://internvl.readthedocs.io/en/latest/internvl2.0/deployment.html#multi-images-inference):

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL2-Llama3-76B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

image_urls = [
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]
images = [load_image(img_url) for img_url in image_urls]

# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)
```
@AmazDeng
> Are the prompts in the two pieces of code equivalent? One contains the `<img></img>` symbol, the other does not.
In short, `<img>{IMAGE_TOKEN}</img>\n` is the right symbol.
If you don't add an image token to the prompt but do provide image input, the official code will actually add `<img>{place holder}...</img>\n` before the question. The behavior of lmdeploy is the same as the official code except for the {place holder} token. But that doesn't matter, as the {place holder} will eventually be replaced by image features.
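A rough sketch of the behavior described above (this is NOT lmdeploy's actual implementation; `expand_prompt` is a hypothetical helper for illustration):

```python
# Illustrative sketch: if the user prompt contains no image token, one
# <img>...</img> group per input image is prepended before the question.
IMAGE_TOKEN = '<IMAGE_TOKEN>'  # lmdeploy's user-facing placeholder

def expand_prompt(question: str, num_images: int) -> str:
    """Hypothetical helper mimicking the auto-prepend behavior described above."""
    if IMAGE_TOKEN in question:
        return question  # user already placed the tokens explicitly
    prefix = ''.join(f'<img>{IMAGE_TOKEN}</img>\n' for _ in range(num_images))
    return prefix + question

print(expand_prompt('describe these two images', 2))
```

This is why both prompt styles "work": a prompt without any token gets the wrapped placeholders prepended automatically, while a prompt that already contains the token is left untouched.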
If you want to customize the location of the image token in lmdeploy, you should currently use `<img>{IMAGE_TOKEN}</img>\n` for InternVL2 models. This is indeed confusing and inconsistent with other VLM models. I think we will remove `<img>`/`</img>` and use `<IMAGE_TOKEN>` instead in the next release.
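So with the current release, a numbered multi-image prompt can be built like this (a sketch; `build_prompt` is a hypothetical helper, and the `except` fallback only lets the snippet run without lmdeploy installed):

```python
# Build a numbered multi-image prompt with the <img>...</img> wrapping that
# currently works for InternVL2 in lmdeploy.
try:
    from lmdeploy.vl.constants import IMAGE_TOKEN
except ImportError:  # fallback so the sketch runs without lmdeploy
    IMAGE_TOKEN = '<IMAGE_TOKEN>'

def build_prompt(question: str, num_images: int) -> str:
    """Number each image and wrap its token in <img></img>."""
    parts = [f'Image-{i + 1}: <img>{IMAGE_TOKEN}</img>' for i in range(num_images)]
    return '\n'.join(parts) + '\n' + question

print(build_prompt('Are these two pieces of coats exactly the same except for '
                   'the color? Answer Yes or No in one word.', 2))
```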
> Is the inference code equivalent?
Compared with transformers, there are two differences. One is that the ViT in lmdeploy runs inference in fp16 mode. The other is the kernel implementation (GEMM, attention). Apart from these two differences, the inference logic is the same as transformers.
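A toy illustration of why the fp16 difference alone can matter (the logit values are hypothetical, not taken from the model): fp16 has roughly three decimal digits of precision, so two near-tied fp32 logits can collapse to the same fp16 value and flip an argmax, i.e. a borderline Yes/No choice.

```python
import numpy as np

# Hypothetical near-tied logits for two answer tokens, e.g. "No" vs "Yes".
logits32 = np.array([3.1415, 3.1416], dtype=np.float32)
logits16 = logits32.astype(np.float16)  # both values round to the same fp16 number

print(np.argmax(logits32))  # 1: fp32 still separates the two values
print(np.argmax(logits16))  # 0: the fp16 tie resolves to the first index
```

This does not by itself explain why only the 76B-AWQ model degrades, but it shows the mechanism by which fp16 plus quantization can shift borderline decisions.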
Understood, thank you for your reply. If it's convenient, could you please help me resolve another issue I've raised? https://github.com/OpenGVLab/InternVL/issues/549 @irexyc
Describe the bug
I used lmdeploy to load the InternVL2-Llama3-76B-AWQ model for inference. My inference mode is to input two images at a time and ask the model whether the two images are the same. I ran 300 such inferences (300 different picture pairs) and found that every result was "Yes". However, when I tested with InternVL2-40B-AWQ there was no such issue: some results were "Yes" and some "No". The inference code used for the two models is exactly the same; only the model paths differ. Clearly, most of the results from InternVL2-40B-AWQ are correct, while most of the results from InternVL2-Llama3-76B-AWQ are incorrect. Why is this?
Reproduction
Image examples: image.zip
InternVL2-40B-AWQ infer result: 190:Yes 195:Yes 196:Yes 266:No 343:Yes 638:No 1109:No 1200:No 1476:No
InternVL2-Llama3-76B-AWQ infer result: 190:Yes 195:Yes 196:Yes 266:Yes 343:Yes 638:Yes 1109:Yes 1200:Yes 1476:Yes
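Diffing the two result lines (IDs and answers copied verbatim from above) makes the pattern explicit: the models disagree only on images where the 40B model answered "No".

```python
# Compare the per-image answers of the two AWQ models quoted above.
line_40b = '190:Yes 195:Yes 196:Yes 266:No 343:Yes 638:No 1109:No 1200:No 1476:No'
line_76b = '190:Yes 195:Yes 196:Yes 266:Yes 343:Yes 638:Yes 1109:Yes 1200:Yes 1476:Yes'

def parse(line: str) -> dict:
    """Map image id -> answer, e.g. {'190': 'Yes', ...}."""
    return dict(pair.split(':') for pair in line.split())

r40, r76 = parse(line_40b), parse(line_76b)
diffs = {k: (r40[k], r76[k]) for k in r40 if r40[k] != r76[k]}
print(diffs)  # every disagreement is 40B 'No' vs 76B 'Yes'
```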
Environment