[Bug] InternVL model gives incorrect answers about image content after accelerated inference

seven1122 commented 6 days ago

Checklist

[x] 1. I have searched related issues but cannot get the expected help.
[ ] 2. The bug has not been fixed in the latest version.

Describe the bug

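With the same model, InternVL-Chat-V1-5, the replies are correct without acceleration, but after accelerated inference the answers about the image are unstable or incorrect.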

Reproduction

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('models/InternVL-Chat-V1-5', backend_config=TurbomindEngineConfig(tp=1))
gen_config = GenerationConfig(temperature=0.6, max_new_tokens=512)
image = load_image('https://xxxxxxxx.jpeg')
# Query: "What is the annual premium per person for class 3 occupations?"
response = pipe(('三类职业的每人年保费是多少？', image), gen_config=gen_config)
print(response)

Environment

lmdeploy==0.4.2

Error traceback

No response

irexyc commented 6 days ago

Do you mean the result looks normal with transformers inference but looks wrong with LMDeploy inference? Could you provide the reproduction code and data?

seven1122 commented 6 days ago

Yes, that's right. The transformers inference code is exactly the code given on HF: https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5
The lmdeploy inference code is basically the same as the example in your documentation:

from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

pipe = pipeline('models/InternVL-Chat-V1-5', backend_config=TurbomindEngineConfig(tp=1))
gen_config = GenerationConfig(max_new_tokens=512, temperature=0.6)
image = load_image('https://xxxx.jpg')
# Query: "What is the annual premium per person for class 3 occupations"
response = pipe(('三类职业的每人年保费是多少', image), gen_config=gen_config)

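Both use the same model path, the same image URL (the image cannot be shared because it is on an internal network), the same query, and the same parameters. However, the result returned after lmdeploy accelerated inference is incorrect and not stable: the same query returns different results across calls, and only sometimes is the result correct.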

irexyc commented 6 days ago

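If top_k is not 1, generation is inherently random. With top_k not equal to 1, different random_seed values in GenerationConfig also introduce randomness.

If top_k is set to 1 on both sides, do the results still differ much?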
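
To illustrate why top_k=1 removes sampling randomness, here is a minimal, self-contained sketch of top-k sampling over toy logits. It does not call lmdeploy or transformers; `sample_top_k` and the logits are hypothetical stand-ins, and real samplers may differ in detail.

```python
import math
import random


def sample_top_k(logits, top_k, temperature, rng):
    """Sample one token id from the top_k highest logits (illustrative only)."""
    # Keep only the top_k candidates, highest logit first.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    candidates = ranked[:top_k]
    # Softmax weights over the surviving candidates at the given temperature.
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.3, -1.0]  # toy next-token scores, not real model output

# top_k=1: only the argmax survives, so every seed yields the same token.
greedy = {sample_top_k(logits, 1, 0.6, random.Random(seed)) for seed in range(20)}

# top_k=3: different seeds can pick different tokens...
sampled = {sample_top_k(logits, 3, 0.6, random.Random(seed)) for seed in range(200)}

# ...while a fixed seed reproduces the same choice every time.
same_seed = [sample_top_k(logits, 3, 0.6, random.Random(7)) for _ in range(5)]
```

With top_k=1 the seed has no influence, which is why a greedy comparison between the two backends isolates differences that do not come from sampling.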

seven1122 commented 6 days ago

Neither side sets random_seed manually. I also tried top_k=1, and it made no difference.

seven1122 commented 5 days ago

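Setting top_k to 1 made no difference either. What I mainly see now is that when extracting information from the image, the extracted result is incorrect, although it does come from information inside the image. Could this be an lmdeploy problem, or are my parameter settings wrong? Is there any guidance on how to set the relevant parameters?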

irexyc commented 5 days ago

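Setting top_k to 1 is mainly meant to rule out the effect of sampling, so that we can compare whether the greedy-search results on the two sides differ significantly.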

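The LLM part of InternVL-Chat-V1-5 uses InternLM2, and the accuracy of that part should already have been tested. To locate the problem, the only option is to check whether the input_ids and input_embeddings here differ from those on the InternVL-Chat-V1-5 side.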

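The usual LLM flow is input_ids -> input_embeds -> ... decoder ... . A VLM works much the same way, except that part of input_embeds holds image features. lmdeploy handles this by inserting some dummy ids (0) into the middle of input_ids and reusing the transformers code to extract the image features. It then replaces the features at the dummy id positions in input_embeds with the real image features, and the subsequent decode logic is unchanged.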
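
The flow described above can be sketched with plain Python lists. This is a toy illustration, not lmdeploy's actual code: `build_inputs`, `embed_with_image`, the embedding table, and the feature values are all hypothetical.

```python
DUMMY_ID = 0  # placeholder id standing in for image positions


def build_inputs(text_before, text_after, num_image_tokens):
    """Insert dummy ids between two text segments and record their range."""
    begin = len(text_before)
    end = begin + num_image_tokens
    input_ids = text_before + [DUMMY_ID] * num_image_tokens + text_after
    return input_ids, (begin, end)


def embed_with_image(input_ids, embedding_table, image_features, image_range):
    """Look up token embeddings, then overwrite the dummy span with image features."""
    begin, end = image_range
    if end - begin != len(image_features):
        raise ValueError("image feature count must match the dummy span")
    input_embeds = [list(embedding_table[t]) for t in input_ids]
    input_embeds[begin:end] = [list(f) for f in image_features]
    return input_embeds


# Toy vocabulary of 4 tokens with 2-dimensional embeddings.
embedding_table = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
# Stand-in for the vision model output (3 image tokens).
image_features = [[9.0, 9.1], [9.2, 9.3], [9.4, 9.5]]

input_ids, image_range = build_inputs([1, 2], [3], num_image_tokens=3)
input_embeds = embed_with_image(input_ids, embedding_table, image_features, image_range)
```

After the replacement, the decoder only ever sees input_embeds, so a mismatch in either the token ids or the image features changes its input.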

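You can replace the values here with the input_ids and image embedding from the InternVL-Chat-V1-5 side (adjusting ranges accordingly) and check the result again, to see whether the difference is caused by the tokenizer or by the vision model output.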
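
Before swapping values in, it may help to measure how far apart the two sides are. The sketch below assumes the input_ids and image features from both pipelines have already been dumped to plain Python lists; the helper names and the toy data are hypothetical, and how to extract those tensors from either library is not shown.

```python
import math


def compare_ids(ids_a, ids_b):
    """Return mismatching (position, id_a, id_b) triples and whether lengths match."""
    mismatches = [(i, a, b) for i, (a, b) in enumerate(zip(ids_a, ids_b)) if a != b]
    return mismatches, len(ids_a) == len(ids_b)


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def compare_features(feats_a, feats_b):
    """Return (max absolute difference, minimum per-row cosine similarity)."""
    if len(feats_a) != len(feats_b):
        raise ValueError("feature row counts differ")
    max_abs = max(abs(x - y) for ra, rb in zip(feats_a, feats_b) for x, y in zip(ra, rb))
    min_cos = min(cosine_similarity(ra, rb) for ra, rb in zip(feats_a, feats_b))
    return max_abs, min_cos


# Toy data only: one token id differs, and one feature row differs.
id_mismatches, same_length = compare_ids([1, 5, 9, 4], [1, 5, 7, 4])
identical = compare_features([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
different = compare_features([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]])
```

Any id mismatch points at the tokenizer or prompt template, while a large feature difference points at image preprocessing or the vision model.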

InternLM / lmdeploy