OpenBMB / MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Apache License 2.0

Question about counting and reasoning hallucinations in VLMs #214

Closed kaixin-bai closed 4 months ago

kaixin-bai commented 4 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

Screenshot_2024-06-04_11-43-32

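The repo mentions that it builds on the LLaVA-UHD paper, which says that improving how images are sliced mitigates the counting problem in VLMs. In repeated tests, however, I found that the plastic fruits in the image get counted more than once, and the object that is double-counted differs from run to run: sometimes it is the green pear, sometimes the purple plum, and sometimes the strawberry is not detected at all. Was counting hallucination taken into account when this model was trained?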

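I also tested this scene earlier with ChatGPT's multimodal model. When the multimodal model was first released, it identified all the plastic fruits very accurately, with no counting problems. In my recent tests (2024.05.04), the multimodal performance of ChatGPT 4 and 4o on the same image has become much worse, possibly because a lighter-weight model is now being used.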

Expected Behavior

No response

Steps To Reproduce

111294105 The test case is shown in the image.

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
Tested with the online demo on Hugging Face.

Anything else?

No response

Cuiunbo commented 4 months ago

Hello, the question you raise is fascinating. Here's the thing: LLaVA-UHD solves two problems:

  1. It takes HD images at their original size as input.
  2. The slices of the input image do not overlap (whereas GPT-4V's do).

But our model still has some basic hallucinations; for example, the hallucination scores on the objhall benchmark in the table are not zero. Both Gemini Pro and GPT-4V still show some of these hallucinations as well. This is a problem worth addressing, and we'll keep looking for a solution, so stay tuned!
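To make the "no overlapping parts" point concrete, here is a minimal sketch of cutting an image into a grid of slices that share edges but no pixels. It is not the actual slicing code of LLaVA-UHD or MiniCPM-V, which picks the slicing from the image's own resolution; the names `slice_boxes` and `overlap_area` and the fixed 3x2 grid are assumptions made only for illustration.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def slice_boxes(width: int, height: int, cols: int, rows: int) -> List[Box]:
    """Split a width x height image into a cols x rows grid of non-overlapping boxes.

    Hypothetical helper for illustration; the grid size is chosen by the caller.
    """
    if not (1 <= cols <= width and 1 <= rows <= height):
        raise ValueError("cols and rows must be between 1 and the image size")
    # Edge i sits at i * size // n, so neighbouring slices share an edge
    # and no pixel is dropped or covered twice.
    xs = [i * width // cols for i in range(cols + 1)]
    ys = [j * height // rows for j in range(rows + 1)]
    return [
        (xs[i], ys[j], xs[i + 1], ys[j + 1])
        for j in range(rows)
        for i in range(cols)
    ]


def overlap_area(a: Box, b: Box) -> int:
    """Return the number of pixels covered by both boxes (0 if they are disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


if __name__ == "__main__":
    boxes = slice_boxes(1000, 750, cols=3, rows=2)
    for box in boxes:
        print(box)
    # Every pixel belongs to exactly one slice.
    total = sum((r - l) * (b - t) for l, t, r, b in boxes)
    print("covered pixels:", total, "of", 1000 * 750)
```

With slices like these, an object lying on a slice boundary is split between slices rather than appearing whole in two of them. That removes one possible source of double counting, but as noted above it does not rule out hallucinated counts.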