Yuliang-Liu / Monkey

【CVPR 2024 Highlight】Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Performance compared to llava? #5

Closed eddyrivers10 closed 6 months ago

eddyrivers10 commented 11 months ago

Hi authors, I ran images through your demo but got much worse results compared to LLaVA. The captions are very short even though I follow the same detailed-description prompt from the paper and use the same images for both LLaVA and Monkey. The higher resolution also doesn't capture the small text in the correct location. Is something wrong with the demo? I cannot get results close to anything shown in the paper.

Can you also explain how you added the perceiver resampler? Since the perceiver resampler is typically used for videos, is the temporal dimension repurposed for the number of images? To make my question concrete, I sketched my rough understanding below.
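
Here is a minimal PyTorch sketch of what I assume a perceiver-style resampler looks like (names like `num_queries` are my own, not from your code): a fixed set of learned queries cross-attends to the visual tokens, so the output length stays constant no matter how many image tokens, or frames, come in.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Sketch of a perceiver-style resampler; not Monkey's actual code."""
    def __init__(self, dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        # Learned latent queries; their count fixes the output length.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats):
        # visual_feats: (batch, num_tokens, dim); num_tokens may vary,
        # e.g. if a temporal/frame axis is flattened into the sequence.
        b = visual_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(q, visual_feats, visual_feats)
        return self.norm(out)  # (batch, num_queries, dim)

# e.g. patch tokens of two images flattened into one sequence:
feats = torch.randn(1, 2 * 256, 1024)
print(PerceiverResampler()(feats).shape)  # torch.Size([1, 64, 1024])
```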

Thanks in advance.

Yuliang-Liu commented 11 months ago

@eddyrivers10

Dear Eddy,

Could you kindly provide some images for us? We need them to identify the potential causes of the issue we're facing. We've conducted numerous comparisons with LLaVA 1.5 and believe that our results should at least match its performance. Your assistance in sharing these images would be greatly appreciated.

YULUOYUNZHI commented 11 months ago

I chose five images from today for testing and uploaded them for analysis. The first four are from today's 2023 League of Legends World Championship matches, and the fifth is a random shot taken for testing purposes. Based on the images I uploaded, I find that in most cases Monkey performs better than LLaVA 1.5.

eddyrivers10 commented 11 months ago

Hi, I just used the default image of the man ironing on the back of his car from the LLaVA demo (https://llava.hliu.cc/). When I asked for a detailed caption, I got: "A man is ironing his shirt on the back of a car."

The second example I tried was the water image, also from the LLaVA demo. The caption I received was: "A long wooden pier on a lake surrounded by trees."

The captions are very short compared to LLaVA's. However, LLaVA hallucinates more since its captions are longer. Does the demo use the same decoding length? For reference, I sketched below how I usually cap this when running models locally. I wonder how you can make Monkey give descriptive sentences without hallucinating. Thanks!
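
This is the kind of setting I mean (generic Hugging Face `generate()` usage with plain GPT-2 as a stand-in; I don't know how the Monkey demo is actually configured):

```python
# Illustration of how the decoding budget caps caption length;
# GPT-2 is a stand-in here, not the Monkey demo's actual settings.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("A detailed description of the image:", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=64,  # raising this budget permits longer captions
    do_sample=False,    # greedy decoding tends to be more conservative
)
print(tok.decode(out[0], skip_special_tokens=True))
```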

echo840 commented 11 months ago

Hello. You should click the "Generate" button to get a detailed description of the image. For the VQA task, we use the format "[question] Answer: [answer]". For the detailed caption task, we use the format "Generate the detailed caption in English: [caption]". If you enter 'Generate the detailed caption in English:' and click the "Submit" button, the model's input becomes 'Generate the detailed caption in English: Answer:'. This input deviates from what the model was trained on. A rough sketch of the two input paths is below.
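
In pseudo-code, the two buttons assemble the model input roughly like this (my own illustrative helper; the demo's internal code may differ):

```python
def build_model_input(text, button):
    # Illustrative helper, not the demo's actual code.
    if button == "Submit":     # VQA path -> "[question] Answer: [answer]"
        return f"{text} Answer:"
    if button == "Generate":   # detailed-caption path
        return "Generate the detailed caption in English:"
    raise ValueError(button)

# Entering the caption prefix and clicking "Submit" produces an input
# the model never saw during training:
print(build_model_input("Generate the detailed caption in English:", "Submit"))
# -> "Generate the detailed caption in English: Answer:"
```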

eddyrivers10 commented 11 months ago

Hi authors, sorry if I used it wrongly; I haven't had the chance to check the demo again since web traffic is high right now. It looks like I clicked the wrong option, which is why the captions are so short.

From the captions above, it looks like they are shorter but more factual. Do you know if there is a comparison of hallucination against LLaVA and other higher-resolution models? Thank you!

shipengai commented 11 months ago

@Yuliang-Liu In the Monkey paper, is the "llava1.5 pretrained with CC3M" entry that Table 6 mentions the same as the original LLaVA1.5? Is its training data the same as the original LLaVA1.5-13B's?

echo840 commented 11 months ago

Hello. Our experiments involved a comparative analysis in which we pretrained LLaVA1.5 with the same original pretraining data from LLaVA, except that we replaced 427k of the text annotations with our generated ones. The instruction tuning data were sourced exclusively from LLaVA1.5.

shipengai commented 11 months ago

@echo840 Thanks for your reply. When pretrained with the LLaVA data, the MMBench score is 68.3, which is higher than LLaVA1.5's 67.7. This is an amazing result.