X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
https://www.modelscope.cn/studios/damo/mPLUG-Owl
MIT License

How much GPU memory does mPLUG-Owl need, and how does it compare with MiniGPT-4's 13B model? #9

Closed wacdev closed 1 year ago

wacdev commented 1 year ago

How much GPU memory does mPLUG-Owl need, and how does it compare with MiniGPT-4's 13B model?

butyuhao commented 1 year ago

mPLUG-Owl requires around 16 GB of GPU memory. In terms of performance, it outperforms MiniGPT-4 in several respects, including multilingual understanding, multi-image understanding, and OCR. You can find the details in the research paper or try the demo yourself.
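For readers who want to sanity-check a figure like this, here is a rough back-of-the-envelope sketch. It only counts the memory needed to hold the weights; activations, the KV cache, and framework overhead come on top. The parameter count of roughly 7 billion is an assumption made for illustration and is not stated in this thread, and `weight_memory_gib` is a hypothetical helper name.

```python
# Bytes needed to store one parameter in common weight formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype="fp16"):
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Assumption: about 7e9 parameters in total (not confirmed in this thread).
ASSUMED_PARAMS = 7e9

for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: {weight_memory_gib(ASSUMED_PARAMS, dtype):.1f} GiB for weights only")
```

Under that assumption, half-precision weights alone take most of a 16 GB card, which is consistent with the order of magnitude quoted above.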

wacdev commented 1 year ago

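Hello, I read the paper, but it does not say whether MiniGPT-4 was evaluated with the Vicuna-7B or the Vicuna-13B model. Also, do the evaluation results include an accuracy comparison for recognizing the content of a single image? I am a programmer and did not fully follow the paper. My main use case is generating text descriptions for images so that they can be searched, and I want to find out which approach works best.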

wacdev commented 1 year ago

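This image

I ran a test, and MiniGPT-4 + 13B seems to do somewhat better at generating a detailed description of a single image.

The mPLUG-Owl result is shown in the image below:

image

The MiniGPT-4 + 13B result is shown below: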

Generate a very detailed description for this image, description start with "This image shows "

< The image is of a young woman standing in a room with pink lingerie and high heels. She has long black hair, a slim body and a pouty lip. She is wearing a pink bralette with matching panties and high heels. The woman is posing with her hands on her hips and a seductive look on her face. The room is dimly lit with a black and white checkerboard floor and a dark ceiling. There are no other people or objects in the room. The woman looks confident and beautiful in her outfit, ready to take on the world.

Describe this image in a simple sentence, description start with "The image depicts "

< a young woman wearing pink lingerie and high heels in a dimly lit room.

Tag this image, the tags are separated by commas

< Asian, young woman, lingerie, high heels, dimly lit room, confident.

1113.2297451496124 s/iter

YiyangZhou commented 1 year ago

(1) The MiniGPT-4 used in the evaluation in our paper is the Vicuna-13B model. (2) The cases we used for comparison (OwlEval) and each model's evaluation scores (A-D) will be uploaded to our GitHub later. They include many cases that describe and identify the content of a single image. (3) We use manual scoring as the evaluation method, following the criteria adopted in the "self-instruct" paper (their GitHub: https://github.com/yizhongw/self-instruct), as mentioned in the OwlEval section on page 6 of our paper.
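As a minimal sketch of how manual A-D scores like these could be tallied per model once they are published: the function name `tally_ratings`, the model names, and the ratings below are all made-up placeholders, not OwlEval data or the authors' tooling.

```python
from collections import Counter

GRADES = ("A", "B", "C", "D")


def tally_ratings(ratings):
    """Count how many responses of each model received each grade A-D.

    `ratings` is an iterable of (model_name, grade) pairs.
    """
    counts = {}
    for model, grade in ratings:
        if grade not in GRADES:
            raise ValueError(f"unknown grade: {grade!r}")
        counts.setdefault(model, Counter())[grade] += 1
    # Fill in zero counts so every model reports all four grades.
    return {m: {g: c[g] for g in GRADES} for m, c in counts.items()}


# Placeholder ratings for illustration only (not real evaluation results).
example = [("model_x", "A"), ("model_x", "B"), ("model_y", "A"), ("model_x", "A")]
for model, grades in tally_ratings(example).items():
    print(model, grades)
```

Comparing the share of A and B grades per model is one simple way to read such a table, although the paper's own presentation may differ.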

wacdev commented 1 year ago

mPLUG-Owl论文机器翻译版.pdf Sharing a machine-translated version of the paper for anyone who finds this thread through search.

wacdev commented 1 year ago

@YiyangZhou

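Look: mPLUG-Owl's responses to these two prompts are almost the same: Generate a very detailed description for this image, description start with "This image shows " Describe this image in a simple sentence, description start with "The image depicts "

MiniGPT-4, by contrast, knows to give one detailed answer and one brief answer. I think this comes down to the difference in the underlying large language model. Could mPLUG-Owl release a 13B version?

https://stability.ai/blog/stablevicuna-open-source-rlhf-chatbot StableVicuna seems to perform a bit better than Vicuna-13B.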

YiyangZhou commented 1 year ago

It is true that MiniGPT-4 gives a more detailed description, but it also produces more hallucinations (such as the non-existent high heels, house, etc.), whereas mPLUG-Owl produces relatively few hallucinations for this picture. The richer description may come from a more powerful language base, which can also lead to more hallucinations if the visual side is not strong. Our larger language models, such as 13B, will be open-sourced in the future, so stay tuned!

wacdev commented 1 year ago
image

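@YiyangZhou Look closely at this image: mPLUG-Owl's output also mentions high heels, so that is a tie with MiniGPT-4.

MiniGPT-4 included a "dimly lit" description, which I think fits fairly well, although the setting should be a street rather than a room, so the plus and the minus cancel out.

mPLUG-Owl's second answer, "The image portrays a beautiful woman wearing a pink bikini and matching high heels, posing confidently with arms crossed on a white background.", says white background, which is clearly wrong.

So on visual recognition I would call it a draw (in my personal opinion).

But MiniGPT-4 understood the "detailed description" instruction, which counts as a plus.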

YiyangZhou commented 1 year ago

Or it may be because they used ViT-Giant, whereas we only used ViT-Large.

youyuanrsq commented 1 year ago

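I have recently been working on a personal project similar to your idea: generating a text description for every image in a personal photo album, while also identifying who the people in the album are. The rough plan is to first train a DreamBooth model on a few photos of a person, then generate a customized training set and use BLIP to produce image captions, then use the generated image-text pairs to fine-tune MiniGPT-4 or mPLUG-Owl, and finally use the fine-tuned model to generate text descriptions for all images in the album. Back to your scenario: for search, I think the more keywords the better, so MiniGPT-4 and mPLUG-Owl should work better than BLIP.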
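For the search use case discussed in this thread, a minimal sketch of turning comma-separated tags from a model into a searchable index might look like the following. The helper names (`parse_tags`, `build_index`, `search`), the file names, and the tag strings are hypothetical; the model call itself is left out, and the tag strings stand in for whatever the chosen model returns for a "tag this image" prompt.

```python
from collections import defaultdict


def parse_tags(model_output):
    """Split a comma-separated tag string into lowercase tags."""
    tags = set()
    for piece in model_output.split(","):
        tag = piece.strip().rstrip(".").strip().lower()
        if tag:
            tags.add(tag)
    return tags


def build_index(tag_outputs):
    """Build an inverted index: tag -> set of image paths carrying it."""
    index = defaultdict(set)
    for path, output in tag_outputs.items():
        for tag in parse_tags(output):
            index[tag].add(path)
    return index


def search(index, *query_tags):
    """Return the images that carry every query tag."""
    sets = [index.get(tag.lower(), set()) for tag in query_tags]
    return set.intersection(*sets) if sets else set()


# Hypothetical file names and tag strings, for illustration only.
outputs = {
    "album/001.jpg": "Beach, sunset, dog.",
    "album/002.jpg": "beach, city, night",
}
index = build_index(outputs)
print(sorted(search(index, "beach")))
print(sorted(search(index, "beach", "dog")))
```

An exact-match index like this rewards models that emit many accurate keywords, which matches the intuition above. Hallucinated tags would surface as false hits, so accuracy matters as much as tag count.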