X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
https://www.modelscope.cn/studios/damo/mPLUG-Owl
MIT License

How can I replicate the results on Hugging Face (i.e. which ckpt shall I use) #99

Closed ff1Zzd closed 1 year ago

ff1Zzd commented 1 year ago

Hi, I am currently following the Run Model with Huggingface Style instructions on GitHub. However, I am observing different performance between my local model and the one on HF.

May I know which ckpt I should use to achieve similar performance to HF? Is it MAGAer13/mplug-owl-llama-7b? Also, I would like to confirm that I do not need to do any image preprocessing myself, since an image_processor is already provided.

I have attached how I initialise the model here. Thanks for the amazing work and your kind help.

[screenshot: model initialisation code]
vateye commented 1 year ago

Maybe try MAGAer13/mplug-owl-llama-7b-video. The HF version is the video version, but the performances are similar. For more stable support, it is recommended to use MAGAer13/mplug-owl-llama-7b.
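
For reference, here is a minimal sketch of loading the recommended image checkpoint, following the module and class names used in the repo's Run Model with Huggingface Style instructions (the same names appear in the code later in this thread):

```python
# Minimal sketch: load the image (non-video) checkpoint recommended above.
import torch
from transformers import AutoTokenizer
from mplug_owl.modeling_mplug_owl import MplugOwlForConditionalGeneration
from mplug_owl.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor

pretrained_ckpt = 'MAGAer13/mplug-owl-llama-7b'
model = MplugOwlForConditionalGeneration.from_pretrained(
    pretrained_ckpt,
    torch_dtype=torch.bfloat16,
)
image_processor = MplugOwlImageProcessor.from_pretrained(pretrained_ckpt)
tokenizer = AutoTokenizer.from_pretrained(pretrained_ckpt)
processor = MplugOwlProcessor(image_processor, tokenizer)
```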

ff1Zzd commented 1 year ago

> Maybe try MAGAer13/mplug-owl-llama-7b-video. The HF version is the video version, but the performances are similar. For more stable support, it is recommended to use MAGAer13/mplug-owl-llama-7b.

Hi, thanks for your kind reply. May I know how I should initialise the model and processors if I would like to use the video-version weights but still use an image as the input (instead of a video)?

Currently I modified the path of pretrained_ckpt and imported MplugOwlForConditionalGeneration from mplug_owl_video.modeling_mplug_owl, which is the video version. Everything else follows the code in Run Model with Huggingface Style exactly.

However, I am getting an empty string as the output of the decoder. Can anyone kindly point out which step I am missing? Thanks for the help. Here is the code I am using.

```python
import time

import torch
from PIL import Image
from transformers import AutoTokenizer

# Model class comes from the *video* pipeline...
from mplug_owl_video.modeling_mplug_owl import MplugOwlForConditionalGeneration
# ...while the processors come from the *image* pipeline.
from mplug_owl.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor

pretrained_ckpt = 'MAGAer13/mplug-owl-llama-7b-video'
model = MplugOwlForConditionalGeneration.from_pretrained(
    pretrained_ckpt,
    torch_dtype=torch.bfloat16,
)
image_processor = MplugOwlImageProcessor.from_pretrained(pretrained_ckpt)
tokenizer = AutoTokenizer.from_pretrained(pretrained_ckpt)
processor = MplugOwlProcessor(image_processor, tokenizer)

prompts = [
'''The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: <image>
Human: <PROMPTS>
AI: ''']

image_list = ['20230613164509_image_1.jpg']

generate_kwargs = {
    'do_sample': False,  # greedy decoding: temperature/top_k/top_p below are ignored
    'top_k': 1,
    'max_length': 512,
    'temperature': 0.1,
    'top_p': 0.9,
    'length_penalty': 1,
    'num_beams': 1,
    'no_repeat_ngram_size': 2,
}

images = [Image.open(path) for path in image_list]
inputs = processor(text=prompts, images=images, return_tensors='pt')
# Cast float tensors to bfloat16 to match the model weights, then move everything to the model's device.
inputs = {k: v.bfloat16() if v.dtype == torch.float else v for k, v in inputs.items()}
inputs = {k: v.to(model.device) for k, v in inputs.items()}

start = time.time()
with torch.no_grad():
    res = model.generate(**inputs, **generate_kwargs)
sentence = tokenizer.decode(res.tolist()[0], skip_special_tokens=True)
print(time.time() - start)
print(sentence)
```
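
One thing worth checking in the snippet above (an observation about the imports, not a confirmed fix for the empty output): the model class is imported from mplug_owl_video while the processors are imported from mplug_owl, the image pipeline. If the video package ships its own processing module with matching class names (an assumption about the repo layout), importing everything from the same package would keep the prompt tokenisation and pixel inputs consistent with what the video model expects:

```python
# Hedged sketch: take the model AND the processors from the same (video) package.
# Assumes mplug_owl_video mirrors mplug_owl's module layout and class names.
from mplug_owl_video.modeling_mplug_owl import MplugOwlForConditionalGeneration
from mplug_owl_video.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor
```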
MAGAer13 commented 1 year ago

Yes.

ff1Zzd commented 1 year ago

> Yes.

Thanks for your kind reply. Right now I am exploring the possibility of leveraging your work for an image classification task, and I am observing a significant difference between the output generated by the model running on my local machine and the output generated by the HF demo.

For instance, for the same image and the same text prompt, the output from the model on my local machine is "Reject the image. There is a blurred area, which indicates that the image has been manipulated or altered in some way.", while the output from the HF demo is "Pass. No irregularities or occlusions are detected in the image", which is a different label (pass) compared to the other output (reject).

The model running on my machine uses the MAGAer13/mplug-owl-llama-7b checkpoint, and I initialised the model as shown in the image attached above.

May I know how I can fully replicate the performance of the HF demo, since the outputs are quite different in my case?
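
One variable worth ruling out (a general note about Hugging Face generate semantics, not a statement about what the demo actually does): with do_sample=False and num_beams=1, the decoding in the code above is greedy and deterministic, so the temperature, top_k, and top_p entries in generate_kwargs have no effect. If the demo samples instead, the same model can produce different answers from run to run, which alone can flip a pass/reject label:

```python
# Hedged sketch: enable sampling so temperature/top_k/top_p actually take effect.
# Whether the HF demo uses these exact settings is an assumption.
generate_kwargs = {
    'do_sample': True,
    'top_k': 5,
    'temperature': 0.1,
    'top_p': 0.9,
    'max_length': 512,
}
```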

MAGAer13 commented 1 year ago

See #101