Closed ff1Zzd closed 1 year ago
Maybe try MAGAer13/mplug-owl-llama-7b-video. The HF version is the video version, but the performance is similar. For more stable support, it is recommended to use MAGAer13/mplug-owl-llama-7b.
Hi, thanks for your kind reply. May I know how I should initialise the model and processors if I would like to use the video-version weights but still use an image as the input (instead of a video)?
Currently I modified the path of pretrained_ckpt and I imported MplugOwlForConditionalGeneration from mplug_owl_video.modeling_mplug_owl, which is the video version. Everything else follows strictly the code in the Run Model with Huggingface Style section.
However, I am getting an empty string as the output of the decoder. Can anyone kindly point out which step I am missing? Thanks for the help. Here is the code I am using.
from mplug_owl_video.modeling_mplug_owl import MplugOwlForConditionalGeneration
from transformers import AutoTokenizer
from mplug_owl.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor
import torch
pretrained_ckpt = 'MAGAer13/mplug-owl-llama-7b-video'
model = MplugOwlForConditionalGeneration.from_pretrained(
pretrained_ckpt,
torch_dtype=torch.bfloat16,
)
image_processor = MplugOwlImageProcessor.from_pretrained(pretrained_ckpt)
tokenizer = AutoTokenizer.from_pretrained(pretrained_ckpt)
processor = MplugOwlProcessor(image_processor, tokenizer)
prompts = [
'''The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: <image>
Human: <PROMPTS>
AI: ''']
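(Note on the template above: the `<PROMPTS>` token is a placeholder that still needs to be substituted with an actual question before the prompt is encoded. A minimal pure-Python sketch of that substitution, where the question text is a made-up example:)

```python
# Conversation template matching the format above; <PROMPTS> is a
# placeholder the caller swaps for the actual question.
template = (
    "The following is a conversation between a curious human and AI assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
    "Human: <image>\n"
    "Human: <PROMPTS>\n"
    "AI: "
)

def fill_prompt(template: str, question: str) -> str:
    # str.replace swaps every occurrence of the placeholder for the question.
    return template.replace("<PROMPTS>", question)

prompt = fill_prompt(template, "Describe the image.")
assert "<PROMPTS>" not in prompt
assert "Describe the image." in prompt
```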
image_list = ['20230613164509_image_1.jpg']
import time
generate_kwargs = {
'do_sample': False,
'top_k': 1,
'max_length': 512,
'temperature': 0.1,
'top_p': 0.9,
'length_penalty': 1,
'num_beams': 1,
'no_repeat_ngram_size': 2
}
from PIL import Image
images = [Image.open(_) for _ in image_list]
inputs = processor(text=prompts, images=images, return_tensors='pt')
inputs = {k: v.bfloat16() if v.dtype == torch.float else v for k, v in inputs.items()}
inputs = {k: v.to(model.device) for k, v in inputs.items()}
start = time.time()
with torch.no_grad():
res = model.generate(**inputs, **generate_kwargs)
sentence = tokenizer.decode(res.tolist()[0], skip_special_tokens=True)
print(time.time() - start)
print(sentence)
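(Two things to double-check in the snippet above. First, the processors are imported from mplug_owl.processing_mplug_owl, the image variant, while the model comes from mplug_owl_video; whether the two variants' processors are interchangeable is not guaranteed, so importing both from the same package is safer. Second, if generate returns the full sequence rather than only the continuation, the generated text is recovered by slicing past the prompt length before decoding. A minimal sketch of that slicing with plain token-ID lists; the IDs are made up for illustration:)

```python
# Hypothetical token IDs standing in for inputs['input_ids'][0] and the
# row returned by model.generate(); a real run would use tensors.
prompt_ids = [1, 523, 9399, 29958]           # tokens of the prompt
full_output = prompt_ids + [450, 1967, 2]    # prompt + generated tokens

# Keep only the continuation: everything past the prompt length.
generated_only = full_output[len(prompt_ids):]
print(generated_only)  # [450, 1967, 2]
```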
Yes.
Thanks for your kind reply. Right now I am exploring the possibility of leveraging your work for an image classification task, and I am observing a significant difference between the output generated by the model run on my local machine and the output generated by the HF demo.
For instance, for the same image and the same text prompt, the output from the model on my local machine is "Reject the image. There is a blurred area, which indicates that the image has been manipulated or altered in some way.", while the output from the HF demo is "Pass. No irregularities or occlusions are detected in the image", which is a different label (pass) compared to the other output (reject).
The model run on my machine uses the MAGAer13/mplug-owl-llama-7b checkpoint, and I initialised the model as shown in the image attached above.
May I know how I can fully replicate the performance on HF, since the performances are quite different in my case?
See #101
Hi, I am currently following the instructions of Run Model with Huggingface Style on GitHub. However, I am observing different performance between my local model and the one on HF. May I know which checkpoint I should use to achieve similar performance to HF? Is it MAGAer13/mplug-owl-llama-7b? Also, I would like to clarify that I do not need to do any image preprocessing, right, since there is an image_processor already? I have attached how I initialise the model here. Thanks for the amazing work and your kind help.
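(On the preprocessing question: yes, the image_processor handles resizing and pixel normalization, so no manual preprocessing is needed beyond loading the image with PIL. As a rough illustration of the kind of normalization such a processor performs internally, here is a pure-Python sketch; the mean/std constants below are the standard CLIP values and are an assumption about this checkpoint, so the checkpoint's own preprocessor config remains authoritative:)

```python
# CLIP-style normalization: scale 0-255 pixel values to [0, 1], then
# subtract the per-channel mean and divide by the per-channel std.
# These constants are the standard CLIP defaults, assumed here purely
# for illustration.
MEAN = (0.48145466, 0.4578275, 0.40821073)
STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(value: int, channel: int) -> float:
    """Normalize one 0-255 pixel value for the given RGB channel index."""
    scaled = value / 255.0
    return (scaled - MEAN[channel]) / STD[channel]

# A mid-gray pixel lands close to zero after normalization.
print(round(normalize_pixel(128, 0), 3))
```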