X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
https://www.modelscope.cn/studios/damo/mPLUG-Owl
MIT License

different results between Huggingface and colab #60

Closed. bakachan19 closed this issue 1 year ago.

bakachan19 commented 1 year ago

Hi. Thanks for this great work.

I've used the Huggingface demo to generate descriptions for some images with the following prompt:

Describe this image as detailed as possible.

I also used the 8-bit model in Colab. This is the code that I used to generate the descriptions:

import torch
from PIL import Image
import requests
from mplug_owl.modeling_mplug_owl import MplugOwlForConditionalGeneration
from mplug_owl.tokenization_mplug_owl import MplugOwlTokenizer
from mplug_owl.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor

pretrained_ckpt = 'MAGAer13/mplug-owl-llama-7b'
model = MplugOwlForConditionalGeneration.from_pretrained(
    pretrained_ckpt,
    load_in_8bit=True,
    torch_dtype=torch.half,
    device_map='auto'
)
image_processor = MplugOwlImageProcessor.from_pretrained(pretrained_ckpt)
tokenizer = MplugOwlTokenizer.from_pretrained(pretrained_ckpt)
processor = MplugOwlProcessor(image_processor, tokenizer)

prompts = [
'''The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: <image>
Human: Describe this image as detailed as possible.
AI: ''']

image_list = ["/path_to_image"]

generate_kwargs = {
    'do_sample': True,
    'top_k': 5,
    'max_length': 512
}

images = [Image.open(_) for _ in image_list]
inputs = processor(text=prompts, images=images, return_tensors='pt')
inputs = {k: v.bfloat16() if v.dtype == torch.float else v for k, v in inputs.items()}
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    res = model.generate(**inputs, **generate_kwargs)
sentence = tokenizer.decode(res.tolist()[0], skip_special_tokens=True)
print(sentence)

However, the results from the Hugging Face demo are different from those of the locally run model. For example, the Hugging Face demo describes one image as:

The painting depicts a woman with her arms outstretched and wearing a red dress, standing in front of a brightly colored background with a vibrant rainbow-like design. The woman's pose appears confident and dynamic, as if she is ready to embrace the colorful surroundings. There are several other objects in the scene, including a potted plant located on the left side of the painting, a handbag situated near the bottom right corner, and a cup placed towards the right side. Additionally, there is a bowl on a stand near her right foot and another bow on her left arm, adding to the artwork' s vivid appearance.

But when I run the model on Colab, I obtain the following description for the same image:

The image is a painting featuring a colorful dog with a purple and green background. The dog's body is in the middle of the painting, while its head appears at the left side of the picture, slightly turned to the right. Its fur is a mix of purple, green, and brown, giving it a vibrant appearance. There are a few more dogs present in the background, but their focus is not as prominent as the main subject's. The background consists of various colors, including red, blue, yellow, orange, white, and purple, creating a visually engaging and lively composition. The overall painting has a cheerful and playful mood.

The second description is wrong, as there are no dogs in the image. I noticed that many of the descriptions generated when running the model on Colab are completely unrelated to the image. Is there something I am doing wrong? Could it be because the model is loaded differently?
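
To check how the model actually ends up loaded in Colab, continuing from the snippet above, something like the following sanity check could be printed (just a sketch; the is_loaded_in_8bit and hf_device_map attributes may not exist in every transformers version, hence the getattr guards):

# Sanity check: how did the model actually get loaded?
# getattr guards because these attributes may be missing in some transformers versions.
print('weight dtype:   ', model.dtype)
print('loaded in 8-bit:', getattr(model, 'is_loaded_in_8bit', 'attribute not available'))
print('device map:     ', getattr(model, 'hf_device_map', 'attribute not available'))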

I also noticed that even when using the Hugging Face demo, the model hallucinates and includes elements in the description that are not present in the image. For example, in the first description above, there are no handbags, cups, or bowls in the image. Similarly, given the image of a statue, the model describes the statue as surrounded by admiring people when there are no people or crowds in the image whatsoever.

Is there a way to control the hallucinations? And why are the results so different when I run the model in different environments (the Hugging Face demo vs. Colab)?
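
One variable I can at least remove on my side is sampling randomness: the generate_kwargs above use do_sample=True with top_k=5, so two runs will not produce identical text even in the same environment. A greedy-decoding variant for comparison could look like this (just a sketch continuing from the snippet above; I do not know which decoding settings the Hugging Face demo uses, and greedy decoding is not a fix for hallucination):

# Greedy decoding: deterministic output for a given model and input,
# which makes comparisons across environments easier to interpret.
deterministic_kwargs = {
    'do_sample': False,
    'max_length': 512
}
with torch.no_grad():
    res = model.generate(**inputs, **deterministic_kwargs)
print(tokenizer.decode(res.tolist()[0], skip_special_tokens=True))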

I apologize for the long post. Any help is greatly appreciated. Thank you!

MAGAer13 commented 1 year ago

Hi, the model on the Hugging Face space is an advanced version of mPLUG-Owl that natively supports video input through a temporal module, instead of treating a video as multiple independent frames. Besides, the 8-bit precision also has an impact on the results.

As for the hallucination issue, we are working on an improved version, since hallucination is a common problem for LLMs.
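
If your GPU has enough memory, you can also isolate the effect of the 8-bit quantization by loading the same checkpoint without load_in_8bit, roughly like this (a sketch only; note that this still will not reproduce the advanced model running on the space):

import torch
from mplug_owl.modeling_mplug_owl import MplugOwlForConditionalGeneration

# Same checkpoint, fp16 weights, no bitsandbytes 8-bit quantization.
# This only removes the quantization variable; it is not the video-capable
# model served by the Hugging Face space.
model_fp16 = MplugOwlForConditionalGeneration.from_pretrained(
    'MAGAer13/mplug-owl-llama-7b',
    torch_dtype=torch.half,
    device_map='auto'
)
model_fp16.eval()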

bakachan19 commented 1 year ago

Dear @MAGAer13, thanks for the reply.

Is it possible to use the same model as the one on Hugging Face, but locally? I guess this would require more GPU memory, right?

MAGAer13 commented 1 year ago

We will release the video version ASAP. The computation cost is comparable to the current version, since only a small fraction of additional parameters is added.

bakachan19 commented 1 year ago

Thanks for your answers and time. Looking forward to it!