Hi, the model on the Huggingface Space is an advanced version of mPLUG-Owl that natively supports video input through a temporal-related module, rather than treating a video as multiple independent frames. Besides that, the 8-bit precision also has an impact on the results.
As for the hallucination issue, we are working on an improved version, since hallucination is a common problem of LLMs.
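Concretely, the precision difference comes down to how the weights are loaded. A minimal sketch of the two loading paths (the class and checkpoint names follow the released repo code; the exact arguments are assumptions for illustration):

```python
import torch
from mplug_owl.modeling_mplug_owl import MplugOwlForConditionalGeneration

ckpt = 'MAGAer13/mplug-owl-llama-7b'  # checkpoint name assumed for this sketch

# 16-bit loading, closer to what the hosted demo runs:
model_fp16 = MplugOwlForConditionalGeneration.from_pretrained(
    ckpt,
    torch_dtype=torch.bfloat16,  # weights kept at 16-bit precision
)

# 8-bit quantized loading (requires bitsandbytes); the weights are
# quantized, so generations can diverge from the 16-bit model:
model_int8 = MplugOwlForConditionalGeneration.from_pretrained(
    ckpt,
    load_in_8bit=True,
    device_map='auto',
)
```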
Dear @MAGAer13, thanks for the reply.
Is it possible to use the same model as the one on Huggingface, but locally? I guess this would require more GPU memory, right?
We will release the video version ASAP. The computation cost is comparable to the current version, since only a small fraction of additional parameters is added.
Thanks for your answers and time. Looking forward to it!
Hi. Thanks for this great work.
I've used the Huggingface demo to generate descriptions for some images. I also used the 8-bit model in Colab; the code I used to generate the descriptions was roughly as follows.
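(A minimal reconstruction along the lines of the repo's example usage; the checkpoint name, prompt wording, image path, and generation settings below are placeholders rather than my exact values.)

```python
import torch
from PIL import Image
from mplug_owl.modeling_mplug_owl import MplugOwlForConditionalGeneration
from mplug_owl.tokenization_mplug_owl import MplugOwlTokenizer
from mplug_owl.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor

pretrained_ckpt = 'MAGAer13/mplug-owl-llama-7b'

# Load the model in 8-bit so it fits in Colab GPU memory (needs bitsandbytes).
model = MplugOwlForConditionalGeneration.from_pretrained(
    pretrained_ckpt,
    load_in_8bit=True,
    device_map='auto',
)
image_processor = MplugOwlImageProcessor.from_pretrained(pretrained_ckpt)
tokenizer = MplugOwlTokenizer.from_pretrained(pretrained_ckpt)
processor = MplugOwlProcessor(image_processor, tokenizer)

# Conversation-style prompt with an <image> placeholder.
prompts = [
"""The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: <image>
Human: Describe the image in detail.
AI: """]

images = [Image.open('painting.jpg')]  # placeholder image path

generate_kwargs = {
    'do_sample': True,
    'top_k': 5,
    'max_length': 512,
}

inputs = processor(text=prompts, images=images, return_tensors='pt')
# Cast float inputs to half precision to match the quantized model's compute dtype.
inputs = {k: v.half() if v.dtype == torch.float else v for k, v in inputs.items()}
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    res = model.generate(**inputs, **generate_kwargs)
print(tokenizer.decode(res.tolist()[0], skip_special_tokens=True))
```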
However, the results from the Huggingface demo are different from those of the locally run model. For example, Huggingface will describe an image as:
The painting depicts a woman with her arms outstretched and wearing a red dress, standing in front of a brightly colored background with a vibrant rainbow-like design. The woman's pose appears confident and dynamic, as if she is ready to embrace the colorful surroundings. There are several other objects in the scene, including a potted plant located on the left side of the painting, a handbag situated near the bottom right corner, and a cup placed towards the right side. Additionally, there is a bowl on a stand near her right foot and another bow on her left arm, adding to the artwork's vivid appearance.
But when I run the model on Colab, I obtain the following description for the same image:
The image is a painting featuring a colorful dog with a purple and green background. The dog's body is in the middle of the painting, while its head appears at the left side of the picture, slightly turned to the right. Its fur is a mix of purple, green, and brown, giving it a vibrant appearance. There are a few more dogs present in the background, but their focus is not as prominent as the main subject's. The background consists of various colors, including red, blue, yellow, orange, white, and purple, creating a visually engaging and lively composition. The overall painting has a cheerful and playful mood.
The second description is wrong, as there are no dogs in the image. I noticed that many descriptions generated when running the model on Colab are completely off the mark. Is there something that I am doing wrong? Could it be because the model is loaded differently?
I also noticed that even the Huggingface demo hallucinates and includes elements in the description that are not present in the image. In the first description above, for instance, there are no handbags, cups, or bowls. Similarly, given the image of a statue, it will describe the statue as surrounded by admiring people when there are no people or crowds in the image whatsoever.
Is there a way to control the hallucinations? And why are the results so different when I run the model in different environments (Huggingface vs. Colab)?
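One thing I wondered about: the example generation settings use sampling (`do_sample=True`, `top_k=5`), so could part of the difference just be randomness? What I mean by deterministic decoding, as a sketch reusing the `model`, `inputs`, and `tokenizer` objects from the code above:

```python
import torch

torch.manual_seed(0)  # fix the RNG so repeated runs match

deterministic_kwargs = {
    'do_sample': False,  # greedy decoding instead of top-k sampling
    'num_beams': 1,
    'max_length': 512,
}
with torch.no_grad():
    res = model.generate(**inputs, **deterministic_kwargs)
print(tokenizer.decode(res.tolist()[0], skip_special_tokens=True))
```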
I apologize for the long post. Any help is greatly appreciated. Thank you!