csuhan / OneLLM

[CVPR 2024] OneLLM: One Framework to Align All Modalities with Language

Model not producing accurate captions #27

Open imartinf opened 3 months ago

imartinf commented 3 months ago

Hi! I have been having some trouble getting the repo and models working. Specifically, I tried to run the evaluation scripts (COCO captioning) as reported in the README, using the checkpoint available on the Hugging Face Hub (https://huggingface.co/csuhan/OneLLM-7B). I'm using an A500 24GB GPU for inference.

The CIDEr score I get is 0.02, far lower than expected given that the model was trained on MS COCO data. The captions are inaccurate and lack variability (I pasted some examples below); moreover, they consistently describe the images as black and white. I double-checked that the images are downloaded properly, and I used the code as-is after only adapting the paths. Is the checkpoint ready to use and suitable for finetuning on additional tasks? Is there any step missing from the repo docs that I should be doing?

Please feel free to request additional information about my setup that might be relevant to the problem.

Thanks!
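For reference, a results file in the format pasted below is typically scored with the standard pycocoevalcap toolkit. This is only a minimal sketch: the annotation/result paths are placeholders, and OneLLM's own eval script may differ.

```python
# Minimal CIDEr scoring sketch using pycocotools / pycocoevalcap.
# Paths are placeholders; adapt them to where your COCO annotations
# and the generated caption file actually live.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO("annotations/captions_val2014.json")       # ground-truth captions
results = coco.loadRes("onellm_coco_captions.json")    # list of {"image_id", "caption"}

evaluator = COCOEvalCap(coco, results)
evaluator.params["image_id"] = results.getImgIds()     # score only images we captioned
evaluator.evaluate()

print("CIDEr:", evaluator.eval["CIDEr"])
```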

    {
        "image_id": 184613,
        "caption": "A close up of a black and white photo of a cat."
    },
    {
        "image_id": 403013,
        "caption": "A black and white photo of a long object."
    },
    {
        "image_id": 562150,
        "caption": "A black and white photo of a long object."
    },
    {
        "image_id": 360772,
        "caption": "A black and white photo of a long thin object."
    },
    {
        "image_id": 340559,
        "caption": "A black and white photo of a long object."
    },
    {
        "image_id": 321107,
        "caption": "A black and white photo of a black object."
    },
    {
        "image_id": 129001,
        "caption": "A black and white photo of a long object."
    },
    {
        "image_id": 556616,
        "caption": "A black and white photo of a long object."
    },
    {
        "image_id": 472621,
        "caption": "A black and white photo of a blurry object."
    },
    {
        "image_id": 364521,
        "caption": "A black and white photo of a black and white object."
    },
    {
        "image_id": 310391,
        "caption": "A black and white photo of a blank screen."
    },
weiqingxin913 commented 3 months ago

I also encountered the same problem; the difference is that I evaluated on IMU data, and the evaluation results were not good. Looking forward to the author's reply.

GitJacobFrye commented 3 months ago

I encountered the same problem when I used the mm_multi_turn.py demo to launch a Gradio interface. I tested the model with several different pictures and prompts, such as "What is this picture about?", but it always replied with the same answer, stating that the picture is black and nothing can be found in the image. I'm sure I downloaded the correct weight files for OneLLM, because when I ran the demo scripts they showed that all parameters were loaded correctly. Sad. Looking forward to the author's reply.

vakadanaveen commented 2 months ago

I am also facing the same issue. The output is always something like this: "The image features a close-up view of a black object, possibly a piece of machinery or a tool, with a distinctive square shape. The object appears to be made of metal, and its design suggests it could be a part of a larger assembly or system. The object's black color and square shape make it stand out against a white background, which is visible in the image."

csuhan commented 2 months ago

Hi @vakadanaveen @imartinf @weiqingxin913 @GitJacobFrye, this bug is caused by an update to the open_clip_torch package. Please pin open_clip_torch==2.23.0 in your environment.
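For example, a quick way to check and pin the version (a sketch; it assumes open_clip exposes `__version__`, which recent releases do):

```python
# Check which open_clip_torch version is installed; per this thread,
# OneLLM needs 2.23.0 rather than a newer release such as 2.26.x.
import open_clip

print(open_clip.__version__)
# If this prints a newer version, downgrade with:
#   pip install open_clip_torch==2.23.0
```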

Refer to an email from Dr Stephen Hausler:

What I did was download the Docker image for your OneLLM Hugging Face Space and run it locally. This worked; I then inspected the packages installed via pip and worked through them one by one, updating our broken environment by pip-installing the same package version as in the Docker image.

In doing so, I found that the package causing the error was open-clip-torch. We had open-clip-torch 2.26.1, while the Docker image had 2.23.0, so my conclusion is that OneLLM requires open-clip-torch==2.23.0. We tested both the image and audio modalities (neither of which was working for us before), and both now produce the correct output text for the input. This fix also allows us to use https://github.com/csuhan/OneLLM with the official model https://huggingface.co/csuhan/OneLLM-7B.
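The same diagnosis can be reproduced mechanically by diffing `pip freeze` output from the working Docker image against the broken environment; a rough sketch (the file names are hypothetical):

```python
# Compare two `pip freeze` dumps and print packages whose versions differ.
# freeze_docker.txt / freeze_local.txt are hypothetical names for dumps taken
# inside the HF Space Docker image and in the local (broken) environment.
def load_freeze(path):
    pkgs = {}
    with open(path) as f:
        for line in f:
            if "==" in line:
                name, version = line.strip().split("==", 1)
                pkgs[name.lower()] = version
    return pkgs

docker = load_freeze("freeze_docker.txt")
local = load_freeze("freeze_local.txt")

for name in sorted(docker.keys() & local.keys()):
    if docker[name] != local[name]:
        print(f"{name}: docker={docker[name]}  local={local[name]}")
```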

vakadanaveen commented 2 months ago

Hi @csuhan. Thanks for the reply. After installing open-clip-torch==2.23.0, the model is giving accurate captions.