UCSC-VLAA / Recap-DataComp-1B

This is the official repository of our paper "What If We Recaption Billions of Web Images with LLaMA-3?"
https://www.haqtu.me/Recap-Datacomp-1B/

The descriptions start with "The image depicts/features" instead of going directly to the object #5

Closed: TonyLianLong closed this issue 4 months ago

TonyLianLong commented 4 months ago

Thanks for the great work! I'm trying out your prompt with a LLaVA Hugging Face Space. However, rather than getting directly to the point ("A [object name] ...", as is often the case in your Fig. 1), the model's outputs often start with "The image ...". Is there anything I missed when prompting?

[Attached screenshot: model output beginning with "The image ..."]

LinB203 commented 4 months ago

Same question.

ImKeTT commented 4 months ago

Which conversation template are you using? We employed the LLaMA-3 conversation template for our LLaMA-3-powered LLaVA.
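
For reference, here is a minimal sketch of querying a LLaMA-3-based LLaVA checkpoint with the LLaMA-3 conversation template via Hugging Face transformers. The model id and image path are placeholders, not the official names:

```python
# Minimal sketch: prompting a LLaMA-3-based LLaVA checkpoint with the
# LLaMA-3 conversation template. The model id below is a placeholder,
# not the official checkpoint name.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "your-org/llava-llama-3-recap"  # hypothetical id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# LLaMA-3 chat format: each turn is wrapped in header tokens and closed
# with <|eot_id|>; <image> marks where the processor inserts vision tokens.
prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "<image>\nPlease describe the image in detail.<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

image = Image.open("example.jpg")
inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

The `<|start_header_id|>` / `<|eot_id|>` markers are the standard LLaMA-3 chat tokens; if the checkpoint ships a chat template, `apply_chat_template` should build an equivalent string.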

cihangxie commented 4 months ago

I do not think this relates to the prompt; rather, I believe the behavior may partially stem from the fact that we also fine-tune LLaVA with the HQ-Edit dataset, which helps avoid "The image ..."-style openers. You can try our recaption model (with your own prompt) here
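
If the opener still appears, a rough post-processing workaround (not part of this repository, just a sketch) is to strip the boilerplate prefix from generated captions:

```python
# Hypothetical workaround: strip a leading "The image depicts/features ..."
# framing so the caption starts at the subject.
import re

_PREFIX = re.compile(
    r"^\s*(this|the)\s+image\s+(depicts|features|shows|displays|presents)\s+",
    flags=re.IGNORECASE,
)

def strip_image_prefix(caption: str) -> str:
    """Remove the boilerplate opener and re-capitalize the first word."""
    stripped = _PREFIX.sub("", caption, count=1)
    if stripped == caption:
        return caption  # no opener found; leave the caption unchanged
    return stripped[:1].upper() + stripped[1:]

print(strip_image_prefix("The image depicts a red bicycle by a wall."))
# -> "A red bicycle by a wall."
```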

cihangxie commented 4 months ago

I'll close this issue for now, but feel free to reopen it if you still run into this behavior with our recaption model.