Victorwz / VaLM

VaLM: Visually-augmented Language Modeling. ICLR 2023.
https://openreview.net/forum?id=8IN-qLkl215

Query regarding model capabilities #1


vishaal27 commented 2 years ago

Hey!

I just finished reading your paper -- amazing work and the results look awesome!

I had one query about the model's capabilities. As I understand it, at inference time you retrieve the most similar images from the cached index and attend over the image features as keys and values. I wanted to know whether the model could be repurposed for prompt-specific image captioning of a given image. For example, given an image of an elephant standing near a lake next to a tree, could the model be prompted with something like "Describe the background of the image" or "In the distance, we can see" to produce a caption that describes only the background (the lake and tree) rather than the foreground (the elephant)? Since the model is trained autoregressively, this seems feasible to me. Please let me know your thoughts!
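For concreteness, here is roughly how I picture the inference-time retrieval and visual attention you describe, as a minimal self-contained sketch. This is not the actual VaLM code: the shapes, the single-head attention, the residual fusion, and all names (`retrieve_images`, `visual_attention`, `image_cache`) are my own assumptions for illustration.

```python
# Toy sketch of retrieval-augmented visual attention at inference time:
# retrieve the top-k most similar cached image embeddings for a text
# query, then attend over them with image features as keys and values.
import torch
import torch.nn.functional as F

d_model = 512       # hypothetical shared text/image embedding width
cache_size = 10000  # hypothetical size of the cached image index
top_k = 4

# Stand-in for the cached image embedding index (random for the demo).
image_cache = F.normalize(torch.randn(cache_size, d_model), dim=-1)

def retrieve_images(text_query: torch.Tensor, k: int = top_k) -> torch.Tensor:
    """Return the k cached image embeddings most similar to the query."""
    query = F.normalize(text_query, dim=-1)   # (d_model,)
    scores = image_cache @ query              # cosine similarity, (cache_size,)
    top_idx = scores.topk(k).indices
    return image_cache[top_idx]               # (k, d_model)

def visual_attention(hidden: torch.Tensor, images: torch.Tensor) -> torch.Tensor:
    """Single-head attention with the text hidden states as queries and
    the retrieved image features as keys and values."""
    # hidden: (seq_len, d_model); images: (k, d_model)
    attn = F.softmax(hidden @ images.T / d_model ** 0.5, dim=-1)  # (seq_len, k)
    visual_context = attn @ images                                # (seq_len, d_model)
    # Hypothetical fusion: residual add; the paper's fusion layer may differ.
    return hidden + visual_context

# Toy usage: one decoding step over a prefix such as "In the distance, we can see".
prefix_hidden = torch.randn(7, d_model)          # stand-in LM hidden states
retrieved = retrieve_images(prefix_hidden[-1])   # query with the last token's state
fused = visual_attention(prefix_hidden, retrieved)
print(fused.shape)  # torch.Size([7, 512])
```

If this picture is right, prompt-specific captioning would amount to fixing the retrieved set to embeddings of the single given image and letting the text prompt steer the autoregressive decoding.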

Victorwz commented 2 years ago

Thanks for the great comments and ideas! We are currently working on adapting VaLM to vision-language tasks, especially image captioning and VQA. We will add more experimental results in a later version of VaLM. Thanks again for brainstorming with us!