Victorwz / VaLM

VaLM: Visually-augmented Language Modeling. ICLR 2023.
https://openreview.net/forum?id=8IN-qLkl215

Query regarding model capabilities #1


vishaal27 commented 2 years ago

Hey!

I just finished reading your paper -- amazing work and the results look awesome!

I had one query about the model's capabilities. As I understand it, at inference time you retrieve the most similar images from the cached index and attend over the image features as keys and values. I wanted to know whether the model could be repurposed for prompt-specific image captioning of a given image. For example, given an image of an elephant standing near a lake next to a tree, could the model be prompted with something like "Describe the background of the image" or "In the distance, we can see" to produce a caption that describes only the background (the lake and tree) rather than the foreground (the elephant)? Since the model is trained autoregressively, this seems feasible to me. Please let me know your thoughts!
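For concreteness, here is roughly how I picture the inference-time retrieval and visual attention you describe, as a minimal self-contained sketch. This is not the actual VaLM code: the shapes, the single-head attention, the residual fusion, and all names (`retrieve_images`, `visual_attention`, `image_cache`) are my own assumptions for illustration.

```python
# Toy sketch of retrieval-augmented visual attention at inference time:
# retrieve the top-k most similar cached image embeddings for a text
# query, then attend over them with image features as keys and values.
import torch
import torch.nn.functional as F

d_model = 512       # hypothetical shared text/image embedding width
cache_size = 10000  # hypothetical size of the cached image index
top_k = 4

# Stand-in for the cached image embedding index (random for the demo).
image_cache = F.normalize(torch.randn(cache_size, d_model), dim=-1)

def retrieve_images(text_query: torch.Tensor, k: int = top_k) -> torch.Tensor:
    """Return the k cached image embeddings most similar to the query."""
    query = F.normalize(text_query, dim=-1)   # (d_model,)
    scores = image_cache @ query              # cosine similarity, (cache_size,)
    top_idx = scores.topk(k).indices
    return image_cache[top_idx]               # (k, d_model)

def visual_attention(hidden: torch.Tensor, images: torch.Tensor) -> torch.Tensor:
    """Single-head attention with the text hidden states as queries and
    the retrieved image features as keys and values."""
    # hidden: (seq_len, d_model); images: (k, d_model)
    attn = F.softmax(hidden @ images.T / d_model ** 0.5, dim=-1)  # (seq_len, k)
    visual_context = attn @ images                                # (seq_len, d_model)
    # Hypothetical fusion: residual add; the paper's fusion layer may differ.
    return hidden + visual_context

# Toy usage: one decoding step over a prefix such as "In the distance, we can see".
prefix_hidden = torch.randn(7, d_model)          # stand-in LM hidden states
retrieved = retrieve_images(prefix_hidden[-1])   # query with the last token's state
fused = visual_attention(prefix_hidden, retrieved)
print(fused.shape)  # torch.Size([7, 512])
```

If this picture is right, prompt-specific captioning would amount to fixing the retrieved set to embeddings of the single given image and letting the text prompt steer the autoregressive decoding.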

Victorwz commented 2 years ago

Thanks for the great comments and ideas! We are currently working on adapting VaLM to vision-language tasks, especially image captioning and VQA. We will add more experimental results in a later version of VaLM. Thanks again for brainstorming with us!