TRI-ML / prismatic-vlms

A flexible and efficient codebase for training visually-conditioned language models (VLMs)
MIT License

Does it support input format of multiple images + text? #3

Closed swj0419 closed 4 months ago

swj0419 commented 4 months ago

Thanks for open-sourcing the codebase and replying to my previous issue promptly! I was wondering if the codebase supports an input format of multiple images + text (<text, image, text, image>)? If not, what adjustments would you suggest to accommodate that format? I am looking into the code, and it seems that this line concatenates the unimodal and multimodal features together?

siddk commented 4 months ago

We don't currently support multiple images, but you've identified the key line to change to accommodate that! Basically, you'd want to update the forward-pass logic to take in a list of pixel_values and define a separate list of offsets that tells you where to insert the image patches in the text sequence. Then you'd change the line you mentioned to insert the image patches at the right locations.
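As a rough illustration of the insertion step (not code from the repo; the function name and argument shapes are hypothetical), the idea is to splice each projected patch sequence into the text embedding sequence at its offset:

```python
import torch

def insert_image_patches(text_embeds, patch_embeds_list, offsets):
    """Interleave image patch embeddings into a text embedding sequence.

    text_embeds: (seq_len, dim) embeddings for the text tokens
    patch_embeds_list: list of (num_patches_i, dim) projected patch embeddings
    offsets: sorted token indices; image i is inserted *before* text token offsets[i]
    """
    pieces, prev = [], 0
    for patches, off in zip(patch_embeds_list, offsets):
        pieces.append(text_embeds[prev:off])   # text up to this image
        pieces.append(patches)                 # then the image's patches
        prev = off
    pieces.append(text_embeds[prev:])          # trailing text after the last image
    return torch.cat(pieces, dim=0)
```

In a real forward pass you'd apply this per example (sequence lengths then differ across the batch, so padding/attention masks need to account for the inserted patches as well).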

You will probably also want to define a different Dataset class for the multiple image/text format (see here), as well as update the default batch collation logic.
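A minimal sketch of what the updated batch collation might look like (the dict keys and function name are assumptions, not the repo's actual API): text is padded as usual, while pixel_values and offsets stay as per-example lists, since each example can contain a different number of images.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def multi_image_collate(batch, pad_token_id=0):
    """Collate examples of the form:
    {"input_ids": LongTensor, "pixel_values": [image tensors], "offsets": [ints]}
    """
    # Pad text to the longest sequence in the batch
    input_ids = pad_sequence(
        [ex["input_ids"] for ex in batch],
        batch_first=True, padding_value=pad_token_id,
    )
    attention_mask = pad_sequence(
        [torch.ones_like(ex["input_ids"]) for ex in batch],
        batch_first=True, padding_value=0,
    )
    # Images and offsets remain ragged (variable count per example)
    pixel_values = [ex["pixel_values"] for ex in batch]
    offsets = [ex["offsets"] for ex in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask,
            "pixel_values": pixel_values, "offsets": offsets}
```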

swj0419 commented 4 months ago

Thank you so much!!