How does the model output interleaved text and images given multimodal input?

eric-ai-lab / MiniGPT-5

Official implementation of paper "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens"

https://eric-ai-lab.github.io/minigpt-5.github.io/

Apache License 2.0


How does the model output interleaved text and images given multimodal input? #52

Open URRealHero opened 2 weeks ago

URRealHero commented 2 weeks ago

Does the model require further fine-tuning? I'm wondering why the playground uses a 'for' loop to generate a story.

URRealHero commented 2 weeks ago

What I mean is: how can it generate multiple images, each with its own caption? If I use a for loop, wouldn't the model generate similar scenes again and again? How does playground.py ensure that it generates a coherent story sequence? How can I get coherent multi-image generation?