eric-ai-lab / MiniGPT-5

Official implementation of paper "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens"
https://eric-ai-lab.github.io/minigpt-5.github.io/
Apache License 2.0

Poor generation results (with normal text and image outputs) #44

Closed TongLiu-github closed 6 months ago

TongLiu-github commented 6 months ago

Hi, your work will be highly cited.

When I run playground.py, it generates the following results.

[screenshots of the generated text/image outputs attached]

The only change I made is replacing `self.image_pipeline.to(self.device, PRECISION)` with `self.image_pipeline.to(self.device)`, since the original call raises: `TypeError: to() takes from 1 to 2 positional arguments but 3 were given`.
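For anyone who would rather keep the half-precision cast than drop it, a minimal sketch of a possible workaround is below. It assumes `image_pipeline` is a standard Stable Diffusion pipeline from diffusers (so it exposes `unet`, `vae`, and `text_encoder` as `torch.nn.Module`s) and that `PRECISION` is the dtype constant used in the repo; this is an untested suggestion for older diffusers releases whose `DiffusionPipeline.to()` only accepts a device, not the official fix.

```python
import torch

PRECISION = torch.float16  # assumed to match the repo's PRECISION constant

def move_pipeline(image_pipeline, device, dtype=PRECISION):
    """Move a diffusers pipeline to `device`, then cast its submodules to `dtype`.

    Works around older diffusers releases where DiffusionPipeline.to()
    only accepts a device argument and rejects a second dtype argument.
    """
    image_pipeline.to(device)
    # On a standard Stable Diffusion pipeline these are plain torch.nn.Modules,
    # so the usual nn.Module.to(dtype) call applies to each of them.
    for module in (image_pipeline.unet, image_pipeline.vae, image_pipeline.text_encoder):
        module.to(dtype)
    return image_pipeline
```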

The following is the log I get when running the code:

````
Seed set to 42
Loading VIT
Loading VIT Done
Loading Q-Former
Loading Q-Former Done
Loading LLAMA
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
Loading checkpoint shards: 100%|██████████| 2/2 [00:11<00:00, 5.95s/it]
Loading LLAMA Done
Load BLIP2-LLM Checkpoint: ./config/prerained_minigpt4_7b.pth
text_encoder/pytorch_model.fp16.safetensors not found
Fetching 16 files: 100%|██████████| 16/16 [00:00<00:00, 16648.19it/s]
/opt/conda/envs/minigpt5/lib/python3.9/site-packages/transformers/generation/utils.py:1270: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation )
  warnings.warn(
Generated text: [IMG0] [IMG7] [IMG0] ' [IMG0] their [IMG0] ihrer to me what thet, [IMG0] leur sont ihrr a [IMG0] eux them there that they have been [IMG0] ihnen who him est her his [IMG0] lui ihm he knew. sizing of themselves\, [IMG0] quien er sein hé wastheir loro manière in de seu. their étaient zijn làb _ [IMG0] Their'_s [IMG0] deren_they était à érthé ihre were on [IMG0] 的 ellos [IMG0] hers at it è your are an étant eren had died [IMG0] på [IMG0] whose ép eran cél son era thé års éeʁ� space sua∂._Picame one\? [IMG0] whom she is_er их essereт у pénêtrement [IMG0] erano haar [IMG0] when der ils där careerés le témère than étéιם from fait méa Théria! [IMG0] 452 [IMG0] -toème its own fé détéré vériti and eenérném [IMG0] their están being\ερן" about you [IMG0] -ed himself déjà< for many… They hade said élőрσи в theorem을.< One might be having arrived; [IMG0] estaba involved___ thing [IMG0] made the seus mědété que être un péner rég—himselfétait býlэ , perché celuiה들αre�
100%|██████████| 50/50 [00:02<00:00, 18.50it/s]
/home/VD/tong/mmreasoning/MiniGPT-5/./examples/playground.py:81: UserWarning: Glyph 30340 (\N{CJK UNIFIED IDEOGRAPH-7684}) missing from current font.
  plt.savefig(os.path.join(currentdir, f'test{i}.png'), bbox_inches='tight')
/home/VD/tong/mmreasoning/MiniGPT-5/./examples/playground.py:81: UserWarning: Glyph 51012 (\N{HANGUL SYLLABLE EUL}) missing from current font.
  plt.savefig(os.path.join(currentdir, f'test{i}.png'), bbox_inches='tight')
/home/VD/tong/mmreasoning/MiniGPT-5/./examples/playground.py:81: UserWarning: Glyph 46308 (\N{HANGUL SYLLABLE DEUL}) missing from current font.
  plt.savefig(os.path.join(currentdir, f'test{i}.png'), bbox_inches='tight')
Generated text: [IMG0] [IMG7] [IMG0] and [IMG0] de was [IMG0] were their seat, [IMG0] there in the bq [IMG0] your [IMG0] she était involved [IMG0] you for him zijnb q bý is it een [IMG0] hering qu's bar estaba his été one of lui [IMG0] deren he had been a là quelle étaient its sein\him.## there are many měre sich dessen? [IMG0] thater they would have è théirheimt.``` [IMG0] leuré eux to be héi meme themselves. [IMG0] those sont seu… loro erano în quel_, där_mε étant; being é méthèmeréd by Me [IMG0] Thea membre who stée être leurs car era 1 [IMG0] One Man [IMG0] There termed [IMG0] Their Party [IMG0] -à [IMG0] has [IMG0] Théเתה명�� from scene [IMG0] Étه knew about them was herselfhis name [IMG0] They'd himself в this moment célhey的 scène [IMG0] their hétself à elle его fame—they�� , which made hisэρ들יσе, [IMG0] на la ép y léς on elـe pén son — sua_; me��вן i원을 suo= [IMG0] That man' story told everyone but ihn,---\the air\θ [IMG0] whose family
100%|██████████| 50/50 [00:03<00:00, 16.30it/s]
...
````

Is there anything I could do to improve the results?

KzZheng commented 6 months ago

From your results, it seems you did not set IS_STAGE2=True as an environment variable. You can check issue #10 for more discussion.
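For reference, a minimal sketch of one way to set the flag is below, assuming the repo reads it from the process environment at import/startup time (the variable name `IS_STAGE2=True` is taken from the comment above). Exporting it in the shell before launching, e.g. `IS_STAGE2=True python examples/playground.py ...`, should be equivalent.

```python
import os

# Must be in the environment before the MiniGPT-5 modules that check it are
# imported, otherwise the stage-2 code path is never enabled.
os.environ["IS_STAGE2"] = "True"
```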

TongLiu-github commented 6 months ago

> From your results, it seems you did not set IS_STAGE2=True as an environment variable. You can check issue #10 for more discussion.

Whoops, thank you! Now everything is fine :)

TongLiu-github commented 6 months ago

Now the examples in playground.py work. One more question regarding generation, though: I find that quite often MiniGPT-5 does not generate the text and only generates the image. Is there anything I could do to improve this? e.g., Input: [attached image]

Results: [attached image]

where the text part is only `<unk>`.

KzZheng commented 6 months ago

Since our model is only finetuned on the VIST dataset with templated instructions, its generalizability is limited. How to improve the model's use of its pretrained commonsense knowledge is an interesting follow-up direction.