DLYuanGod / TinyGPT-V

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
BSD 3-Clause "New" or "Revised" License

Phi-2 checkpoint in the readme does not fully initialize the Phi-2 model #10

Open VovaTch opened 9 months ago

VovaTch commented 9 months ago

Hi, I'm trying to run the demo.py script for the stage-3 model. I've followed your instructions: replaced the modeling_phi.py file in transformers/models/ with the one in this repo, downloaded the linked Phi-2 repo, placed all its files in the weights/phi-2/ folder, and replaced the mentioned path in the three files with weights/phi-2/. I get the following warning, and the model doesn't perform well when running the demo:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Initializing Chat
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Some weights of PhiForCausalLM were not initialized from the model checkpoint at weights/phi-2/ and are newly initialized: ['model.layers.30.self_attn.k_layernorm.bias', 'model.layers.19.post_layernorm.weight', 'model.layers.30.post_layernorm.weight', 'model.layers.15.self_attn.q_layernorm.weight', 'model.layers.13.self_attn.k_layernorm.bias', 'model.layers.10.self_attn.k_layernorm.bias', 'model.layers.2.self_attn.k_layernorm.bias', 'model.layers.5.post_layernorm.weight', 'model.layers.20.self_attn.q_layernorm.weight', 'model.layers.1.self_attn.q_layernorm.weight', 'model.layers.1.self_attn.q_layernorm.bias', 'model.layers.29.self_attn.k_layernorm.bias', 'model.layers.7.self_attn.k_layernorm.bias', 'model.layers.5.self_attn.k_layernorm.bias', 'model.layers.21.self_attn.k_layernorm.weight', 'model.layers.24.post_layernorm.weight', 'model.layers.6.self_attn.k_layernorm.weight', 'model.layers.2.self_attn.q_layernorm.bias', 'model.layers.9.post_layernorm.weight', 'model.layers.6.self_attn.k_layernorm.bias', 'model.layers.22.self_attn.q_layernorm.bias', 'model.layers.11.post_layernorm.weight', 'model.layers.18.post_layernorm.weight', 'model.layers.25.self_attn.k_layernorm.bias', 'model.layers.19.self_attn.k_layernorm.bias', 'model.layers.29.post_layernorm.weight', 'model.layers.23.self_attn.k_layernorm.weight', 'model.layers.26.self_attn.k_layernorm.weight', 'model.layers.0.post_layernorm.weight', 'model.layers.28.self_attn.k_layernorm.weight', 'model.layers.2.self_attn.q_layernorm.weight', 'model.layers.18.self_attn.q_layernorm.bias', 'model.layers.22.self_attn.q_layernorm.weight', 'model.layers.28.self_attn.k_layernorm.bias', 'model.layers.16.self_attn.k_layernorm.bias', 'model.layers.12.self_attn.q_layernorm.weight', 'model.layers.30.self_attn.q_layernorm.bias', 'model.layers.4.self_attn.q_layernorm.weight', 'model.layers.10.self_attn.k_layernorm.weight', 'model.layers.23.self_attn.q_layernorm.bias', 'model.layers.10.post_layernorm.weight', 'model.layers.11.self_attn.k_layernorm.bias', 'model.layers.8.post_layernorm.weight', 'model.layers.15.self_attn.k_layernorm.bias', 'model.layers.7.self_attn.k_layernorm.weight', 'model.layers.13.self_attn.k_layernorm.weight', 'model.layers.29.self_attn.q_layernorm.weight', 'model.layers.0.self_attn.q_layernorm.bias', 'model.layers.4.self_attn.k_layernorm.bias', 'model.layers.23.post_layernorm.weight', 'model.layers.7.self_attn.q_layernorm.weight', 'model.layers.8.self_attn.q_layernorm.bias', 'model.layers.3.post_layernorm.weight', 'model.layers.12.post_layernorm.weight', 'model.layers.28.self_attn.q_layernorm.weight', 'model.layers.28.self_attn.q_layernorm.bias', 'model.layers.31.post_layernorm.weight', 'model.layers.3.self_attn.q_layernorm.bias', 'model.layers.3.self_attn.k_layernorm.weight', 'model.layers.16.self_attn.q_layernorm.bias', 'model.layers.9.self_attn.k_layernorm.weight', 'model.layers.0.self_attn.q_layernorm.weight', 'model.layers.27.self_attn.q_layernorm.weight', 'model.layers.10.self_attn.q_layernorm.weight', 'model.layers.16.self_attn.k_layernorm.weight', 'model.layers.9.self_attn.k_layernorm.bias', 'model.layers.11.self_attn.k_layernorm.weight', 'model.layers.17.self_attn.k_layernorm.bias', 'model.layers.3.self_attn.q_layernorm.weight', 'model.layers.18.self_attn.k_layernorm.bias', 'model.layers.20.self_attn.q_layernorm.bias', 'model.layers.9.self_attn.q_layernorm.bias', 'model.layers.25.self_attn.q_layernorm.weight', 'model.layers.0.self_attn.k_layernorm.bias', 'model.layers.16.post_layernorm.weight', 
'model.layers.14.self_attn.k_layernorm.bias', 'model.layers.25.self_attn.q_layernorm.bias', 'model.layers.27.self_attn.q_layernorm.bias', 'model.layers.13.self_attn.q_layernorm.weight', 'model.layers.6.self_attn.q_layernorm.bias', 'model.layers.12.self_attn.q_layernorm.bias', 'model.layers.4.self_attn.k_layernorm.weight', 'model.layers.27.post_layernorm.weight', 'model.layers.17.self_attn.q_layernorm.weight', 'model.layers.31.self_attn.k_layernorm.bias', 'model.layers.19.self_attn.q_layernorm.bias', 'model.layers.1.post_layernorm.weight', 'model.layers.24.self_attn.k_layernorm.weight', 'model.layers.31.self_attn.k_layernorm.weight', 'model.layers.12.self_attn.k_layernorm.bias', 'model.layers.30.self_attn.k_layernorm.weight', 'model.layers.27.self_attn.k_layernorm.bias', 'model.layers.31.self_attn.q_layernorm.weight', 'model.layers.22.self_attn.k_layernorm.bias', 'model.layers.21.self_attn.q_layernorm.weight', 'model.layers.15.self_attn.k_layernorm.weight', 'model.layers.8.self_attn.q_layernorm.weight', 'model.layers.26.self_attn.q_layernorm.weight', 'model.layers.17.post_layernorm.weight', 'model.layers.24.self_attn.q_layernorm.weight', 'model.layers.13.post_layernorm.weight', 'model.layers.3.self_attn.k_layernorm.bias', 'model.layers.26.post_layernorm.weight', 'model.layers.1.self_attn.k_layernorm.weight', 'model.layers.0.self_attn.k_layernorm.weight', 'model.layers.18.self_attn.q_layernorm.weight', 'model.layers.4.self_attn.q_layernorm.bias', 'model.layers.23.self_attn.q_layernorm.weight', 'model.layers.20.post_layernorm.weight', 'model.layers.22.post_layernorm.weight', 'model.layers.17.self_attn.q_layernorm.bias', 'model.layers.22.self_attn.k_layernorm.weight', 'model.layers.24.self_attn.q_layernorm.bias', 'model.layers.20.self_attn.k_layernorm.bias', 'model.layers.10.self_attn.q_layernorm.bias', 'model.layers.21.self_attn.q_layernorm.bias', 'model.layers.11.self_attn.q_layernorm.weight', 'model.layers.26.self_attn.k_layernorm.bias', 'model.layers.21.post_layernorm.weight', 'model.layers.4.post_layernorm.weight', 'model.layers.5.self_attn.q_layernorm.bias', 'model.layers.8.self_attn.k_layernorm.weight', 'model.layers.5.self_attn.q_layernorm.weight', 'model.layers.6.post_layernorm.weight', 'model.layers.15.self_attn.q_layernorm.bias', 'model.layers.23.self_attn.k_layernorm.bias', 'model.layers.2.self_attn.k_layernorm.weight', 'model.layers.27.self_attn.k_layernorm.weight', 'model.layers.15.post_layernorm.weight', 'model.layers.28.post_layernorm.weight', 'model.layers.21.self_attn.k_layernorm.bias', 'model.layers.7.post_layernorm.weight', 'model.layers.1.self_attn.k_layernorm.bias', 'model.layers.20.self_attn.k_layernorm.weight', 'model.layers.25.self_attn.k_layernorm.weight', 'model.layers.19.self_attn.q_layernorm.weight', 'model.layers.29.self_attn.k_layernorm.weight', 'model.layers.9.self_attn.q_layernorm.weight', 'model.layers.26.self_attn.q_layernorm.bias', 'model.layers.24.self_attn.k_layernorm.bias', 'model.layers.16.self_attn.q_layernorm.weight', 'model.layers.14.self_attn.q_layernorm.weight', 'model.layers.17.self_attn.k_layernorm.weight', 'model.layers.11.self_attn.q_layernorm.bias', 'model.layers.12.self_attn.k_layernorm.weight', 'model.layers.25.post_layernorm.weight', 'model.layers.31.self_attn.q_layernorm.bias', 'model.layers.14.post_layernorm.weight', 'model.layers.7.self_attn.q_layernorm.bias', 'model.layers.18.self_attn.k_layernorm.weight', 'model.layers.6.self_attn.q_layernorm.weight', 'model.layers.14.self_attn.q_layernorm.bias', 
'model.layers.29.self_attn.q_layernorm.bias', 'model.layers.2.post_layernorm.weight', 'model.layers.8.self_attn.k_layernorm.bias', 'model.layers.30.self_attn.q_layernorm.weight', 'model.layers.5.self_attn.k_layernorm.weight', 'model.layers.19.self_attn.k_layernorm.weight', 'model.layers.14.self_attn.k_layernorm.weight', 'model.layers.13.self_attn.q_layernorm.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
trainable params: 31457280 || all params: 2811233280 || trainable%: 1.1189850455953623
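As an aside, the trainable-params line at the end looks consistent with LoRA at r=64 on a few projections per layer. A rough back-of-envelope check (which projections actually get adapters is my guess, not confirmed from the code):

# Assumes LoRA r=64 on three 2560-dim projections in each of Phi-2's 32 layers.
r, d, n_proj, n_layers = 64, 2560, 3, 32
trainable = 2 * r * d * n_proj * n_layers   # A (d x r) plus B (r x d) per projection
print(trainable)                            # 31457280, matching the log
print(100 * trainable / 2811233280)         # ~1.11899, matching the log's trainable%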

I'm running a corporate offshoot of Ubuntu 20.04, in case that's relevant.
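For concreteness, the replacement step I described above amounts to something like this (paths are from my setup, and the exact transformers subfolder may differ by version):

# Overwrite the installed transformers Phi implementation with the repo's patched
# modeling_phi.py, which (as I understand it) defines the q/k layernorm modules
# named in the warning above.
import os, shutil
import transformers

phi_dir = os.path.join(os.path.dirname(transformers.__file__), "models", "phi")
shutil.copy("TinyGPT-V/modeling_phi.py", os.path.join(phi_dir, "modeling_phi.py"))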

My minigpt_v2.yaml:

model:
  arch: minigpt_v2

  # vit encoder
  image_size: 448
  drop_path_rate: 0
  use_grad_checkpoint: False
  vit_precision: "fp16"
  freeze_vit: True

  # generation configs
  prompt: ""

  llama_model: "weights/phi-2/"
  lora_r: 64
  lora_alpha: 16

preprocess:
    vis_processor:
        train:
          name: "blip2_image_train"
          image_size: 448
        eval:
          name: "blip2_image_eval"
          image_size: 448
    text_processor:
        train:
          name: "blip_caption"
        eval:
          name: "blip_caption"

My minigpt4_vicuna0.yaml:

model:
  arch: minigpt4

  # vit encoder
  image_size: 224
  drop_path_rate: 0
  use_grad_checkpoint: False
  vit_precision: "fp16"
  freeze_vit: True
  freeze_qformer: True

  # Q-Former
  num_query_token: 32

  # generation configs
  prompt: ""

  llama_model: "weights/phi-2/"

preprocess:
    vis_processor:
        train:
          name: "blip2_image_train"
          image_size: 224
        eval:
          name: "blip2_image_eval"
          image_size: 224
    text_processor:
        train:
          name: "blip_caption"
        eval:
          name: "blip_caption"

My line 16 in conversation/conversation.py:

tokenizer = AutoTokenizer.from_pretrained("weights/phi-2/")

Maybe this is intended behavior? If so, please clarify.

DLYuanGod commented 9 months ago

Thank you for your interest in our work.

It is normal to receive this warning: the normalization-layer weights are stored in each stage's .pth checkpoint file, not in the Phi-2 checkpoint itself.
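In other words, when the stage checkpoint is loaded on top of the base Phi-2 weights, those "newly initialized" parameters are filled in. A minimal sketch (file and key names are illustrative):

import torch

# `model` is the TinyGPT-V model already built by the demo code.
ckpt = torch.load("TinyGPT-V_for_Stage3.pth", map_location="cpu")
state = ckpt.get("model", ckpt)                  # some checkpoints nest under "model"
msg = model.load_state_dict(state, strict=False)
print(msg.missing_keys)                          # the *_layernorm keys should be gone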

I'm not sure what is making the model perform poorly at inference; could you give me specific examples?

VovaTch commented 9 months ago

For a simple example, I have this image of my cat, Daisy, and the prompt "What are the colors of this cat?". I get a messy response (screenshot attached). In addition, I get these messages when running the model in my terminal:

Running on local URL:  http://127.0.0.1:7860

Could not create share link. Missing file: /home/tcv1tv/anaconda3/envs/tinygptv/lib/python3.9/site-packages/gradio/frpc_linux_amd64_v0.2. 

Please check your internet connection. This can happen if your antivirus software blocks the download of this file. You can install manually by following these steps: 

1. Download this file: https://cdn-media.huggingface.co/frpc-gradio-0.2/frpc_linux_amd64
2. Rename the downloaded file to: frpc_linux_amd64_v0.2
3. Move the file to this location: /home/tcv1tv/anaconda3/envs/tinygptv/lib/python3.9/site-packages/gradio
/home/tcv1tv/anaconda3/envs/tinygptv/lib/python3.9/site-packages/gradio/helpers.py:818: UserWarning: Using the update method is deprecated. Simply return a new object instead, e.g. `return gr.Textbox(...)` instead of `return gr.update(...)`.
  warnings.warn(
/home/tcv1tv/anaconda3/envs/tinygptv/lib/python3.9/site-packages/gradio/components/image.py:193: UserWarning: Using the update method is deprecated. Simply return a new object instead, e.g. `return gr.Image(...)` instead of `return gr.Image.update(...)`.
  warnings.warn(
/home/tcv1tv/anaconda3/envs/tinygptv/lib/python3.9/site-packages/gradio/components/textbox.py:163: UserWarning: Using the update method is deprecated. Simply return a new object instead, e.g. `return gr.Textbox(...)` instead of `return gr.Textbox.update(...)`.
  warnings.warn(
/home/tcv1tv/anaconda3/envs/tinygptv/lib/python3.9/site-packages/gradio/components/button.py:89: UserWarning: Using the update method is deprecated. Simply return a new object instead, e.g. `return gr.Button(...)` instead of `return gr.Button.update(...)`.
  warnings.warn(
/home/tcv1tv/anaconda3/envs/tinygptv/lib/python3.9/site-packages/transformers/generation/utils.py:1518: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

It's possible that my corporate firewall blocks downloads initiated from Python code. Will manually downloading the file from the warning help? Also, there is this message about setting attention masks and pad tokens... is this working as expected?
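For reference, in plain transformers that warning usually goes away when the mask and pad token are passed explicitly. A generic sketch (not this repo's actual code path):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("weights/phi-2/")
model = AutoModelForCausalLM.from_pretrained("weights/phi-2/")

inputs = tokenizer("A photo of a cat that is", return_tensors="pt")
out = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,   # silences the attention-mask warning
    pad_token_id=tokenizer.eos_token_id,    # Phi-2 has no dedicated pad token
    max_new_tokens=32,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))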

DLYuanGod commented 9 months ago

I think a more explicit instruction may be needed, because the language model here is Phi-2, which is just a base model with no instruction tuning.

You can ask it for descriptive statements, which it may be better at.

Thanks for your feedback; we will focus on addressing this in the next release! (example screenshot attached)

VovaTch commented 9 months ago

And what about the "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior" message? Is it a training-only warning, or am I missing something here?

DLYuanGod commented 9 months ago

This is just a normal warning that can be ignored, because we defined new terminator tokens during training. Thanks for your question!

lchen1019 commented 9 months ago

Is this normal? For some images nothing is output, while other images do produce output (screenshot attached).

lchen1019 commented 9 months ago

It seems that the sentence needs a '.' at the end (screenshot attached).

lchen1019 commented 9 months ago

Not only that, it likes to repeat one sentence over and over again (screenshot attached).