BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0

Catastrophic forgetting #42

Closed: Gary2018X closed this issue 2 months ago

Gary2018X commented 2 months ago

I used LoRA to fine-tune on my own dataset, but the fine-tuned model only replies with the content it was trained on and no longer knows other common-sense content, while Bunny-v1_0-2B-zh handles it fine. Do you have any training tricks?

[screenshot: my model] [screenshot: Bunny-v1_0-2B-zh]

RussRobin commented 2 months ago

Hi @Gary2018X ,

Great thanks for your interest in Bunny!

Basically, when finetuning Bunny on a dataset with a large domain gap from Bunny_pretrain_laion_2m (the pretrain set) and Bunny_695k (the finetune set), you can try the following:

  1. Pretrain and LoRA-finetune Bunny on Bunny_pretrain_laion_2m and Bunny_695k.
  2. Merge the LoRA weights into the frozen LLM.
  3. Add a new LoRA module and finetune it on your custom dataset.

Currently, some modifications to the code in this repo are needed if you want to add a new LoRA and finetune it. We plan to add such a training pipeline to the main branch, or post it under this issue. Stay tuned! A rough sketch of the merge-then-new-LoRA idea follows below.
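To make these steps concrete, here is a minimal sketch of the merge-then-new-LoRA pattern using plain Hugging Face PEFT rather than the Bunny scripts. The paths and target_modules are placeholders, and Bunny's own pipeline (script/merge_lora_weights.py and script/train/finetune_lora.sh) additionally handles the vision tower and mm_projector, which this sketch omits.

from transformers import AutoModelForCausalLM
from peft import PeftModel, LoraConfig, get_peft_model

# Steps 1-2: load the frozen base LLM, apply the first (already trained) LoRA,
# and merge it into the base weights.
base = AutoModelForCausalLM.from_pretrained("/path/to/base_llm_model")
merged = PeftModel.from_pretrained(base, "/path/to/bunny_lora_weights").merge_and_unload()
merged.save_pretrained("/path/to/merged_model")

# Step 3: attach a fresh LoRA adapter on top of the merged weights; only this
# adapter (plus, in Bunny, the projector) is trained on the custom dataset.
new_lora = LoraConfig(r=128, lora_alpha=256, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder modules
                      task_type="CAUSAL_LM")
model = get_peft_model(merged, new_lora)
model.print_trainable_parameters()  # only the new LoRA parameters are trainable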

Feel free to comment on this issue if you have further questions or would like to share your inspiring ideas about it. Thank you again for your question!

Regards,
Russell
BAAI

Gary2018X commented 2 months ago

Thank you very much for your reply!

Regards,
Gary

RussRobin commented 2 months ago

If you want to finetune Bunny-v1_0-2B-zh by adding a new LoRA to the merged Bunny-v1_0-2B-zh (only the new LoRA and the projector are trainable, and there are two LoRAs in total), you can follow these steps:

  1. Download the LoRA weights.
  2. Merge the LoRA with the LLM:
    python script/merge_lora_weights.py \
    --model-path /path/to/bunny_lora_weights \
    --model-base /path/to/base_llm_model \
    --model-type qwen1.5-1.8b \
    --save-model-path /path/to/merged_model
  3. In script/train/finetune_lora.sh, change model_name_or_path to /path/to/merged_model.
  4. We load mm_projector from the merged weights, so delete --pretrain_mm_mlp_adapter in script/train/finetune_lora.sh.
  5. You may customize the learning rate to fit your dataset (an illustrative excerpt of the edited script follows below).
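For illustration, the relevant part of script/train/finetune_lora.sh after steps 3-5 would look roughly like this (paths are placeholders; the remaining flags stay as in the stock script):

MODEL_TYPE=qwen1.5-1.8b
deepspeed bunny/train/train.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./script/deepspeed/zero3.json \
    --model_name_or_path /path/to/merged_model \
    --model_type $MODEL_TYPE \
    --version bunny \
    --data_path /path/to/your_data.json \
    --image_folder /path/to/your_images \
    --vision_tower /path/to/siglip-so400m-patch14-384 \
    --mm_projector_type mlp2x_gelu \
    --learning_rate 2e-4 \
    --output_dir ./checkpoints-$MODEL_TYPE/bunny-lora-custom
# Note that --pretrain_mm_mlp_adapter is deliberately absent (step 4), since the
# projector is loaded from the merged weights, and --learning_rate is the knob
# to tune for your dataset (step 5).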

It's expected to see a lot of warnings like: Some weights of the model checkpoint were not used when initializing BunnyQwenForCausalLM: [ model.vision_tower... ]. Ignore them: we load the vision tower from the downloaded --vision_tower rather than from the merged weights.

Just keep in mind that two LoRAs aren't guaranteed to work in your case; we don't have sufficient experimental data to support this claim.

Please comment on this issue if you have problems implementing this, or if you would like to share your thoughts.

Regards,
Russell
BAAI

Gary2018X commented 2 months ago

Thank you very much for your professional answer. There is no problem with the training process, but there was a problem when I merged the models:


  File "/root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/script/merge_lora_weights.py", line 26, in <module>
    merge_lora(args)
  File "/root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/script/merge_lora_weights.py", line 10, in merge_lora
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
  File "/root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/bunny/model/builder.py", line 53, in load_pretrained_model
    model = BunnyQwenForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained,
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3531, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3958, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 812, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([151936, 2048]) in "weight" (which has shape torch.Size([151646, 2048])), this look incorrect.

Regards
Gary

RussRobin commented 2 months ago

Hi @Gary2018X ,

You may want to share your scripts for merging and your model configs so we can help you debug.

Currently, Qwen has some bugs around vocab size. From our experience:

  1. After training Qwen, vocab_size is 151936, as shown in config.json.
  2. After merging the LoRA, vocab_size is 151646 (this is weird...).
  3. If we train the merged model with an additional LoRA, it works.
  4. If we then try to merge the second LoRA into it, an error is thrown: ValueError: Trying to set a tensor of shape torch.Size([151646, 2560]) in "weight" (which has shape torch.Size([151936, 2560])), this look incorrect.

I have double-checked our uploaded LoRA weights; they have vocab_size 151936, so errors are not expected when merging the first LoRA. However, the error I get when merging the second LoRA is different from yours. Could you please share more details with us?

Regards

Gary2018X commented 2 months ago

Sorry, I didn't express my question clearly: this error occurred with the second LoRA. I had already completed training on my own dataset after merging bunny-qwen1.5-1.8b-siglip-lora with the LLM. Here is my training script:

#!/bin/bash

MODEL_TYPE=qwen1.5-1.8b

PRETRAIN_DIR=bunny-$MODEL_TYPE-pretrain
OUTPUT_DIR=bunny-lora-juzao-base-$MODEL_TYPE

mkdir -p ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR

deepspeed bunny/train/train.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./script/deepspeed/zero3.json \
    --model_name_or_path /root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/base_model \
    --model_type $MODEL_TYPE \
    --version bunny \
    --data_path /root/siton-glusterfs-eaxtsxdfs/xts/data/s_v5/Bunny.json \
    --image_folder /root/siton-glusterfs-eaxtsxdfs/xts/data/s_v5/image \
    --vision_tower /root/siton-glusterfs-eaxtsxdfs/xts/models/siglip-so400m-patch14-384 \
    --mm_projector_type mlp2x_gelu \
    --image_aspect_ratio pad \
    --group_by_modality_length False \
    --bf16 True \
    --output_dir ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to none | tee 2>&1 ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR/log.txt

RussRobin commented 2 months ago

Sorry for the delay; we were working hard to reproduce this error and find out the reasons behind it.

Quick answer: your first merging script and finetune_lora.sh were fine, but the second merging script should be:

python script/merge_lora_weights.py \
    --model-path ./checkpoints-qwen1.5-1.8b/bunny-lora-juzao-base-qwen1.5-1.8b \
    --model-base /root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/base_model \
    --model-type qwen1.5-1.8b \
    --save-model-path ./juzao_model_base

In the second merge, you are trying to merge a new LoRA into the previously merged LLM+LoRA, so --model-base should be set to where you saved the merged LLM+LoRA (the model_name_or_path used in finetune_lora.sh).

Why all of this happens is explained here: it comes down to padding in the tokenizer. If you encounter similar errors in the future, please check vocab_size in the config.json in your --output_dir.
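For reference, a quick way to run that check (a sketch, assuming a standard Hugging Face output directory; the path is a placeholder) is:

import json
from transformers import AutoTokenizer

# Compare the vocab size recorded in config.json with the tokenizer length;
# a difference such as 151936 vs 151646 is the padding mismatch discussed above.
with open("/path/to/output_dir/config.json") as f:
    print("config.json vocab_size:", json.load(f)["vocab_size"])
tokenizer = AutoTokenizer.from_pretrained("/path/to/output_dir")
print("tokenizer length:", len(tokenizer))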

Reach out to us if you still have difficulty using Bunny in your project!

Regards

Gary2018X commented 2 months ago

Thank you very much for taking the time to answer my question. I have successfully merged the models. Unfortunately, the model output is not good.

Gary2018X commented 2 months ago

Could this be related to my inference code? I use num_beams=1, temperature=0.1, max_new_tokens=300:

import torch  # tokenizer, model, prompt and the sampling parameters are defined elsewhere

text_input = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n{prompt} ASSISTANT:"
text_chunks = [tokenizer(chunk).input_ids for chunk in text_input.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + text_chunks[1], dtype=torch.long).unsqueeze(0)
output_ids = model.generate(
    input_ids,
    max_new_tokens=max_new_tokens,
    temperature=temperature,
    num_beams=num_beams,
    do_sample=True,
    use_cache=True)[0]
llm_message = tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()

Regards

RussRobin commented 2 months ago

Training data and parameter settings vary widely, so I'm not entirely certain how much assistance I can offer in this case. Here are some basic insights:

  1. For Q1, the model doesn't stop generation correctly, so you may want to check the eos_token (see the sketch after this list).
  2. Try loading the model in the CLI and see if it works well there.
  3. Check the learning rate and the loss curve (you can plot it from log.txt).
  4. Double-check that the training dataset is correctly configured for MLLM training. In many cases, introducing new datasets in finetuning makes the model underperform, so choose your datasets wisely.
  5. Try asking some of the specific instructions from your domain-specific dataset.
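For point 1, a minimal check (a sketch; model, tokenizer and input_ids are assumed to be set up as in your inference snippet, and eos_token_id/pad_token_id are standard generate arguments) would be:

# Confirm the tokenizer actually has an EOS token the model can emit.
print("eos_token:", tokenizer.eos_token, "id:", tokenizer.eos_token_id)

# Pass the EOS id explicitly so generation can stop instead of running to max_new_tokens.
output_ids = model.generate(
    input_ids,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.1,
    num_beams=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,  # avoids a warning when no pad token is set
    use_cache=True)[0]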

Hope it helps!

Regards

RussRobin commented 2 months ago

I will close this issue since we have reached a consensus on the code.