BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0

Catastrophic forgetting #42

Closed: Gary2018X closed this issue 2 months ago

Gary2018X commented 2 months ago

I used LoRA to fine-tune on my own dataset, but the fine-tuned model only replies with the content it was trained on and no longer knows other common-sense content, while Bunny-v1_0-2B-zh handles it fine. Do you have any training tricks?

[screenshot: my model] [screenshot: Bunny-v1_0-2B-zh]

RussRobin commented 2 months ago

Hi @Gary2018X ,

Great thanks for your interest in Bunny!

Basically, when finetuning Bunny on a dataset with a large domain gap from Bunny_pretrain_laion_2m (the pretrain set) and Bunny_695k (the finetune set), you can try the following:

  1. Pretrain and LoRA-finetune Bunny on Bunny_pretrain_laion_2m and Bunny_695k.
  2. Merge the LoRA weights into the frozen LLM.
  3. Add a new LoRA module and finetune it on your custom dataset.

Currently, some modifications to the code in this repo are needed if you want to add a new LoRA and finetune it. We plan to add such a training pipeline to the main branch, or post it under this issue. Stay tuned! A rough sketch of the merge-then-new-LoRA idea follows below.
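To make these steps concrete, here is a minimal sketch of the merge-then-new-LoRA pattern using plain Hugging Face PEFT rather than the Bunny scripts. The paths and target_modules are placeholders, and Bunny's own pipeline (script/merge_lora_weights.py and script/train/finetune_lora.sh) additionally handles the vision tower and mm_projector, which this sketch omits.

from transformers import AutoModelForCausalLM
from peft import PeftModel, LoraConfig, get_peft_model

# Steps 1-2: load the frozen base LLM, apply the first (already trained) LoRA,
# and merge it into the base weights.
base = AutoModelForCausalLM.from_pretrained("/path/to/base_llm_model")
merged = PeftModel.from_pretrained(base, "/path/to/bunny_lora_weights").merge_and_unload()
merged.save_pretrained("/path/to/merged_model")

# Step 3: attach a fresh LoRA adapter on top of the merged weights; only this
# adapter (plus, in Bunny, the projector) is trained on the custom dataset.
new_lora = LoraConfig(r=128, lora_alpha=256, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder modules
                      task_type="CAUSAL_LM")
model = get_peft_model(merged, new_lora)
model.print_trainable_parameters()  # only the new LoRA parameters are trainable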

Feel free to comment on this issue if you have further questions or would like to share your inspiring ideas about it. Thank you again for your question!

Regards,
Russell
BAAI

Gary2018X commented 2 months ago

Thank you very much for your reply!

Regards,
Gary

RussRobin commented 2 months ago

If you want to finetune Bunny-v1_0-2B-zh by adding a new LoRA to the merged Bunny-v1_0-2B-zh (only the new LoRA and the projector are trainable, and there are two LoRAs in total), you can follow these steps:

  1. Download the LoRA weights.
  2. Merge the LoRA with the LLM:
    python script/merge_lora_weights.py \
    --model-path /path/to/bunny_lora_weights \
    --model-base /path/to/base_llm_model \
    --model-type qwen1.5-1.8b \
    --save-model-path /path/to/merged_model
  3. In script/train/finetune_lora.sh, change model_name_or_path to /path/to/merged_model.
  4. We load mm_projector from the merged weights, so delete --pretrain_mm_mlp_adapter in script/train/finetune_lora.sh.
  5. You may customize the learning rate to fit your dataset (an illustrative excerpt of the edited script follows below).
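For illustration, the relevant part of script/train/finetune_lora.sh after steps 3-5 would look roughly like this (paths are placeholders; the remaining flags stay as in the stock script):

MODEL_TYPE=qwen1.5-1.8b
deepspeed bunny/train/train.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./script/deepspeed/zero3.json \
    --model_name_or_path /path/to/merged_model \
    --model_type $MODEL_TYPE \
    --version bunny \
    --data_path /path/to/your_data.json \
    --image_folder /path/to/your_images \
    --vision_tower /path/to/siglip-so400m-patch14-384 \
    --mm_projector_type mlp2x_gelu \
    --learning_rate 2e-4 \
    --output_dir ./checkpoints-$MODEL_TYPE/bunny-lora-custom
# Note that --pretrain_mm_mlp_adapter is deliberately absent (step 4), since the
# projector is loaded from the merged weights, and --learning_rate is the knob
# to tune for your dataset (step 5).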

It's expected to see a lot of warnings like: Some weights of the model checkpoint were not used when initializing BunnyQwenForCausalLM: [ model.vision_tower... ]. Ignore them: we load the vision tower from the downloaded --vision_tower rather than from the merged weights.

Just keep in mind that two LoRAs aren't guaranteed to work in your case; we don't have sufficient experimental data to support this claim.

Please comment on this issue if you have problems implementing this, or if you would like to share your thoughts.

Regards,
Russell
BAAI

Gary2018X commented 2 months ago

Thank you very much for your professional answer. There is no problem with the training process, but there was a problem when I merged the models:


  File "/root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/script/merge_lora_weights.py", line 26, in <module>
    merge_lora(args)
  File "/root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/script/merge_lora_weights.py", line 10, in merge_lora
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
  File "/root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/bunny/model/builder.py", line 53, in load_pretrained_model
    model = BunnyQwenForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained,
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3531, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3958, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 812, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([151936, 2048]) in "weight" (which has shape torch.Size([151646, 2048])), this look incorrect.

Regards
Gary

RussRobin commented 2 months ago

Hi @Gary2018X ,

You may want to share your scripts for merging and your model configs so we can help you debug.

Currently, Qwen has some bugs around vocab size. From our experience:

  1. After training Qwen, vocab_size is 151936, as shown in config.json.
  2. After merging the LoRA, vocab_size is 151646 (this is weird...).
  3. If we train the merged model with an additional LoRA, it works.
  4. If we then try to merge the second LoRA into it, an error is thrown: ValueError: Trying to set a tensor of shape torch.Size([151646, 2560]) in "weight" (which has shape torch.Size([151936, 2560])), this look incorrect.

I have double-checked our uploaded LoRA weights; they have vocab_size 151936, so errors are not expected when merging the first LoRA. However, the error I get when merging the second LoRA is different from yours. Could you please share more details with us?

Regards

Gary2018X commented 2 months ago

Sorry, I didn't express my question clearly: this error occurred with the second LoRA. I had already completed training on my own dataset after merging bunny-qwen1.5-1.8b-siglip-lora with the LLM. Here is my training script:

#!/bin/bash

MODEL_TYPE=qwen1.5-1.8b

PRETRAIN_DIR=bunny-$MODEL_TYPE-pretrain
OUTPUT_DIR=bunny-lora-juzao-base-$MODEL_TYPE

mkdir -p ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR

deepspeed bunny/train/train.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./script/deepspeed/zero3.json \
    --model_name_or_path /root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/base_model \
    --model_type $MODEL_TYPE \
    --version bunny \
    --data_path /root/siton-glusterfs-eaxtsxdfs/xts/data/s_v5/Bunny.json \
    --image_folder /root/siton-glusterfs-eaxtsxdfs/xts/data/s_v5/image \
    --vision_tower /root/siton-glusterfs-eaxtsxdfs/xts/models/siglip-so400m-patch14-384 \
    --mm_projector_type mlp2x_gelu \
    --image_aspect_ratio pad \
    --group_by_modality_length False \
    --bf16 True \
    --output_dir ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to none | tee 2>&1 ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR/log.txt

RussRobin commented 2 months ago

Sorry for the delay; we were working hard to reproduce this error and find out the reasons behind it.

Quick answer: your first merging script and finetune_lora.sh were fine, but the second merging script should be:

python script/merge_lora_weights.py \
    --model-path ./checkpoints-qwen1.5-1.8b/bunny-lora-juzao-base-qwen1.5-1.8b \
    --model-base /root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/base_model \
    --model-type qwen1.5-1.8b \
    --save-model-path ./juzao_model_base

In the second merge, you are trying to merge a new LoRA into the previously merged LLM+LoRA, so --model-base should be set to where you saved the merged LLM+LoRA (the model_name_or_path used in finetune_lora.sh).

Why all of this happens is explained here: it comes down to padding in the tokenizer. If you encounter similar errors in the future, please check vocab_size in the config.json in your --output_dir.
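For reference, a quick way to run that check (a sketch, assuming a standard Hugging Face output directory; the path is a placeholder) is:

import json
from transformers import AutoTokenizer

# Compare the vocab size recorded in config.json with the tokenizer length;
# a difference such as 151936 vs 151646 is the padding mismatch discussed above.
with open("/path/to/output_dir/config.json") as f:
    print("config.json vocab_size:", json.load(f)["vocab_size"])
tokenizer = AutoTokenizer.from_pretrained("/path/to/output_dir")
print("tokenizer length:", len(tokenizer))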

Reach out to us if you still have difficulty using Bunny in your project!

Regards

Gary2018X commented 2 months ago

Thank you very much for taking the time to answer my question. I have successfully merged the models. Unfortunately, the model output is not good.

Gary2018X commented 2 months ago

Could this be related to my inference code? I use num_beams=1, temperature=0.1, max_new_tokens=300:

import torch  # tokenizer, model, prompt and the sampling parameters are defined elsewhere

text_input = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n{prompt} ASSISTANT:"
text_chunks = [tokenizer(chunk).input_ids for chunk in text_input.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + text_chunks[1], dtype=torch.long).unsqueeze(0)
output_ids = model.generate(
    input_ids,
    max_new_tokens=max_new_tokens,
    temperature=temperature,
    num_beams=num_beams,
    do_sample=True,
    use_cache=True)[0]
llm_message = tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()

Regards

RussRobin commented 2 months ago

Training data and parameter settings vary widely, so I'm not entirely certain how much assistance I can offer in this case. Here are some basic insights:

  1. For Q1, the model doesn't stop generation correctly, so you may want to check the eos_token (see the sketch after this list).
  2. Try loading the model in the CLI and see if it works well there.
  3. Check the learning rate and the loss curve (you can plot it from log.txt).
  4. Double-check that the training dataset is correctly configured for MLLM training. In many cases, introducing new datasets in finetuning makes the model underperform, so choose your datasets wisely.
  5. Try asking some of the specific instructions from your domain-specific dataset.
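For point 1, a minimal check (a sketch; model, tokenizer and input_ids are assumed to be set up as in your inference snippet, and eos_token_id/pad_token_id are standard generate arguments) would be:

# Confirm the tokenizer actually has an EOS token the model can emit.
print("eos_token:", tokenizer.eos_token, "id:", tokenizer.eos_token_id)

# Pass the EOS id explicitly so generation can stop instead of running to max_new_tokens.
output_ids = model.generate(
    input_ids,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.1,
    num_beams=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,  # avoids a warning when no pad token is set
    use_cache=True)[0]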

Hope it helps!

Regards

RussRobin commented 2 months ago

I will close this issue since we have reached a consensus on the code.