Closed Gary2018X closed 2 months ago
Hi @Gary2018X ,
Great thanks for your interest in Bunny!
Basically, when finetuning Bunny on datasets with a large domain gap from Bunny_pretrain_laion_2m (pretrain set) and Bunny_695k (finetune set), you can try:
Currently, some modifications to the code in this GitHub repo are needed if you want to add a new LoRA and finetune. We plan to add such a training pipeline to the main branch, or to post it in a reply under this issue. Stay tuned!
Feel free to comment on this issue if you have further questions or would like to share your inspiring ideas about it. Thank you again for your question!
Regards Russell BAAI
Thank you very much for your reply!
Regards Gary
If you want to finetune Bunny-v1_0-2B-zh by adding a new lora to merged Bunny-v1_0-2B-zh (only the new lora and projector are trainable, and there are two loras in total), you may follow:
python script/merge_lora_weights.py \
--model-path /path/to/bunny_lora_weights \
--model-base /path/to/base_llm_model \
--model-type qwen1.5-1.8b \
--save-model-path /path/to/merged_model
In script/train/finetune_lora.sh, change model_name_or_path to /path/to/merged_model
Remove --pretrain_mm_mlp_adapter from script/train/finetune_lora.sh
Expect a lot of warnings like: Some weights of the model checkpoint were not used when initializing BunnyQwenForCausalLM: [ model.vision_tower... ]. Ignore them: the vision tower is loaded from the downloaded --vision_tower, not from the weights saved in the merged model.
Keep in mind that stacking two LoRAs isn't guaranteed to work in your case; we don't have sufficient experimental data to show that it does.
Please comment on this issue if you have problems implementing this, or if you would like to share your thoughts.
Regards Russell BAAI
Thank you very much for your professional answer. There is no problem with the training process, but there was a problem when I merged the models:
File "/root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/script/merge_lora_weights.py", line 26, in <module>
merge_lora(args)
File "/root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/script/merge_lora_weights.py", line 10, in merge_lora
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
File "/root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/bunny/model/builder.py", line 53, in load_pretrained_model
model = BunnyQwenForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained,
File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3531, in from_pretrained
) = cls._load_pretrained_model(
File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3958, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 812, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([151936, 2048]) in "weight" (which has shape torch.Size([151646, 2048])), this look incorrect.
Regards
Gary
Hi @Gary2018X ,
You may want to share your scripts for merging and your model configs so we can help you debug.
Currently, Qwen has some bugs in vocab size. From our experience:
config.json
ValueError: Trying to set a tensor of shape torch.Size([151646, 2560]) in "weight" (which has shape torch.Size([151936, 2560])), this look incorrect.
I have double-checked our uploaded LoRA weights: they have vocab_size 151936, so errors are not expected when merging the first LoRA. However, the error I get when merging the second LoRA is different from yours. Could you please share more details with us?
Regards
Sorry, I didn't express my question clearly. This error occurred with the second LoRA. I completed the training on my own training set after merging bunny-qwen1.5-1.8b-siglip-lora with the LLM:
python script/merge_lora_weights.py \
--model-path /root/siton-glusterfs-eaxtsxdfs/xts/models/bunny-qwen1.5-1.8b-siglip-lora \
--model-base /root/siton-glusterfs-eaxtsxdfs/xts/models/Qwen1.5-1.8B \
--model-type qwen1.5-1.8b \
--save-model-path ./base_model
#!/bin/bash
MODEL_TYPE=qwen1.5-1.8b
PRETRAIN_DIR=bunny-$MODEL_TYPE-pretrain
OUTPUT_DIR=bunny-lora-juzao-base-$MODEL_TYPE
mkdir -p ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR
deepspeed bunny/train/train.py \
--lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
--deepspeed ./script/deepspeed/zero3.json \
--model_name_or_path /root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/base_model \
--model_type $MODEL_TYPE \
--version bunny \
--data_path /root/siton-glusterfs-eaxtsxdfs/xts/data/s_v5/Bunny.json \
--image_folder /root/siton-glusterfs-eaxtsxdfs/xts/data/s_v5/image \
--vision_tower /root/siton-glusterfs-eaxtsxdfs/xts/models/siglip-so400m-patch14-384 \
--mm_projector_type mlp2x_gelu \
--image_aspect_ratio pad \
--group_by_modality_length False \
--bf16 True \
--output_dir ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR \
--num_train_epochs 3 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to none | tee 2>&1 ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR/log.txt
python script/merge_lora_weights.py \
--model-path ./checkpoints-qwen1.5-1.8b/bunny-lora-juzao-base-qwen1.5-1.8b \
--model-base /root/siton-glusterfs-eaxtsxdfs/xts/models/Qwen1.5-1.8B \
--model-type qwen1.5-1.8b \
--save-model-path ./juzao_model_base
Sorry for the delay; we have been working hard to reproduce this error and find the reasons behind it.
Quick answer: your first merging script and finetune_lora.sh were fine, but the second merging script should be:
python script/merge_lora_weights.py \
--model-path ./checkpoints-qwen1.5-1.8b/bunny-lora-juzao-base-qwen1.5-1.8b \
--model-base /root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/base_model \
--model-type qwen1.5-1.8b \
--save-model-path ./juzao_model_base
In the second merging, you are trying to merge a new lora with the previously merged LLM+lora, so --model-base should be set to where you saved the LLM+lora (as in finetune_lora.sh).
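To illustrate why the base matters, here is a minimal sketch in plain Python. It assumes the standard LoRA merge rule, W + (alpha / r) * B @ A, and uses tiny made-up 2x2 matrices in place of real LLM weights; the function names are hypothetical and are not taken from merge_lora_weights.py.

```python
# Illustrative only: tiny 2x2 "weights" stand in for a real LLM layer.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_b, lora_a, alpha, r):
    """Fold one LoRA update into a weight matrix: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

base = [[1.0, 0.0], [0.0, 1.0]]          # original LLM weight
b1, a1 = [[1.0], [0.0]], [[0.0, 1.0]]    # first LoRA (rank 1)
b2, a2 = [[0.0], [1.0]], [[1.0, 0.0]]    # second LoRA, trained on base + first LoRA

merged_once = merge_lora(base, b1, a1, alpha=2, r=1)
correct = merge_lora(merged_once, b2, a2, alpha=2, r=1)  # base = merged model
wrong = merge_lora(base, b2, a2, alpha=2, r=1)           # base = original LLM

print(correct)  # contains both LoRA updates
print(wrong)    # silently drops the first LoRA update
```

In this thread the wrong base showed up as a shape error before any weights were combined, but even with matching shapes the second LoRA would be applied to weights it was not trained on.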
The reason this happens is explained here. It has to do with padding in the tokenizer. If you encounter similar errors in the future, please check vocab_size in config.json in your --output_dir.
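A quick way to run that check before merging is sketched below. It assumes both the LoRA output directory and the --model-base directory contain a config.json with a vocab_size field; the helper names are hypothetical, and only the Python standard library is used.

```python
import json
from pathlib import Path

def read_vocab_size(model_dir):
    """Return vocab_size from model_dir/config.json, or None if the key is absent."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        return json.load(f).get("vocab_size")

def vocab_sizes_match(lora_dir, base_dir):
    """Compare vocab_size in the LoRA output dir and in the --model-base dir."""
    lora_size = read_vocab_size(lora_dir)
    base_size = read_vocab_size(base_dir)
    if lora_size != base_size:
        print(f"vocab_size mismatch: {lora_dir}={lora_size}, {base_dir}={base_size}")
        return False
    return True
```

For the paths in this thread, the call would be vocab_sizes_match("./checkpoints-qwen1.5-1.8b/bunny-lora-juzao-base-qwen1.5-1.8b", "/root/siton-glusterfs-eaxtsxdfs/xts/projects/Bunny/base_model"). A False result suggests that --model-base points at a model other than the one the LoRA was trained on.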
Reach out to us if you still have difficulty using Bunny in your project!
Regards
Thank you very much for taking the time to answer my question. I have successfully merged the models. Sadly, the model output is not good.
Is this related to my inference code? num_beams=1, temperature=0.1, max_new_tokens=300
text_input = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n{prompt} ASSISTANT:"
text_chunks = [tokenizer(chunk).input_ids for chunk in text_input.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + text_chunks[1], dtype=torch.long).unsqueeze(0)
output_ids = model.generate(
input_ids,
max_new_tokens=max_new_tokens,
temperature=temperature,
num_beams=num_beams,
do_sample=True,
use_cache=True)[0]
llm_message = tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()
Regards
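One thing worth checking in the snippet above: text_input.split('<image>') discards the image tag, and the two chunks are concatenated with nothing in its place. The sketch below keeps a placeholder id where the tag was. It assumes the LLaVA-style placeholder index -200 that Bunny's quickstart appears to use; the helper name and the tokenize argument are hypothetical, so verify both against your Bunny version.

```python
IMAGE_TOKEN_INDEX = -200  # assumed LLaVA-style placeholder; verify for your Bunny version

def build_input_ids(text, tokenize, image_token_index=IMAGE_TOKEN_INDEX):
    """Tokenize text around exactly one '<image>' tag, keeping a placeholder id in its place."""
    before, after = text.split("<image>")  # raises ValueError unless there is exactly one tag
    return tokenize(before) + [image_token_index] + tokenize(after)

# Usage sketch (assumption: generate() accepts the preprocessed image via images=...):
# ids = build_input_ids(text_input, lambda s: tokenizer(s).input_ids)
# input_ids = torch.tensor(ids, dtype=torch.long).unsqueeze(0)
# output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=300)[0]
```

If the placeholder and the image tensor are missing, the model may be answering from the text prompt alone, which could contribute to poor output.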
There is a wide variety of training data and parameter settings, so I'm not entirely certain how much assistance I can offer you in this case. Here are some basic insights:
Hope it helps!
Regards
I will close this issue since we have reached a consensus on the code.
I used LoRA to fine-tune on my own dataset, but the model now only answers about the content it was trained on and no longer knows other common-sense content, whereas Bunny-v1_0-2B-zh is fine. Do you have any training tricks?
self model Bunny-v1_0-2B-zh