Closed Gary2018X closed 2 months ago
I tried, but there was an error while merging the models
Traceback (most recent call last):
File "/Bunny/script/merge_lora_weights.py", line 26, in <module>
merge_lora(args)
File "/Bunny/script/merge_lora_weights.py", line 10, in merge_lora
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
File "/Bunny/bunny/model/builder.py", line 58, in load_pretrained_model
model = BunnyQwen2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained,
File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3447, in from_pretrained
no_split_modules = model._get_no_split_modules(device_map)
File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1769, in _get_no_split_modules
raise ValueError(
ValueError: SiglipVisionModel does not support `device_map='auto'`. To implement support, the model class needs to implement the `_no_split_modules` attribute.
How should I solve it?
What is the merging command you use?
python script/merge_lora_weights.py \
--model-path ./checkpoints-qwen1.5-1.8b/bunny-lora-qwen1.5-1.8b \
--model-base ./models/Qwen1.5-1.8B \
--model-type qwen1.5-1.8b \
--save-model-path ./models/model
model config
{
"_name_or_path": "./models/Qwen1.5-1.8B",
"architectures": [
"BunnyQwen2ForCausalLM"
],
"attention_dropout": 0.0,
"auto_map": {
"AutoConfig": "configuration_bunny_qwen2.BunnyQwen2Config",
"AutoModelForCausalLM": "modeling_bunny_qwen2.BunnyQwen2ForCausalLM"
},
"bos_token_id": 151643,
"eos_token_id": 151643,
"freeze_mm_mlp_adapter": false,
"hidden_act": "silu",
"hidden_size": 2048,
"image_aspect_ratio": "pad",
"initializer_range": 0.02,
"intermediate_size": 5504,
"max_position_embeddings": 32768,
"max_window_layers": 21,
"mm_hidden_size": 1152,
"mm_projector_lr": 2e-05,
"mm_projector_type": "mlp2x_gelu",
"mm_vision_tower": "./models/siglip-so400m-patch14-384",
"model_type": "bunny-qwen2",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"num_key_value_heads": 16,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": false,
"tokenizer_model_max_length": 2048,
"tokenizer_padding_side": "right",
"torch_dtype": "float16",
"transformers_version": "4.39.1",
"tune_mm_mlp_adapter": false,
"use_cache": true,
"use_mm_proj": true,
"use_sliding_window": false,
"continuous_training":true,
"vocab_size": 151646
}
train.sh
#!/bin/bash
MODEL_TYPE=qwen1.5-1.8b
PRETRAIN_DIR=bunny-$MODEL_TYPE-pretrain
OUTPUT_DIR=bunny-lora-ct-$MODEL_TYPE
mkdir -p ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR
deepspeed bunny/train/train.py \
--lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
--deepspeed ./script/deepspeed/zero3.json \
--model_name_or_path ./models/merged_model \
--model_type $MODEL_TYPE \
--version bunny \
--data_path ./data/Bunny.json \
--image_folder ./data/image \
--vision_tower ./models/siglip-so400m-patch14-384 \
--mm_projector_type mlp2x_gelu \
--image_aspect_ratio pad \
--group_by_modality_length False \
--bf16 True \
--output_dir ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR \
--num_train_epochs 5 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to none | tee 2>&1 ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR/log.txt
When I update Transformers to the latest version, there is a new error:
Traceback (most recent call last):
File "./Bunny/script/merge_lora_weights.py", line 26, in <module>
merge_lora(args)
File "./Bunny/script/merge_lora_weights.py", line 10, in merge_lora
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
File "./Bunny/bunny/model/builder.py", line 58, in load_pretrained_model
model = BunnyQwen2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained,
File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
) = cls._load_pretrained_model(
File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4214, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 887, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([151936, 2048]) in "weight" (which has shape torch.Size([151646, 2048])), this look incorrect.
I know why the error occurred. When merging the models after training is complete, the base model should be specified as ./models/merged_model (the model the LoRA was trained on) instead of Qwen1.5-1.8B.
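For anyone hitting the same shape mismatch, a pre-flight check along these lines may catch a wrong `--model-base` before the merge starts. This is only a minimal sketch: it assumes both directories contain a `config.json` with a `vocab_size` field and that the base checkpoint's embedding rows match that field. The function names are hypothetical and not part of Bunny.

```python
import json
from pathlib import Path


def read_vocab_size(model_dir):
    """Return the `vocab_size` field from <model_dir>/config.json."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        return json.load(f)["vocab_size"]


def check_base_matches_lora(lora_dir, base_dir):
    """Raise ValueError if the two configs disagree on vocabulary size."""
    lora_vocab = read_vocab_size(lora_dir)
    base_vocab = read_vocab_size(base_dir)
    if lora_vocab != base_vocab:
        raise ValueError(
            f"vocab_size mismatch: LoRA config has {lora_vocab}, "
            f"base model has {base_vocab}; is --model-base the model "
            "the LoRA was trained on?"
        )
    return lora_vocab
```

With the paths from this thread, the call would be something like `check_base_matches_lora("./checkpoints-qwen1.5-1.8b/bunny-lora-qwen1.5-1.8b", "./models/merged_model")`.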
Great!
And I realize that when evaluating the final model (continuously trained), continuous_training should be set to false.
Please pay attention to https://github.com/BAAI-DCAI/Bunny/commit/28e761d4d6191a28bb5b8baa3a2c785e7d18191f.
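Flipping the flag by hand works, but a small helper can make the change repeatable. The sketch below only edits the JSON file; it assumes the merged model's `config.json` carries the `continuous_training` key shown in the config above, and the function name is hypothetical.

```python
import json
from pathlib import Path


def set_continuous_training(config_path, enabled):
    """Rewrite config.json with `continuous_training` set to `enabled`."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["continuous_training"] = bool(enabled)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Before evaluation one might call `set_continuous_training("./models/model/config.json", False)`, assuming the config was saved under the `--save-model-path` used in the merge command.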
Ok, thank you very much for your patient answer
Thanks for your work. I would like to know which works better: continuous fine-tuning, or fine-tuning on multiple instructions at once?
Is there any answer to this question?
I conducted an experiment. First, I used dataset A and prompt A to continuously fine-tune Bunny v1.0-2B-zh and obtained model A. Its result is 0.5% worse than the result of direct instruction fine-tuning, which is acceptable to me.
Then I used dataset B and prompt B to continuously fine-tune model A and obtained model B. Next, I used model B to evaluate tasks A and B separately. The result on task B is 5 points worse than direct fine-tuning, and the ability on task A is basically lost.
This is also worse than directly fine-tuning on multiple instructions. When tasks A and B are fine-tuned together directly, task A shows no difference compared to fine-tuning on A alone, while task B is 10% worse than fine-tuning on B alone.
Is there any trick you can suggest? Or is my approach not appropriate?
Hi, I have the same issue but with a different size because of the pad_token_id:
ValueError: Trying to set a tensor of shape torch.Size([128257, 4096]) in "weight" (which has shape torch.Size([128256, 4096])), this look incorrect.
How did you solve it?
@basteran Does it relate to the eos_token_id of llama-3? https://github.com/BAAI-DCAI/Bunny/issues/75
No, it is related to some other implementation that introduces the pad_token_id as here.
I managed to train the model with LoRA and now I want to merge the adapters back, but I get the error above.
@basteran We didn't try to expand the vocabulary, so maybe we couldn't help you.
What do you mean you didn't try to expand the vocabulary? I see these lines in your code. Aren't you adding the new pad_token_id to the vocabulary and overwriting the old one if it is not defined? Am I missing something?
Thanks for the help!
@basteran
What you mentioned is the code for evaluation, not training.
When training, we just use 128001 <|end_of_text|> as the padding token as here. But LLaVA++ seems to define a new token <pad> as the padding token here. So, the vocabulary size of Bunny should be 128256 but that of LLaVA++ should be 128257 (128256 + <pad>).
Ok, I got it. So you add the pad_token_id only at "run" time, but you don't save it in the vocabulary.
Thank you very much for the help! Now I understand what's going on.. I am considering switching to your Bunny repository instead of LLaVA++ 😄
@basteran
Well, maybe we need to distinguish token and token_id.
When training and running, Bunny uses an existing token <|end_of_text|> as the padding token. I'm not sure whether I can "save it". Because <|end_of_text|> is 128001 in the vocabulary, can I define a new token 128257 which is also <|end_of_text|>?
So, I just pick an existing token to serve as the padding token without modifying the tokenizer much.
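The difference between the two padding strategies discussed above can be illustrated with a toy vocabulary. This is a sketch with a plain dict standing in for a tokenizer, not the Bunny or LLaVA++ code; the helper names are hypothetical and the only values taken from the thread are the size 128256 and the id 128001 for <|end_of_text|>.

```python
def reuse_existing_pad(vocab, token):
    """Pick an existing token as padding; the vocabulary is unchanged."""
    return vocab[token], len(vocab)


def add_new_pad(vocab, token="<pad>"):
    """Append a new padding token to a copy; the copy grows by one."""
    extended = dict(vocab)
    if token not in extended:
        extended[token] = len(extended)
    return extended[token], len(extended)


# Toy stand-in for a 128256-entry vocabulary (ids 0..128255).
vocab = {f"tok{i}": i for i in range(128256)}
# Rename one entry so that id 128001 is <|end_of_text|>, as in the thread.
vocab["<|end_of_text|>"] = vocab.pop("tok128001")

print(reuse_existing_pad(vocab, "<|end_of_text|>"))  # (128001, 128256)
print(add_new_pad(vocab))                            # (128256, 128257)
```

In the toy version, reusing a token keeps the embedding table at 128256 rows, while appending `<pad>` needs 128257 rows, which matches the two shapes in the error message above.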
@Gary2018X
The different kinds of data and the fraction of each interact in complex ways, so it's hard to give a simple principle. The performance may be related to the knowledge area of each kind of data and to the conflicts and synergies between them. Whether to unfreeze the vision tower and the hyperparameters may also matter.
From my own perspective, fine-tuning multiple instructions at once (e.g. Bunny-695K + your own data) may be better.
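Building one combined dataset for a single fine-tuning run could look roughly like the sketch below. It assumes each dataset file is a JSON list of samples, which is an assumption about the format of files such as the `--data_path` JSON; the function name and the file names in the usage note are hypothetical. It also keeps every sample from each file and does not address the mixing fractions mentioned above.

```python
import json
import random


def mix_instruction_data(paths, seed=0):
    """Concatenate several JSON-list datasets and shuffle the result."""
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        if not isinstance(records, list):
            raise ValueError(f"{path} does not contain a JSON list")
        merged.extend(records)
    # A fixed seed keeps the shuffled order reproducible across runs.
    random.Random(seed).shuffle(merged)
    return merged
```

A call such as `mix_instruction_data(["./data/bunny_695k.json", "./data/task_a.json", "./data/task_b.json"])` returns one list that can be written back out with `json.dump` and passed as `--data_path`.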
Closing the issue for now since there is no further discussion. Feel free to reopen it if there are any other questions.