BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0

On the issue of Continuous Fine-tuning #82

Closed: Gary2018X closed this issue 2 months ago

Gary2018X commented 4 months ago

Thanks for your work! I would like to know which gives better results: continuous fine-tuning, or fine-tuning on multiple instruction datasets at once?

Gary2018X commented 4 months ago

I tried it, but there was an error while merging the models:

Traceback (most recent call last):
  File "/Bunny/script/merge_lora_weights.py", line 26, in <module>
    merge_lora(args)
  File "/Bunny/script/merge_lora_weights.py", line 10, in merge_lora
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
  File "/Bunny/bunny/model/builder.py", line 58, in load_pretrained_model
    model = BunnyQwen2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained,
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3447, in from_pretrained
    no_split_modules = model._get_no_split_modules(device_map)
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1769, in _get_no_split_modules
    raise ValueError(
ValueError: SiglipVisionModel does not support `device_map='auto'`. To implement support, the model class needs to implement the `_no_split_modules` attribute.

How should I solve it?

Isaachhh commented 4 months ago

What is the merging command you use?

Gary2018X commented 4 months ago
python script/merge_lora_weights.py \
    --model-path ./checkpoints-qwen1.5-1.8b/bunny-lora-qwen1.5-1.8b \
    --model-base ./models/Qwen1.5-1.8B \
    --model-type qwen1.5-1.8b \
    --save-model-path ./models/model
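
For context, here is a minimal, generic sketch of what a LoRA merge step does (this is not Bunny's actual merge_lora_weights.py, which also restores the vision tower and multimodal projector; the paths simply mirror the command above):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# load the base LLM and its tokenizer
base = AutoModelForCausalLM.from_pretrained("./models/Qwen1.5-1.8B")
tokenizer = AutoTokenizer.from_pretrained("./models/Qwen1.5-1.8B")

# attach the LoRA adapter and fold its deltas back into the base weights
lora = PeftModel.from_pretrained(base, "./checkpoints-qwen1.5-1.8b/bunny-lora-qwen1.5-1.8b")
merged = lora.merge_and_unload()

merged.save_pretrained("./models/model")
tokenizer.save_pretrained("./models/model")
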
Gary2018X commented 4 months ago

model config

{
  "_name_or_path": "./models/Qwen1.5-1.8B",
  "architectures": [
    "BunnyQwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
    "auto_map": {
    "AutoConfig": "configuration_bunny_qwen2.BunnyQwen2Config",
    "AutoModelForCausalLM": "modeling_bunny_qwen2.BunnyQwen2ForCausalLM"
  },
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "freeze_mm_mlp_adapter": false,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "image_aspect_ratio": "pad",
  "initializer_range": 0.02,
  "intermediate_size": 5504,
  "max_position_embeddings": 32768,
  "max_window_layers": 21,
  "mm_hidden_size": 1152,
  "mm_projector_lr": 2e-05,
  "mm_projector_type": "mlp2x_gelu",
  "mm_vision_tower": "./models/siglip-so400m-patch14-384",
  "model_type": "bunny-qwen2",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "num_key_value_heads": 16,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "tokenizer_model_max_length": 2048,
  "tokenizer_padding_side": "right",
  "torch_dtype": "float16",
  "transformers_version": "4.39.1",
  "tune_mm_mlp_adapter": false,
  "use_cache": true,
  "use_mm_proj": true,
  "use_sliding_window": false,
  "continuous_training":true,
  "vocab_size": 151646
}

train.sh

#!/bin/bash

MODEL_TYPE=qwen1.5-1.8b

PRETRAIN_DIR=bunny-$MODEL_TYPE-pretrain
OUTPUT_DIR=bunny-lora-ct-$MODEL_TYPE

mkdir -p ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR

deepspeed bunny/train/train.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./script/deepspeed/zero3.json \
    --model_name_or_path ./models/merged_model \
    --model_type $MODEL_TYPE \
    --version bunny \
    --data_path ./data/Bunny.json \
    --image_folder ./data/image \
    --vision_tower ./models/siglip-so400m-patch14-384 \
    --mm_projector_type mlp2x_gelu \
    --image_aspect_ratio pad \
    --group_by_modality_length False \
    --bf16 True \
    --output_dir ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to none | tee 2>&1 ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR/log.txt
Gary2018X commented 4 months ago

When I updated Transformers to the latest version, there was a new error:

Traceback (most recent call last):
  File "./Bunny/script/merge_lora_weights.py", line 26, in <module>
    merge_lora(args)
  File "./Bunny/script/merge_lora_weights.py", line 10, in merge_lora
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
  File "./Bunny/bunny/model/builder.py", line 58, in load_pretrained_model
    model = BunnyQwen2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained,
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4214, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 887, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([151936, 2048]) in "weight" (which has shape torch.Size([151646, 2048])), this look incorrect.
Gary2018X commented 4 months ago

I figured out why the error occurred: when merging the model after continuous training is completed, the base model should be specified as ./models/merged_model instead of Qwen1.5-1.8B.
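
For reference, a sketch of the corrected merge invocation for the continued run (the checkpoint path follows the train.sh above; the save path is just a placeholder):

python script/merge_lora_weights.py \
    --model-path ./checkpoints-qwen1.5-1.8b/bunny-lora-ct-qwen1.5-1.8b \
    --model-base ./models/merged_model \
    --model-type qwen1.5-1.8b \
    --save-model-path ./models/merged_model_ct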

Isaachhh commented 4 months ago

Great!

And I realize that when evaluating the final model (continuously trained), continuous_training should be set to false. Please pay attention to https://github.com/BAAI-DCAI/Bunny/commit/28e761d4d6191a28bb5b8baa3a2c785e7d18191f.
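
For example, a minimal sketch of flipping that flag in the final merged model's config.json before evaluation (the path is a placeholder):

import json

cfg_path = "./models/merged_model_ct/config.json"   # placeholder path to the final (continuously trained, merged) model
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["continuous_training"] = False                  # disable before evaluating the final model

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)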

Gary2018X commented 4 months ago

Ok, thank you very much for your patient answer

Gary2018X commented 4 months ago

Thanks for your work! I would like to know which gives better results: continuous fine-tuning, or fine-tuning on multiple instruction datasets at once?

Is there any answer to this question?

Gary2018X commented 4 months ago

I conducted an experiment. First, I continuously fine-tuned Bunny v1.0-2B-zh on dataset A with prompt A to obtain model A. Its result is 0.5% worse than direct instruction fine-tuning, which is acceptable to me.

Then I continuously fine-tuned model A on dataset B with prompt B to obtain model B, and evaluated model B on tasks A and B separately. The result on task B is 5 points worse than direct fine-tuning, and the ability on task A is basically lost.

This is also worse than fine-tuning on multiple instruction datasets at once: when I fine-tune on tasks A and B together directly, task A shows no difference compared to fine-tuning on A alone, while task B is 10% worse than fine-tuning on B alone.

Is there any trick you could suggest? Or is my approach not appropriate?

basteran commented 4 months ago

When I updated Transformers to the latest version, there was a new error:

Traceback (most recent call last):
  File "./Bunny/script/merge_lora_weights.py", line 26, in <module>
    merge_lora(args)
  File "./Bunny/script/merge_lora_weights.py", line 10, in merge_lora
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
  File "./Bunny/bunny/model/builder.py", line 58, in load_pretrained_model
    model = BunnyQwen2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained,
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4214, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 887, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([151936, 2048]) in "weight" (which has shape torch.Size([151646, 2048])), this look incorrect.

Hi, I have the same issue but with a different size because of the pad_token_id: ValueError: Trying to set a tensor of shape torch.Size([128257, 4096]) in "weight" (which has shape torch.Size([128256, 4096])), this look incorrect.

How did you solve it?

Isaachhh commented 4 months ago

@basteran Does it relate to eos_token_id of llama-3? https://github.com/BAAI-DCAI/Bunny/issues/75

basteran commented 4 months ago

No, it is related to some other implementation that introduces the pad_token_id, as shown here.

I managed to train the model with LoRA, and now I want to merge the adapters back, but I get the error above.
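
For reference, a minimal sketch of one common way to handle this kind of size mismatch when a new <pad> token was added during training (paths are placeholders, and this is not Bunny's loader):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("./models/Meta-Llama-3-8B")   # placeholder base path
tok = AutoTokenizer.from_pretrained("./checkpoints/llava-pp-lora")         # tokenizer saved with the adapter, includes <pad>

base.resize_token_embeddings(len(tok))                                     # grow 128256 -> 128257 before loading the adapter
model = PeftModel.from_pretrained(base, "./checkpoints/llava-pp-lora")
merged = model.merge_and_unload()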

Isaachhh commented 4 months ago

@basteran We didn't try to expand the vocabulary, so maybe we couldn't help you.

basteran commented 4 months ago

@basteran We didn't try to expand the vocabulary, so maybe we couldn't help you.

What do you mean you didn't try to expand the vocabulary? I see these lines in your code. Aren't you adding the new pad_token_id to the vocabulary and overwriting the old one if it is not defined? Am I missing something?

Thanks for the help!

Isaachhh commented 4 months ago

@basteran What you mentioned is the code for evaluation, not training. When training, we just use 128001 (<|end_of_text|>) as the padding token, as here. But LLaVA++ seems to define a new token <pad> as the padding token here. So, the vocabulary size of Bunny should be 128256, while that of LLaVA++ should be 128257 (128256 + <pad>).
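
To illustrate the difference, a minimal sketch of the two padding strategies (the tokenizer name is a placeholder for a Llama-3 tokenizer):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")   # placeholder

# Bunny's approach: reuse an existing token as padding,
# so the vocabulary stays at 128256 and no embedding resize is needed.
tok.pad_token = "<|end_of_text|>"                                   # token id 128001

# LLaVA++-style approach: register a brand-new <pad> token,
# which grows the vocabulary to 128257 and requires resizing the model's embeddings.
# tok.add_special_tokens({"pad_token": "<pad>"})
# model.resize_token_embeddings(len(tok))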

basteran commented 4 months ago

Ok, I got it. So you add the pad_token_id only at "run" time, but you don't save it in the vocabulary.

Thank you very much for the help! Now I understand what's going on.. I am considering switching to your Bunny repository instead of LLaVA++ 😄

Isaachhh commented 4 months ago

@basteran Well, maybe we need to distinguish between token and token_id.

When training and running, Bunny uses an existing token, <|end_of_text|>, as the padding token. I'm not sure whether I can "save it": since <|end_of_text|> is 128001 in the vocabulary, can I define a new token 128257 which is also <|end_of_text|>?

So, I just pick an existing token to serve as the padding token without modifying the tokenizer much.

Isaachhh commented 3 months ago

I conducted an experiment. First, I continuously fine-tuned Bunny v1.0-2B-zh on dataset A with prompt A to obtain model A. Its result is 0.5% worse than direct instruction fine-tuning, which is acceptable to me.

Then I continuously fine-tuned model A on dataset B with prompt B to obtain model B, and evaluated model B on tasks A and B separately. The result on task B is 5 points worse than direct fine-tuning, and the ability on task A is basically lost.

This is also worse than fine-tuning on multiple instruction datasets at once: when I fine-tune on tasks A and B together directly, task A shows no difference compared to fine-tuning on A alone, while task B is 10% worse than fine-tuning on B alone.

Is there any trick you could suggest? Or is my approach not appropriate?

@Gary2018X

Different kinds of data, and the fraction of each, interact in a complex and comprehensive way, so it's hard to give a simple principle. Performance may be related to the knowledge area of each kind of data and to the conflicts and cooperation between them. Whether to unfreeze the vision tower, as well as the hyper-parameters, may also matter.

From my own perspective, fine-tuning on multiple instruction datasets at once (e.g. Bunny-695K + your own data) may be better.
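
For example, a minimal sketch of mixing Bunny-695K with your own data into a single fine-tuning file (file names are placeholders; both files are assumed to be Bunny/LLaVA-style lists of conversation records):

import json

with open("./data/bunny_695k.json") as f:
    base = json.load(f)
with open("./data/my_task_A_and_B.json") as f:
    custom = json.load(f)

mixed = base + custom                                   # one combined instruction set, trained in a single pass
with open("./data/Bunny_mixed.json", "w") as f:
    json.dump(mixed, f, ensure_ascii=False, indent=2)

Then point --data_path in the fine-tuning script at the combined file.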

Isaachhh commented 2 months ago

Closing the issue for now as there is no further discussion. Feel free to reopen it if there are any other questions.