TinyLLaVA / TinyLLaVA_Factory

A Framework of Small-scale Large Multimodal Models
https://arxiv.org/abs/2402.14289
Apache License 2.0

How to merge into a complete model #122

Closed liboaccn closed 3 weeks ago

liboaccn commented 1 month ago

I ran the pretrain and finetune stages following the method in the docs.

The generated output directory is:

-rw-r--r-- 1 work work     11000 Sep 25 08:10 adapter_config.json
-rw-r--r-- 1 work work 323020440 Sep 25 08:10 adapter_model.safetensors
-rw-r--r-- 1 work work       605 Sep 25 08:09 added_tokens.json
drwxr-xr-x 3 work work      4096 Oct  3 01:26 checkpoint-629
-rw-r--r-- 1 work work      2065 Sep 25 08:09 config.json
drwxr-xr-x 2 work work      4096 Sep 25 08:10 connector
drwxr-xr-x 2 work work      4096 Oct  3 01:28 language_model
-rw-r--r-- 1 work work       836 Sep 24 17:07 log.txt
-rw-r--r-- 1 work work   1671853 Sep 25 08:09 merges.txt
-rw-r--r-- 1 work work      5076 Sep 25 08:10 README.md
drwxr-xr-x 6 work work      4096 Sep 24 17:07 runs
-rw-r--r-- 1 work work       645 Sep 25 08:09 special_tokens_map.json
-rw-r--r-- 1 work work      7342 Sep 25 08:09 tokenizer_config.json
-rw-r--r-- 1 work work    110069 Sep 25 08:09 trainer_state.json
drwxr-xr-x 2 work work      4096 Oct  3 01:28 vision_tower
-rw-r--r-- 1 work work   3383407 Sep 25 08:09 vocab.json

How can I merge this into the format of the official repo, e.g. tinyllava/TinyLLaVA-Phi-2-SigLIP-3.1B?

ZhangXJ199 commented 1 month ago

Finetuning is performed on top of the model produced by pretraining, so the final model is simply tiny-llava-.....-finetune; there is no need to merge the two models.

liboaccn commented 4 weeks ago

Finetuning is performed on top of the model produced by pretraining, so the final model is simply tiny-llava-.....-finetune; there is no need to merge the two models.

Sorry, I may not have explained my question clearly. I am not trying to merge the pretrain and finetune results. What I mean is that my post-finetune directory contains separate language_model, vision_tower, and connector subdirectories, whereas the official tinyllava/TinyLLaVA-Phi-2-SigLIP-3.1B repo does not have this structure. So my question is: how do I merge everything into the official repo's layout, so that inference does not require loading the three sub-models plus the LoRA weights separately?

ZhangXJ199 commented 4 weeks ago

The directory structure of TinyLLaVA-Phi-2-SigLIP-3.1B is the result of finetuning. The directory you posted contains connector, language_model, and vision_tower folders, which looks like the result of pretraining.

ZhangXJ199 commented 4 weeks ago

Directory structure after pretrain: [screenshot: 微信图片_20241009075810]
Directory structure after finetune: [screenshot: 微信图片_20241009075955]

liboaccn commented 4 weeks ago

The directory structure of TinyLLaVA-Phi-2-SigLIP-3.1B is the result of finetuning. The directory you posted contains connector, language_model, and vision_tower folders, which looks like the result of pretraining.

That is exactly my confusion. I really did run the finetune script, yet the result has the same directory structure as pretraining, not what your screenshot shows. Is there a setting I need to change somewhere?


deepspeed --include localhost:0,1,2,3,4,5,6,7 --master_port 39501 ../../../tinyllava/train/train.py \
    --deepspeed ../../zero2.json \
    --data_path  $DATA_PATH \
    --image_folder $IMAGE_PATH \
    --is_multimodal True \
    --conv_version $CONV_VERSION \
    --model_name_or_path $LLM_VERSION \
    --vision_tower $VT_VERSION \
    --vision_tower2 "$VT_VERSION2" \
    --connector_type $CN_VERSION \
    --mm_vision_select_layer -2 \
    --image_aspect_ratio square \
    --attn_implementation flash_attention_2 \
    --bf16 True \
    --training_recipe $TRAIN_RECIPE \
    --tune_type_llm lora \
    --tune_type_vision_tower frozen \
    --tune_vision_tower_from_layer 0 \
    --tune_type_connector full \
    --group_by_modality_length True \
    --pretrained_model_path /ssd3/data/mmt/llama3-pretrain \
    --output_dir /ssd3/data/mmt/llama3-sft-lora \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --eval_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length $MODEL_MAX_LENGTH \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --tokenizer_use_fast False \
    --run_name llama3-sft-lora 

Also, when I use qformer as the connector, training runs without errors, but inference raises an error:

[2024-10-10 01:57:32,420] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading language_model_ckp_path /ssd3/data/mmt/llama3-sft-test-lora/language_model/pytorch_model.bin
Loading vision_tower_ckp_path /ssd3/data/mmt/llama3-sft-test-lora/vision_tower/pytorch_model.bin
Loading connector_ckp_path /ssd3/data/mmt/llama3-sft-test-lora/connector/pytorch_model.bin
Traceback (most recent call last):
File "/home/users/work/code/coling/code/ours/./test.py", line 59, in <module>
model, tokenizer, image_processor, context_len = load_pretrained_model(model_path)
File "/home/users/work/code/coling/code/ours/tinyllava/model/load_model.py", line 54, in load_pretrained_model
model.connector.load_state_dict(connector_ckp)
File "/home/users/work/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for QFormerConnector:
Missing key(s) in state_dict: "_connector.bert.embeddings.position_ids".

I am not sure whether this is related to the directory structure issue above, i.e. whether the model is being loaded incorrectly or the training output itself is wrong. Also, I sent a WeChat request to your account (tinyllava); could you accept it?

ZhangXJ199 commented 4 weeks ago

The directory structure differs because train_recipe uses lora (see the difference between the save functions of base and lora under tinyllava/training_recipe). To run inference with this directory structure, you need to modify the load_pretrained_model function in tinyllava/model/load_model.py.
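For reference, the loading pattern described above (one state_dict per submodule, loaded onto the matching attribute) can be sketched with dummy modules standing in for the real language_model / vision_tower / connector; the module shapes and the temp directory here are illustrative only, not the repo's actual code:

```python
import os
import tempfile
import torch
import torch.nn as nn

# Dummy stand-ins for the real submodules.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.language_model = nn.Linear(8, 8)
        self.vision_tower = nn.Linear(8, 8)
        self.connector = nn.Linear(8, 8)

src, dst = TinyModel(), TinyModel()

# Save each component separately, mimicking the lora recipe's layout:
#   <output_dir>/{language_model,vision_tower,connector}/pytorch_model.bin
out_dir = tempfile.mkdtemp()
for name in ("language_model", "vision_tower", "connector"):
    os.makedirs(os.path.join(out_dir, name), exist_ok=True)
    torch.save(getattr(src, name).state_dict(),
               os.path.join(out_dir, name, "pytorch_model.bin"))

# Loading mirrors what a load_pretrained_model adapted to this layout
# would do: read each sub-checkpoint and load it onto the submodule.
for name in ("language_model", "vision_tower", "connector"):
    ckp = torch.load(os.path.join(out_dir, name, "pytorch_model.bin"))
    getattr(dst, name).load_state_dict(ckp)

assert torch.equal(dst.connector.weight, src.connector.weight)
```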

liboaccn commented 4 weeks ago

The directory structure differs because train_recipe uses lora (see the difference between the save functions of base and lora under tinyllava/training_recipe). To run inference with this directory structure, you need to modify the load_pretrained_model function in tinyllava/model/load_model.py.

Looking at the code, if I want LoRA SFT, train_recipe can only be lora; the common recipe is not an option. Do you mean that with train_recipe=lora, the current code cannot directly load the finetune-stage output? That brings me back to my original question: how can the output of train_recipe=lora + tune_type_llm=lora finetuning be merged into the same format as a common finetune?

YingHuTsing commented 3 weeks ago

After LoRA finetuning, the directory structure stores the llm / vision tower / connector separately, which differs from the structure produced by non-LoRA finetuning with full LLM tuning.
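For what it is worth, the merge being asked about ultimately boils down to folding the low-rank LoRA update back into the dense base weight (this is what e.g. peft's merge_and_unload performs per layer). A minimal sketch in plain PyTorch, with hypothetical shapes and an assumed alpha/r scaling, not TinyLLaVA_Factory's own API:

```python
import torch

# Hypothetical base linear weight and LoRA factors (rank r = 4).
d_out, d_in, r = 16, 32, 4
base_w = torch.randn(d_out, d_in)
lora_a = torch.randn(r, d_in)      # "lora_A" in most implementations
lora_b = torch.zeros(d_out, r)     # "lora_B" is zero-initialized
alpha = 8.0                        # assumed lora_alpha hyperparameter
scaling = alpha / r

# Merging folds the low-rank update into the dense weight, after which
# the separate adapter files are no longer needed at inference time.
merged_w = base_w + (lora_b @ lora_a) * scaling

# With a still-zero lora_B the merge is a no-op, as expected.
assert torch.allclose(merged_w, base_w)
```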

YingHuTsing commented 3 weeks ago


WeChat request accepted. With qformer as the connector and LoRA finetuning, inference does produce this error; it is a bug in our code. Line 51 of tinyllava/model/load_model.py needs strict=False added, as follows: model.connector.load_state_dict(connector_ckp, strict=False)
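For context, the missing key is likely a buffer (position_ids) that newer transformers versions register as non-persistent, so it never appears in saved checkpoints; strict=False makes load_state_dict report the mismatch instead of raising. A minimal reproduction with a toy module, unrelated to the repo's actual classes:

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(10, 4)
        # Persistent buffer: expected in state_dicts, analogous to the
        # bert position_ids buffer the error message complains about.
        self.register_buffer("position_ids", torch.arange(10))

model = Toy()
# A checkpoint saved without the buffer (e.g. by a newer library version).
ckp = {"embed.weight": torch.zeros(10, 4)}

try:
    model.load_state_dict(ckp)            # strict=True (default) raises
except RuntimeError as e:
    assert "Missing key(s)" in str(e)

# strict=False loads what matches and reports the rest instead of raising.
result = model.load_state_dict(ckp, strict=False)
assert result.missing_keys == ["position_ids"]
```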