zhenpengguo closed this issue 8 months ago
Key parameters:
--learning_rate 2e-4 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--dataloader_drop_last True \
--query_max_len 512 \
--passage_max_len 512 \
--train_group_size 16 \
--logging_steps 1 \
--save_steps 2000 \
--save_total_limit 50 \
--ddp_find_unused_parameters False \
--gradient_checkpointing \
--deepspeed stage1.json \
--warmup_ratio 0.1 \
--bf16 \
--use_lora False \
--lora_rank 32 \
--lora_alpha 64 \
--use_flash_attn True \
--target_modules q_proj k_proj v_proj o_proj \
--start_layer 8 \
--head_multi True \
--head_type simple \
--lora_extra_parameters linear_head
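For readers unfamiliar with the data format, here is a minimal sketch (an illustration, not FlagEmbedding's exact code) of what --train_group_size 16 implies: each training example is one query grouped with one positive and 15 negatives, which is also why memory cost grows with this flag. Field names follow the documented query/pos/neg JSONL format; the helper name build_group is hypothetical.

```python
import random

def build_group(example: dict, train_group_size: int = 16) -> list[str]:
    """Assemble the passages scored together for one query.

    Sketch only: mirrors the documented FlagEmbedding JSONL format
    ({"query": ..., "pos": [...], "neg": [...]}), assuming at least
    one positive and one negative per example.
    """
    # one positive plus (train_group_size - 1) sampled negatives
    passages = [random.choice(example["pos"])]
    k = train_group_size - 1
    negs = example["neg"]
    if len(negs) < k:
        # repeat the negative pool until it is large enough to sample from
        negs = negs * (k // len(negs) + 1)
    passages.extend(random.sample(negs, k))
    return passages
```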
These initializations are performed after the model is loaded: https://github.com/FlagOpen/FlagEmbedding/blob/49d1e37aa4d6afa4353e3e10df09a004f6210fc3/FlagEmbedding/llm_reranker/finetune_for_layerwise/load_model.py#L35
When many layers are being trained, the loss is indeed fairly large at the start; it comes down later.
Regarding "the results are very poor, worse than without fine-tuning": does this refer to the rerank results? Do you have any results/examples?
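To make the point about initialization concrete, here is an illustrative sketch (class and argument names are hypothetical, not the actual load_model.py code) of the layerwise design: every candidate exit layer from start_layer onward gets its own one-unit scoring head, and any head whose weights are missing from the checkpoint keeps its random initialization, which is what produces a large loss at step 0.

```python
import torch.nn as nn

# Hypothetical sketch of the layerwise scoring heads; not the exact
# FlagEmbedding implementation.
class LayerWiseHeads(nn.Module):
    def __init__(self, hidden_size: int, num_layers: int, start_layer: int):
        super().__init__()
        # one linear scoring head per candidate exit layer; heads that are
        # absent from the loaded checkpoint stay randomly initialized
        self.start_layer = start_layer
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, 1, bias=False)
            for _ in range(start_layer, num_layers + 1)
        )

    def forward(self, all_hidden_states):
        # all_hidden_states[i]: [batch, seq, hidden] output of layer i
        return [
            head(all_hidden_states[self.start_layer + i][:, -1, :])
            for i, head in enumerate(self.heads)
        ]
```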
Yes, "very poor results" refers to the rerank results. Training for 10 epochs, the training loss dropped from an initial ~170 to 0.0005. Evaluating the 9th-epoch checkpoint on the same evaluation set, taking the Top n:
bge-reranker-v2-minicpm-layerwise without fine-tuning: ~90% hit rate
bge-reranker-v2-minicpm-layerwise after fine-tuning: only ~40% hit rate
I also noticed that when deploying the fine-tuned model, the following message is printed:
Some weights of LayerWiseMiniCPMForCausalLM were not initialized from the model checkpoint at ../FlagEmbedding/output/rerankV8_rerankV1_V5/checkpoint-234 and are newly initialized: ['lm_head.0.linear_head.weight', 'lm_head.1.linear_head.weight', 'lm_head.10.linear_head.weight', 'lm_head.11.linear_head.weight', 'lm_head.12.linear_head.weight', 'lm_head.13.linear_head.weight', 'lm_head.14.linear_head.weight', 'lm_head.15.linear_head.weight', 'lm_head.16.linear_head.weight', 'lm_head.17.linear_head.weight', 'lm_head.18.linear_head.weight', 'lm_head.19.linear_head.weight', 'lm_head.2.linear_head.weight', 'lm_head.20.linear_head.weight', 'lm_head.21.linear_head.......]
- This IS expected if you are initializing LayerWiseMiniCPMForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LayerWiseMiniCPMForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
When I deploy the original (non-fine-tuned) bge-reranker-v2-minicpm-layerwise, this message does not appear.
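A quick way to tell whether the heads were actually saved is to list the tensor names stored in the fine-tuned checkpoint and look for the lm_head.*.linear_head.weight keys. A sketch, assuming the checkpoint was saved as a single model.safetensors file (adjust the file name for sharded or pytorch_model.bin checkpoints):

```python
from safetensors import safe_open

# Path to the fine-tuned checkpoint; the file name is an assumption.
ckpt = "../FlagEmbedding/output/rerankV8_rerankV1_V5/checkpoint-234/model.safetensors"

with safe_open(ckpt, framework="pt", device="cpu") as f:
    head_keys = [k for k in f.keys() if "linear_head" in k]

# An empty list here would mean the scoring heads were never written to disk,
# so they are randomly re-initialized at load time -- consistent with the
# "newly initialized" warning and the broken scores.
print(head_keys)
```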
Testing the fine-tuned model with the following test case:
[
["what is panda?", "hi"],
["what is panda?", "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China."]
]
Model output: [11.1875, 1.7265625]
These scores are the reverse of what is expected: the irrelevant pair ("hi") scores 11.19 while the relevant panda passage scores only 1.73.
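For reference, scores like the ones above come from the layerwise reranker API; below is a minimal sketch of the evaluation call following the model card's usage (cutoff_layers=[28] is the documented example value; substitute the fine-tuned checkpoint path to compare against the base model):

```python
from FlagEmbedding import LayerWiseFlagLLMReranker

# Swap in the fine-tuned checkpoint directory to compare the two models.
reranker = LayerWiseFlagLLMReranker(
    "BAAI/bge-reranker-v2-minicpm-layerwise", use_fp16=True
)

pairs = [
    ["what is panda?", "hi"],
    ["what is panda?", "The giant panda (Ailuropoda melanoleuca), sometimes "
     "called a panda bear or simply panda, is a bear species endemic to China."],
]

# cutoff_layers picks which layer's scoring head produces the output score.
scores = reranker.compute_score(pairs, cutoff_layers=[28])
print(scores)  # the relevant passage should score well above "hi"
```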
Questions:
1. The gap in results is far too large; I suspect a deployment problem, because the message shows some parameters were not initialized. The model directory produced by fine-tuning is missing some files relative to the original bge-reranker-v2-minicpm-layerwise; my approach was to copy the missing files (e.g. config.json) from the original model into the fine-tuned model directory.
2. Based on your reply ("when many layers are being trained, the loss is indeed fairly large at the start"): if I want to change how many layers are fine-tuned, should I adjust the --start_layer parameter? Is there a recommended value?
Thanks, closing~
Some weights of the model checkpoint at ./BAAI/bge-reranker-v2-minicpm-layerwise were not used when initializing LayerWiseMiniCPMForCausalLM: ['lm_head.0.linear_head.weight', 'lm_head.1.linear_head.weight', 'lm_head.10.linear_head.weight', 'lm_head.11.linear_head.weight', 'lm_head.12.linear_head.weight', 'lm_head.13.linear_head.weight', 'lm_head.14.linear_head.weight', 'lm_head.15.linear_head.weight', 'lm_head.16.linear_head.weight', 'lm_head.17.linear_head.weight', 'lm_head.18.linear_head.weight', 'lm_head.19.linear_head.weight', 'lm_head.2.linear_head.weight', 'lm_head.20.linear_head.weight', 'lm_head.21.linear_head.weight', 'lm_head.22.linear_head.weight', 'lm_head.23.linear_head.weight', 'lm_head.24.linear_head.weight', 'lm_head.25.linear_head.weight', 'lm_head.26.linear_head.weight', 'lm_head.27.linear_head.weight', 'lm_head.28.linear_head.weight', 'lm_head.29.linear_head.weight', 'lm_head.3.linear_head.weight', 'lm_head.30.linear_head.weight', 'lm_head.31.linear_head.weight', 'lm_head.32.linear_head.weight', 'lm_head.4.linear_head.weight', 'lm_head.5.linear_head.weight', 'lm_head.6.linear_head.weight', 'lm_head.7.linear_head.weight', 'lm_head.8.linear_head.weight', 'lm_head.9.linear_head.weight']
The initial training loss was:
{'loss': 175.785, 'grad_norm': 262.08528251792393, 'learning_rate': 0.0, 'epoch': 0.04}
{'loss': 183.8679, 'grad_norm': 142.72274258155215, 'learning_rate': 4.306765580733931e-05, 'epoch': 0.08}
{'loss': 183.2763, 'grad_norm': 144.90638141572649, 'learning_rate': 6.826061944859854e-05, 'epoch': 0.12}
{'loss': 170.4639, 'grad_norm': 45.67741263896878, 'learning_rate': 8.613531161467861e-05, 'epoch': 0.16}
Although the loss ends up low, the results are very poor, worse than without fine-tuning.