FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

Fine-tune bge-reranker-v2-minicpm-layerwise: loss is very high; the logs show that some model parameters were not initialized. #629

Closed zhenpengguo closed 8 months ago

zhenpengguo commented 8 months ago

Some weights of the model checkpoint at ./BAAI/bge-reranker-v2-minicpm-layerwise were not used when initializing LayerWiseMiniCPMForCausalLM: ['lm_head.0.linear_head.weight', 'lm_head.1.linear_head.weight', 'lm_head.10.linear_head.weight', 'lm_head.11.linear_head.weight', 'lm_head.12.linear_head.weight', 'lm_head.13.linear_head.weight', 'lm_head.14.linear_head.weight', 'lm_head.15.linear_head.weight', 'lm_head.16.linear_head.weight', 'lm_head.17.linear_head.weight', 'lm_head.18.linear_head.weight', 'lm_head.19.linear_head.weight', 'lm_head.2.linear_head.weight', 'lm_head.20.linear_head.weight', 'lm_head.21.linear_head.weight', 'lm_head.22.linear_head.weight', 'lm_head.23.linear_head.weight', 'lm_head.24.linear_head.weight', 'lm_head.25.linear_head.weight', 'lm_head.26.linear_head.weight', 'lm_head.27.linear_head.weight', 'lm_head.28.linear_head.weight', 'lm_head.29.linear_head.weight', 'lm_head.3.linear_head.weight', 'lm_head.30.linear_head.weight', 'lm_head.31.linear_head.weight', 'lm_head.32.linear_head.weight', 'lm_head.4.linear_head.weight', 'lm_head.5.linear_head.weight', 'lm_head.6.linear_head.weight', 'lm_head.7.linear_head.weight', 'lm_head.8.linear_head.weight', 'lm_head.9.linear_head.weight']

The training loss at the start of fine-tuning:

    {'loss': 175.785, 'grad_norm': 262.08528251792393, 'learning_rate': 0.0, 'epoch': 0.04}
    {'loss': 183.8679, 'grad_norm': 142.72274258155215, 'learning_rate': 4.306765580733931e-05, 'epoch': 0.08}
    {'loss': 183.2763, 'grad_norm': 144.90638141572649, 'learning_rate': 6.826061944859854e-05, 'epoch': 0.12}
    {'loss': 170.4639, 'grad_norm': 45.67741263896878, 'learning_rate': 8.613531161467861e-05, 'epoch': 0.16}

Although the loss eventually goes down, the fine-tuned model performs very poorly, worse than without fine-tuning.

zhenpengguo commented 8 months ago

Key parameters:

    --learning_rate 2e-4 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --dataloader_drop_last True \
    --query_max_len 512 \
    --passage_max_len 512 \
    --train_group_size 16 \
    --logging_steps 1 \
    --save_steps 2000 \
    --save_total_limit 50 \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing \
    --deepspeed stage1.json \
    --warmup_ratio 0.1 \
    --bf16 \
    --use_lora False \
    --lora_rank 32 \
    --lora_alpha 64 \
    --use_flash_attn True \
    --target_modules q_proj k_proj v_proj o_proj \
    --start_layer 8 \
    --head_multi True \
    --head_type simple \
    --lora_extra_parameters linear_head

545999961 commented 8 months ago

These heads are initialized after the model is loaded: https://github.com/FlagOpen/FlagEmbedding/blob/49d1e37aa4d6afa4353e3e10df09a004f6210fc3/FlagEmbedding/llm_reranker/finetune_for_layerwise/load_model.py#L35 When many layers are being trained, the loss is indeed quite large at the beginning; it comes down later. As for "the results are very poor, worse than without fine-tuning": does that refer to the rerank results? Do you have results or examples?
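
For context, a minimal sketch of what this per-layer head initialization implies (this is not the repository's actual code; the layer count and hidden size below are assumptions inferred from the warning above and the MiniCPM-2B config):

    import torch.nn as nn

    # Hypothetical sketch: one small scoring head per transformer layer from
    # start_layer onward. With an assumed 40-layer backbone and start_layer=8
    # this gives 33 heads, matching lm_head.0 ... lm_head.32 in the warning.
    hidden_size = 2304   # assumed MiniCPM-2B hidden size
    num_layers = 40      # assumed number of transformer layers
    start_layer = 8

    linear_heads = nn.ModuleList(
        nn.Linear(hidden_size, 1, bias=False)
        for _ in range(num_layers - start_layer + 1)
    )

    # Because the heads are created fresh after the backbone is loaded, any
    # scoring weights already learned by bge-reranker-v2-minicpm-layerwise
    # are discarded, which is why fine-tuning from the reranker checkpoint
    # starts with a very large loss.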

zhenpengguo commented 8 months ago

Yes, "very poor results" refers to the rerank results. I trained for 10 epochs; the training loss dropped from about 170 at the start to 0.0005. Taking the checkpoint from the 9th epoch and evaluating top-n hit rate on the same evaluation set: bge-reranker-v2-minicpm-layerwise without fine-tuning has a hit rate of about 90%; after fine-tuning the hit rate is only about 40%.

I also noticed that when deploying the fine-tuned model, the following message appears:

Some weights of LayerWiseMiniCPMForCausalLM were not initialized from the model checkpoint at ../FlagEmbedding/output/rerankV8_rerankV1_V5/checkpoint-234 and are newly initialized: ['lm_head.0.linear_head.weight', 'lm_head.1.linear_head.weight', 'lm_head.10.linear_head.weight', 'lm_head.11.linear_head.weight', 'lm_head.12.linear_head.weight', 'lm_head.13.linear_head.weight', 'lm_head.14.linear_head.weight', 'lm_head.15.linear_head.weight', 'lm_head.16.linear_head.weight', 'lm_head.17.linear_head.weight', 'lm_head.18.linear_head.weight', 'lm_head.19.linear_head.weight', 'lm_head.2.linear_head.weight', 'lm_head.20.linear_head.weight', 'lm_head.21.linear_head.......]
- This IS expected if you are initializing LayerWiseMiniCPMForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LayerWiseMiniCPMForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

When I deploy the original bge-reranker-v2-minicpm-layerwise without fine-tuning, this message does not appear.

Testing the fine-tuned model with the following test case:

 [
        ["what is panda?", "hi"],
        ["what is panda?", "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China."]
    ]

Model output: [11.1875, 1.7265625]. These scores do not match expectations.
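
For reference, the pairs above can be scored with the library's layerwise reranker wrapper; this sketch follows the usage shown in the FlagEmbedding README for this model, so the exact arguments (e.g. cutoff_layers) may differ across versions:

    from FlagEmbedding import LayerWiseFlagLLMReranker

    # Point this at either the original model or the fine-tuned checkpoint dir.
    reranker = LayerWiseFlagLLMReranker(
        "BAAI/bge-reranker-v2-minicpm-layerwise",
        use_fp16=True,
    )

    pairs = [
        ["what is panda?", "hi"],
        ["what is panda?",
         "The giant panda (Ailuropoda melanoleuca), sometimes called a panda "
         "bear or simply panda, is a bear species endemic to China."],
    ]

    # cutoff_layers selects which layer's score head is used for the output.
    scores = reranker.compute_score(pairs, cutoff_layers=[28])
    print(scores)  # the giant-panda passage should score clearly higher than "hi"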

Questions:

1. The results differ far too much, and I suspect a deployment problem, because the warning says some parameters were newly initialized. The model directory produced by fine-tuning is missing some files compared with the original bge-reranker-v2-minicpm-layerwise, so my workaround was to copy the missing files (e.g. config.json) from the original model into the fine-tuned model directory. (A quick way to check whether the score heads were actually saved is sketched after these questions.)

2. Following up on your reply that "when many layers are being trained, the loss is indeed quite large at the beginning": if I want to change the number of fine-tuned layers, should I adjust the --start_layer parameter? Is there a recommended value?
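
A hypothetical sanity check for question 1 (the path and file name below are assumptions; newer checkpoints may store weights in safetensors instead of pytorch_model.bin):

    import torch

    # Hypothetical path: adjust to the actual fine-tuned checkpoint directory.
    ckpt_path = "output/checkpoint-234/pytorch_model.bin"

    state = torch.load(ckpt_path, map_location="cpu")
    head_keys = sorted(k for k in state if "linear_head" in k)
    print(f"{len(head_keys)} score-head tensors found")
    print(head_keys[:3])

    # If this list is empty, the per-layer heads were never saved (or were
    # saved under a different prefix), so the "newly initialized" warning at
    # deploy time means the model scores with random heads; copying config.json
    # from the original model cannot fix that.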

545999961 commented 8 months ago
  1. The fine-tuning code here was written for the minicpm-2b model. Since minicpm-2b has no classification head, the heads have to be initialized. It looks like you fine-tuned starting from the already fine-tuned reranker, but the current fine-tuning code directly re-initializes the final classification heads, which is why the results get worse. Fine-tuning directly from minicpm-2b should give better results. We will later release code for fine-tuning minicpm-reranker directly, so that fine-tuning can start from bge-reranker-v2-minicpm-layerwise.
  2. The larger the start_layer value, the better the final results tend to be (the relationship between start_layer and the number of trained heads is sketched below).
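
Illustrative only: assuming the 40-layer backbone implied by lm_head.0 ... lm_head.32 at start_layer=8, this is how --start_layer relates to the number of score heads that get trained:

    # Assumption: the backbone has 40 transformer layers.
    num_hidden_layers = 40

    for start_layer in (8, 20, 30):
        num_heads = num_hidden_layers - start_layer + 1
        print(f"start_layer={start_layer} -> {num_heads} score heads trained")

    # A larger start_layer trains fewer heads, attached only to the deeper
    # layers, which the maintainer suggests tends to give better final results.
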
zhenpengguo commented 8 months ago

Thanks, closing~