bge: results of the second fine-tuning are unsatisfactory

wangzhao88 commented 2 months ago

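Hello, and thank you very much for your work. Out of great interest, I reproduced the bge fine-tuning pipeline. The details are as follows. [First fine-tuning] I used Chinese-roberta as the initial model, downloaded data_zh.zip from https://data.baai.ac.cn/details/BAAI-MTP, and ran the first fine-tuning, which produced the model bge_finetune_1.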

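[Second fine-tuning] Following the dataset links given in the paper, I downloaded eight datasets: cMedQA2, dureader, mmarco, Multi-CPR, T2Ranking, ocnli, cmnli, and nli_zh. I then did the following:

(1) For ocnli, cmnli, and nli_zh, I first built the dataset from the labels, in the format {"query":query, "pos":[pos_1, pos_2, ..., pos_N], "neg":[neg_1, neg_2, ..., neg_N]}. I then used the Hugging Face model "shibing624/text2vec-base-chinese" to filter out every pos_x whose similarity to the query was below 0.43. If a query had no pos left, the whole sample was deleted.

(2) The other datasets consist of query/context or title/context pairs. I first used the same Hugging Face model ("shibing624/text2vec-base-chinese") to filter out samples whose query/context or title/context similarity was below 0.43, which gave a dataset in the format {"query":query, "pos":[pos]}. I then used bge_finetune_1 to mine hard negatives with https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/finetune/hn_mine.py, which gave a dataset in the format {"query":query, "pos":[pos], "neg":[neg]}. Training with bge_finetune_1 as the initial model then produced bge_finetune_2.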
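For readers who want to reproduce the similarity filter described above, here is a minimal sketch. It is not the poster's actual script: the function names, the `embed` callable, and the `toy_embed` stand-in are all hypothetical, and in practice `embed` would wrap whatever encoder is used (the post names "shibing624/text2vec-base-chinese"). Only the 0.43 threshold and the rule "drop the sample if no pos survives" come from the post.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_by_similarity(
    samples: List[Dict],
    embed: Callable[[str], Sequence[float]],  # hypothetical: any text -> vector function
    threshold: float = 0.43,                  # threshold quoted in the post
) -> List[Dict]:
    """Keep only positives whose similarity to the query is >= threshold.

    A sample whose positives are all filtered out is dropped entirely.
    Other fields (for example "neg") are passed through unchanged.
    """
    kept = []
    for sample in samples:
        query_vec = embed(sample["query"])
        positives = [
            p for p in sample["pos"] if cosine(query_vec, embed(p)) >= threshold
        ]
        if positives:
            kept.append({**sample, "pos": positives})
    return kept


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in encoder: counts of the letters a-z. For illustration only."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts
```

With a real encoder it is usually much faster to embed all queries and passages in batches first and filter afterwards; the per-pair loop here is only meant to make the rule explicit.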

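[Conclusion] The model from the first fine-tuning performs roughly as expected, but the model from the second fine-tuning loses accuracy on certain C_MTEB datasets. The detailed results are as follows: [Screenshot 2024-06-07, 5:28:20 PM]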

Chinese-roberta + Finetune(MTP-unlabel zh) corresponds to the bge_finetune_1 model, and Chinese-roberta + Finetune(MTP-unlabel zh) + Finetune(MTP-label zh) corresponds to the bge_finetune_2 model.

The training hyperparameters are as follows:

torchrun --nproc_per_node 8 \
. . .
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 5 \
--gradient_accumulation_steps 12 \
--per_device_train_batch_size 200 \
--dataloader_drop_last True \
--max_example_num_per_dataset 10000000000 \
--normlized True \
--temperature 0.02 \
--query_max_len 64 \
--passage_max_len 256 \
--train_group_size 2 \
--negatives_cross_device \
--logging_steps 10 \
--save_steps 10000 \
--query_instruction_for_retrieval ""
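As a rough aid for reading these flags, the sketch below works out the batch sizes they imply. It assumes the usual in-batch contrastive setup: each query comes with `train_group_size` passages (one positive plus negatives), `--negatives_cross_device` shares passages across all GPUs, and the loss is computed per forward pass, so gradient accumulation increases the number of queries per optimizer step but not the pool of negatives each query sees. The function name and the returned keys are made up for illustration.

```python
def contrastive_batch_sizes(
    nproc_per_node: int,
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    train_group_size: int,
) -> dict:
    """Batch-size arithmetic under the assumptions stated above (not read from the training code)."""
    # Queries processed in one forward pass across all GPUs.
    queries_per_forward = nproc_per_node * per_device_train_batch_size
    # Each query contributes train_group_size passages to the shared pool.
    passages_per_forward = queries_per_forward * train_group_size
    # Every passage except the query's own positive acts as a negative.
    negatives_per_query = passages_per_forward - 1
    # Accumulation only multiplies the queries that feed one optimizer step.
    queries_per_optimizer_step = queries_per_forward * gradient_accumulation_steps
    return {
        "queries_per_forward": queries_per_forward,
        "passages_per_forward": passages_per_forward,
        "negatives_per_query": negatives_per_query,
        "queries_per_optimizer_step": queries_per_optimizer_step,
    }


# Values taken from the command in the post.
sizes = contrastive_batch_sizes(
    nproc_per_node=8,
    per_device_train_batch_size=200,
    gradient_accumulation_steps=12,
    train_group_size=2,
)
print(sizes)
```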

Is there anything in my data processing that could still be improved? I would be very grateful for any guidance.

staoxiao commented 2 months ago

Thanks for your attention to our work! Here are some suggestions for constructing the data:

For ocnli, cmnli, and nli_zh, we use the texts labeled 0 as negative examples.
We mine hard negatives only for dureader and mmarco. For T2Ranking, we use the hard negatives provided with the official dataset. For cMedQA2, we don't mine negatives.
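The NLI suggestion above could be sketched roughly as follows. This is an illustration rather than the authors' preprocessing code: the input layout (sentence1, sentence2, label), the function name, the choice of label 1 as the positive label, and the decision to drop queries that lack either a positive or a negative are all assumptions. Only "label 0 becomes a negative" comes from the reply.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_nli_samples(
    rows: Iterable[Tuple[str, str, int]],  # assumed layout: (sentence1, sentence2, label)
    negative_label: int = 0,               # from the reply: label 0 is used as a negative
    positive_label: int = 1,               # assumption: adjust to each dataset's label scheme
) -> List[Dict]:
    """Group sentence pairs by query into {"query", "pos", "neg"} records."""
    grouped = defaultdict(lambda: {"pos": [], "neg": []})
    for query, text, label in rows:
        if label == negative_label:
            grouped[query]["neg"].append(text)
        elif label == positive_label:
            grouped[query]["pos"].append(text)
        # Any other label is ignored in this sketch.
    return [
        {"query": query, "pos": group["pos"], "neg": group["neg"]}
        for query, group in grouped.items()
        if group["pos"] and group["neg"]  # assumption: training needs both
    ]
```

Label encodings differ between ocnli, cmnli, and nli_zh, so the two label arguments should be checked against each dataset before use.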

yangliuIOC commented 2 months ago

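After hard negative mining, performance does indeed drop. My understanding is that the samples may not be distinguishable enough: they are all very similar to begin with, yet one has to be singled out as pos while the rest are treated as neg.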

lulu51230 commented 2 months ago

How can one tell whether a given dataset needs hard negative mining?

FlagOpen / FlagEmbedding

bge: results of the second fine-tuning are unsatisfactory #869