FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

m3 embedding model pos score(similarity) is getting lower #747

Closed · jhyeom1545 closed this issue 5 months ago

jhyeom1545 commented 5 months ago

I am fine-tuning bge-m3 or bge-m3-unsupervised, and I have a question about the fine-tuning results.

I'm fine-tuning using the Toy Data format from Unified Fine-tuning, with more than 200,000 training examples.

However, after fine-tuning, the neg score improved (the similarity between negative passages and the query decreased), while the pos score did not improve but actually got worse (the similarity between positive passages and the query also decreased). Can you tell me why this happens?

We looked at the average similarity over the same 100 queries: the BAAI/bge-m3 model had a similarity of about 0.6, which dropped to 0.5 or 0.4 after fine-tuning.

To increase the similarity, I tried training with an instruction attached to the query, but the results were similar.

Is there a way for me to improve the similarity?

I'm applying knowledge distillation using the scores of the m3 reranker, and fine-tuning with the command below.

CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup torchrun --nproc_per_node 8 --master_port 29501 > './logs/train_m3_240502_score.out' \
-m FlagEmbedding.BGE_M3.run \
--deepspeed './deepspeed/ds_config.json' \
--knowledge_distillation True \
--output_dir './result/240502/' \
--model_name_or_path 'BAAI/bge-m3-unsupervised' \
--normlized True \
--temperature 0.02 \
--do_train  \
--train_data './final_split' \
--cache_path './cache' \
--per_device_train_batch_size 1 \
--query_max_len 512 \
--passage_max_len 8192 \
--small_threshold 200 \
--drop_threshold 200 \
--fp16  \
--train_group_size 8 \
--learning_rate 5e-6 \
--num_train_epochs 10 \
--negatives_cross_device True \
--logging_steps 10 \
--warmup_ratio 0.1 \
--weight_decay 0.01 \
--overwrite_output_dir True \
--gradient_checkpointing \
--save_strategy 'steps' \
--save_steps 424 \
--save_total_limit 30 \
--sentence_pooling_method cls \
--same_task_within_batch True \
--shuffle_ratio 0.002 \
--enable_sub_batch True \
--unified_finetuning True \
--use_self_distill True &
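
For reference, here is roughly what one line of my training JSONL looks like with knowledge distillation enabled. This is a minimal sketch based on my reading of the Toy Data example; the field names pos_scores and neg_scores are my understanding of the format, so please point out if they are wrong:

{"query": "what is bge-m3?", "pos": ["BGE-M3 is a multilingual embedding model ..."], "neg": ["BERT is a language model ...", "Paris is the capital of France ..."], "pos_scores": [7.8], "neg_scores": [1.2, -0.5]}

The pos_scores and neg_scores here are the raw, unnormalized scores from the reranker.
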
ngothanhnam0910 commented 5 months ago

@jhyeom1545 Hi, I want to ask: when you use the m3 reranker scores for fine-tuning, do you normalize the reranker scores to (0, 1) beforehand?

staoxiao commented 5 months ago

@jhyeom1545 , It could be that there is noise in your data, i.e. wrong positive and negative samples. You can try to filter the training data.

staoxiao commented 5 months ago

@ngothanhnam0910, the normalized scores are not appropriate for fine-tuning because the distribution becomes too smooth after the softmax. You should use the scores from before normalization.
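
As a quick numeric sketch of what I mean (toy values, not from the actual training code): take one group with raw reranker scores [8.0, 2.0, -1.0] for a positive and two negatives. The softmax target built from the raw scores is sharply peaked on the positive, but after squashing the same scores into (0, 1) with a sigmoid the softmax target becomes almost uniform, so the distillation signal is much weaker:

import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

raw = [8.0, 2.0, -1.0]                 # raw reranker scores: pos, neg, neg
squashed = [sigmoid(x) for x in raw]   # the same scores normalized into (0, 1)

print(softmax(raw))        # ~[0.997, 0.002, 0.000]  -> sharp teacher distribution
print(softmax(squashed))   # ~[0.42, 0.37, 0.20]     -> nearly uniform teacher distribution
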

jhyeom1545 commented 5 months ago

@ngothanhnam0910 Hi, I'm using scores that are not normalized.

In the issue linked below, I received an answer saying that I should use scores that are not normalized: https://github.com/FlagOpen/FlagEmbedding/issues/701
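
Concretely, I generate the teacher scores roughly like this (a sketch of my setup; I'm assuming the reranker here is BAAI/bge-reranker-v2-m3 and that compute_score returns the raw scores when normalize is not enabled, so please correct me if that's wrong):

from FlagEmbedding import FlagReranker

# teacher reranker; use_fp16 speeds things up with a small accuracy cost
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

pairs = [
    ['what is bge-m3?', 'BGE-M3 is a multilingual embedding model ...'],  # positive
    ['what is bge-m3?', 'BERT is a language model ...'],                  # negative
]

# normalize=False keeps the raw scores; normalize=True would squash them with a sigmoid
scores = reranker.compute_score(pairs, normalize=False)
print(scores)  # raw, unbounded scores that go into pos_scores / neg_scores
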

jhyeom1545 commented 5 months ago

@staoxiao Hi, Thanks for your comment.

To remove noise from the data, we only use passages with a reranker score of 2 or more as positive data.

For negative data, I use hn-mine with a sampling range of 35 to 60 using the BAAI/bge-m3 model.

In the case of my fine-tuned model, the negatives do seem to be learned, with their similarity even dropping below zero.

Is there anything else I can try to increase positive similarity?

I'm also thinking about a two-step strategy and wondering whether it would be useful:

Step 1. Train with train_group_size=8 so that the negatives are learned well (this is my current fine-tuned model, where the positive similarity score drops).

Step 2. Continue fine-tuning with train_group_size=1 and no negatives, to raise the positive scores (the query and pos data are the same as in Step 1, just without the neg data).

staoxiao commented 5 months ago

@jhyeom1545, I guess the reason is that the negative samples are too challenging, so the model has to lower the scores. You can use a larger sampling range (e.g., 35-60 -> 1-300).
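
If you mined the negatives with the hn_mine script in this repo, widening the range would look roughly like this (a sketch; the file paths and negative_number are placeholders for your own setup):

python -m FlagEmbedding.baai_general_embedding.finetune.hn_mine \
--model_name_or_path BAAI/bge-m3 \
--input_file ./final_split/train.jsonl \
--output_file ./final_split/train_minedHN.jsonl \
--range_for_sampling 1-300 \
--negative_number 15 \
--use_gpu_for_searching
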

Besides, lower scores of positive samples may not necessarily affect the ranking accuracy. For downstream tasks, such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not the absolute value.
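
As a toy illustration of that point: a metric like recall@1 only depends on which passage ranks first, so if all similarities drop but the positive still ranks above the negatives, the retrieval result is unchanged:

# ranking is unchanged even though the absolute similarities dropped after fine-tuning
before = {'pos': 0.62, 'neg1': 0.35, 'neg2': 0.18}
after = {'pos': 0.41, 'neg1': 0.12, 'neg2': -0.05}

for scores in (before, after):
    ranked = sorted(scores, key=scores.get, reverse=True)
    print(ranked[0])  # 'pos' in both cases -> recall@1 is identical
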