FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs

Loss function plot #804

Open sevenandseven opened 5 months ago

sevenandseven commented 5 months ago

Hello, while fine-tuning the embedding and reranker models I noticed that the loss of both models oscillates up and down during fine-tuning, and after training for 1 epoch the loss does not converge. I used 9,100 training examples. What could be causing this, and how should I fix it?

[W&B chart: loss curve, 2024-05-20 14:54]

staoxiao commented 5 months ago

I think this loss curve is normal. You need to further smooth the curve to observe its trend. Besides, you can set --report_to tensorboard to log the loss to TensorBoard and then use the TensorBoard tool to display the loss curve.
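For reference, a minimal sketch of what that could look like, reusing arguments that already appear in this thread. The model path, data path, and output directory are placeholders, and it assumes the Hugging Face Trainer default of writing TensorBoard event files under <output_dir>/runs when --logging_dir is not set.

```bash
# Sketch only: placeholder paths, and the argument list trimmed to what is
# relevant here; keep your other finetuning arguments unchanged.
torchrun --standalone --nnodes=1 --nproc_per_node 1 \
    -m FlagEmbedding.baai_general_embedding.finetune.run \
    --output_dir ./results/bge_finetune_tb \
    --model_name_or_path BAAI/bge-small-zh-v1.5 \
    --train_data ./train_data_minedHN.jsonl \
    --learning_rate 1e-5 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --normlized True \
    --temperature 0.02 \
    --logging_steps 10 \
    --report_to tensorboard

# View (and smooth) the loss curve in the browser; the Trainer writes event
# files under <output_dir>/runs unless --logging_dir is set explicitly.
tensorboard --logdir ./results/bge_finetune_tb/runs
```

TensorBoard's scalar view has a smoothing slider (an exponential moving average), which makes it easier to see the trend through the per-step noise of a contrastive loss.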

sevenandseven commented 5 months ago

OK, thanks for your reply. I would like to ask: during the fine-tuning process, besides adjusting hyperparameters, are there any other, better methods to improve the fine-tuning results?

staoxiao commented 5 months ago

The most important thing is the quality of the data. First, you need to ensure that the positive samples are highly relevant. Besides, you can use this script to mine hard negatives: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives, and change the argument range_for_sampling to adjust the hardness of the negatives.
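For reference, a hedged sketch of that mining step, along the lines of the linked examples/finetune README. File paths are placeholders, and the exact module path and argument names should be checked against the README for the installed version.

```bash
# Sketch of hard-negative mining (verify arguments against the linked README).
# range_for_sampling selects negatives from a rank window of the retrieval
# results: a window closer to the top (e.g. 2-100) gives harder negatives
# than a wider, lower one (e.g. 10-300).
python -m FlagEmbedding.baai_general_embedding.finetune.hn_mine \
    --model_name_or_path BAAI/bge-small-zh-v1.5 \
    --input_file ./train_data.jsonl \
    --output_file ./train_data_minedHN.jsonl \
    --range_for_sampling 2-200 \
    --negative_number 15 \
    --use_gpu_for_searching
```

The mined file can then be passed to the finetune script via --train_data, as in the commands elsewhere in this thread.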

sevenandseven commented 5 months ago

OK, thanks for your reply.

sevenandseven commented 5 months ago

① Overall, my loss function seems to be trending in the right direction, but sometimes the loss suddenly increases sharply. What could be the reasons for these sudden, severe spikes?

[W&B chart: loss curve, 2024-05-23 10:46]

② When I change the command to the following one, it becomes unable to train. What could be the reason for this, and how can I fix it? This is the command:

CUDA_VISIBLE_DEVICES=0,5 torchrun --standalone --nnodes=1 --nproc_per_node 2 -m FlagEmbedding.baai_general_embedding.finetune.run \
    --output_dir ./results/v3.0/bge_small_zhv15_1epoch_noise5 \
    --model_name_or_path /media/ai/HDD/Teamwork/LLM_Embedding_model/Embedding/Embedding/bge-small-zh-v1.5 \
    --train_data /media/ai/HDD/Teamwork/wangenzhi/FlagEmbedding-master/official/FlagEmbedding/fine_data/datav1/query_answer_v23-minedHN-new-0407-30neg.jsonl \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 64 \
    --gradient_accumulation_steps 256 \
    --dataloader_drop_last True \
    --normlized True \
    --temperature 0.02 \
    --query_max_len 64 \
    --passage_max_len 256 \
    --train_group_size 1 \
    --logging_steps 1 \
    --logging_strategy steps \
    --save_steps 100 \
    --save_strategy steps \
    --save_total_limit 10 \
    --overwrite_output_dir true \
    --report_to wandb

This is the loss plot: [W&B chart: loss curve, 2024-05-23 14:08]