FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

How should I set range_for_sampling when fine-tuning a reranker model? Thanks #404

Open Yazooliu opened 8 months ago

Yazooliu commented 8 months ago

In https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives I saw this: "range_for_sampling: where to sample negative. For example, 2-100 means sampling negative from top2-top200 documents. You can set larger value to reduce the difficulty of negatives (e.g., set it 60-300 to sample negatives from top50-300 passages)".

How should I understand this? I have prepared 20,000+ fine-tuning examples (query + pos + neg) to fine-tune a reranker model. In the toy data JSON, I saw that every query has 7 negatives. If I also use 7 negatives per query in my own fine-tuning JSON, how should I set the range_for_sampling parameter?

The part I am not clear about is: "For example, 2-100 means sampling negative from top2-top200 documents. You can set larger value to reduce the difficulty of negatives (e.g., set it 60-300 to sample negatives from top50-300 passages)".

During inference, I first retrieve the top 50 to top 200 documents and then rerank them. My questions are:
1. Is there any relationship between the number of negative examples per query (e.g., 7) and range_for_sampling?
2. Given the top50-top200 retrieval mentioned above, how should I set range_for_sampling?
3. Do you have any suggestion for how many documents to rerank in a RAG application (top 50, top 100, top 150), or a good paper that supports a choice?

Thanks for your reply. BR, Yazhou

staoxiao commented 8 months ago

Hi, I updated the README to make it clearer: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives

You can set negative_number to change the number of sampled negatives. range_for_sampling defines the pool of candidates that negatives are sampled from. For example, 60-300 means that negatives are sampled from TopK-Docs[60:300].

There is no single best setting that works across all scenarios. I suggest choosing the setting that performs best on your task.
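
To illustrate how the two parameters relate, here is a minimal Python sketch, not the actual hn_mine.py code. It assumes the retrieved documents are already sorted by score, and the helper names parse_range and sample_negatives are hypothetical. The range only decides which slice of the ranking forms the candidate pool; negative_number decides how many negatives are drawn from that pool, so the two can be chosen independently.

```python
import random
from typing import List, Set, Tuple


def parse_range(range_for_sampling: str) -> Tuple[int, int]:
    """Turn a string such as "60-300" into the pair (60, 300)."""
    start, end = (int(part) for part in range_for_sampling.split("-"))
    return start, end


def sample_negatives(
    ranked_docs: List[str],
    positives: Set[str],
    range_for_sampling: str,
    negative_number: int,
    seed: int = 0,
) -> List[str]:
    """Sample negatives from ranked_docs[start:end], skipping known positives."""
    start, end = parse_range(range_for_sampling)
    # The candidate pool is the slice of the ranking selected by the range.
    pool = [doc for doc in ranked_docs[start:end] if doc not in positives]
    # Never ask for more negatives than the pool contains.
    k = min(negative_number, len(pool))
    return random.Random(seed).sample(pool, k)


# Hypothetical ranking of 500 retrieved documents, best match first.
ranked = [f"doc{i}" for i in range(500)]
negs = sample_negatives(
    ranked,
    positives={"doc70"},
    range_for_sampling="60-300",
    negative_number=7,
)
print(len(negs))
```

In this sketch, keeping 7 negatives per query only constrains negative_number. A range that starts deeper in the ranking (for example 60-300 instead of 2-100) draws from lower-ranked documents, which tend to be easier negatives.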

Yazooliu commented 8 months ago

Thanks for your reply. I will read hn_mine.py to understand the details.

BR, Yazhou