Open Yazooliu opened 8 months ago
Hi, I updated the README to make it clearer: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives
You can set negative_number to change the number of sampled negatives. range_for_sampling is used to create the candidate pool for sampling. For example, 60-300 means that negatives are sampled from TopK-Docs[60:300].
There is no single best setting that works across all scenarios. I suggest selecting the setting that performs best on your task.
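To make the slicing concrete, here is a minimal sketch of the idea, not the actual hn_mine.py code. The function name sample_hard_negatives and its arguments are hypothetical, and it assumes you already have a list of document IDs ranked by retrieval score for one query.

```python
import random


def sample_hard_negatives(ranked_doc_ids, positives,
                          range_for_sampling="60-300",
                          negative_number=7, seed=None):
    """Sample negatives from a rank window of the retrieved list.

    Hypothetical helper for illustration only; it is not part of FlagEmbedding.
    """
    # "60-300" -> start=60, end=300, i.e. the slice TopK-Docs[60:300]
    start, end = (int(x) for x in range_for_sampling.split("-"))
    # Drop known positives so they are never used as negatives
    candidates = [d for d in ranked_doc_ids[start:end] if d not in positives]
    rng = random.Random(seed)
    # Return fewer than negative_number if the window is too small
    k = min(negative_number, len(candidates))
    return rng.sample(candidates, k)


# Toy ranked list: doc0 is the top hit, doc999 the last one
ranked = [f"doc{i}" for i in range(1000)]
negs = sample_hard_negatives(ranked, {"doc75"}, "60-300", 7, seed=0)
```

In this sketch the window controls how hard the negatives are (a window starting further down the ranking yields easier negatives), while negative_number only controls how many are drawn from that window.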
Thanks for your reply. I will read hn_mine.py to understand the details.
BR Yazhou
In https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives I saw that: range_for_sampling: where to sample negative. For example, 2-100 means sampling negative from top2-top200 documents. You can set larger value to reduce the difficulty of negatives (e.g., set it 60-300 to sample negatives from top50-300 passages)
How should I understand this? I plan to prepare 20,000+ fine-tuning examples (query + pos + neg) to fine-tune the reranker model. In the toy data JSON, I saw that every query has 7 negatives. If I also use 7 negatives per query in my fine-tuning data JSON, how should I set the range_for_sampling parameter?
I am not clear about this part: For example, 2-100 means sampling negative from top2-top200 documents. You can set larger value to reduce the difficulty of negatives (e.g., set it 60-300 to sample negatives from top50-300 passages)
During inference, I will first retrieve the top50-top200 documents and then rerank them.
1. Is there any relationship between the number of negatives per query (e.g., 7) and range_for_sampling?
2. Given the top50-top200 retrieval mentioned above, how should I set range_for_sampling?
3. Do you have any suggestions on how many candidates to rerank in a RAG application (top 50, top 100, top 150), or a good paper that supports a choice?
Thanks for your reply. BR Yazhou