Motivation
It is worth noting that the current FlashRAG implementation of REPLUG uses a general-purpose retriever directly, while the original full method, REPLUG LSR, uses a retriever fine-tuned with LM supervision and achieves better performance.
Feature
I have implemented the process of building LM-supervised query-document datasets for REPLUG. For each question, I compute the likelihood of the ground-truth answer when the LM conditions, in turn, on each of the top-k retrieved documents in context. See examples/methods/get_lm_probs.py for details.
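The per-document scoring step can be sketched as follows. Note that `lm_log_likelihood` here is a toy stand-in for a real LM call (in the actual script the likelihood comes from the language model itself), and the softmax normalization mirrors how REPLUG LSR turns per-document likelihoods into a supervision distribution:

```python
import math

def lm_log_likelihood(answer, document, question):
    # Toy proxy for the real LM call: in practice this would be the LM's
    # log-likelihood of `answer` given "{document} {question}" as context.
    # Here we score by token overlap between document and answer, purely
    # for illustration.
    doc_tokens = set(document.lower().split())
    ans_tokens = answer.lower().split()
    hits = sum(1 for t in ans_tokens if t in doc_tokens)
    return -len(ans_tokens) + hits  # higher means "more likely"

def score_documents(question, answer, documents):
    """Score each top-k document by how much it helps the LM produce the
    ground-truth answer, then normalize with a softmax into a
    distribution over the retrieved documents."""
    lls = [lm_log_likelihood(answer, d, question) for d in documents]
    m = max(lls)  # subtract the max for numerical stability
    exps = [math.exp(ll - m) for ll in lls]
    z = sum(exps)
    return [e / z for e in exps]
```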
Example
Here is an example of building the query-document dataset from the NQ test split:
cd examples/methods
# --num: number of queries to process
# --output: jsonl output path
# --topk: number of documents retrieved per query
python3 utils/get_lm_probs_dataset.py \
    --dataset_name nq \
    --split test \
    --num 4000 \
    --gpu_id 0 \
    --output lmsft.jsonl \
    --topk 20
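The resulting jsonl file can then be loaded line by line for training. The field names in the example below (`query`, `docs`, `lm_probs`) are illustrative assumptions, not the script's guaranteed schema; check the actual output of get_lm_probs_dataset.py:

```python
import json

def load_lm_prob_dataset(path):
    """Load a jsonl dataset: one JSON record per line. Field names such
    as "query", "docs", and "lm_probs" are hypothetical placeholders
    for whatever the script actually emits."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```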
Tests
I implemented the corresponding training code in RAG-Retrieval #pr14. I fine-tuned the e5-v2-base retriever used in FlashRAG, rebuilt the index, and then ran the same evaluation code with the new retriever.
Performance before and after fine-tuning the retriever is shown below:
This shows that the dataset-building process in this PR is useful and correct.
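For reference, the LM-supervised (LSR) training objective that such fine-tuning follows can be sketched as the KL divergence from the retriever's distribution over the top-k documents to the LM-derived one. This uses plain floats for clarity; the actual RAG-Retrieval training code operates on tensors with gradients:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def lsr_kl_loss(retriever_scores, lm_log_likelihoods):
    """KL(Q_LM || P_R) for a single query: pushes the retriever's
    distribution P_R over its top-k documents toward the LM-derived
    distribution Q_LM, as in the REPLUG LSR objective."""
    p_r = softmax(retriever_scores)
    q_lm = softmax(lm_log_likelihoods)
    return sum(q * (math.log(q) - math.log(p))
               for q, p in zip(q_lm, p_r) if q > 0)
```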