RUC-NLPIR / FlashRAG

⚡FlashRAG: A Python Toolkit for Efficient RAG Research
https://arxiv.org/abs/2405.13576

Add REPLUG-style LLM-supervised signal calculation #17

Closed BUAADreamer closed 4 months ago

BUAADreamer commented 4 months ago

Motivation

It is worth noting that the current FlashRAG implementation of REPLUG directly uses a general-purpose retriever, while the original full method, REPLUG LSR, uses a retriever trained with LM supervision and achieves improved performance.

Feature

I have implemented the process of building the LM-supervised query-document dataset used in REPLUG. For each question, I compute the LM's likelihood of the ground-truth answer when each of the top-k retrieved documents is placed in context. See examples/methods/get_lm_probs.py for details.
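For readers unfamiliar with the scoring step, here is a minimal sketch of the idea, assuming a HuggingFace causal LM; the model choice, prompt template, and function names (answer_log_likelihood, score_documents) are illustrative and not the exact code in examples/methods/get_lm_probs.py:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the actual LM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_log_likelihood(document: str, question: str, answer: str) -> float:
    """Log-likelihood of the ground-truth answer with `document` in context."""
    prompt = f"{document}\nQuestion: {question}\nAnswer:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Tokenize prompt + answer together so the tokenization stays consistent.
    full_ids = tokenizer(prompt + " " + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Logits at position t predict token t+1; sum over the answer tokens only.
    total = 0.0
    for t in range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1):
        total += log_probs[t, full_ids[0, t + 1]].item()
    return total

def score_documents(question: str, answer: str, docs: list[str]) -> list[float]:
    """One likelihood per retrieved document, as used to build the dataset."""
    return [answer_log_likelihood(d, question, answer) for d in docs]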

Example

Here is an example of building a query-document dataset from the NQ test split.

cd examples/methods
python3 utils/get_lm_probs_dataset.py \
    --dataset_name nq \
    --split test \
    --num 4000 \
    --gpu_id 0 \
    --output lmsft.jsonl \
    --topk 20

Here --num is the number of queries to process, --output is the JSONL output path, and --topk is the number of retrieved documents per query.

Running this produces a dataset in the following format:

{"query": "xxx", "pos": ["yyy", "zzz"], "scores": [0.2, 0.8]}
...
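Note that the scores in each record sum to 1. This matches the REPLUG LSR formulation, where the LM's per-document log-likelihoods are softmax-normalized into a target distribution over the retrieved documents; a small sketch of that step (the helper name and temperature beta are illustrative):

import json
import math

def normalize(log_likelihoods: list[float], beta: float = 1.0) -> list[float]:
    # Softmax over LM log-likelihoods gives the target distribution over docs.
    exps = [math.exp(ll / beta) for ll in log_likelihoods]
    z = sum(exps)
    return [e / z for e in exps]

record = {"query": "xxx", "pos": ["yyy", "zzz"], "scores": normalize([-3.2, -1.8])}
print(json.dumps(record))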

Tests

I implemented the corresponding training code in RAG-Retrieval #pr14. I fine-tuned the e5-base-v2 retriever used in FlashRAG, rebuilt the index, and then ran the same evaluation code with the new retriever. Performance before and after fine-tuning is shown below, with a sketch of the training objective after it:
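For reference, REPLUG LSR trains the retriever by minimizing the KL divergence between the retriever's score distribution and the LM's likelihood distribution over the retrieved documents. A hedged PyTorch sketch of that objective (the function name and temperatures gamma/beta are illustrative, not the exact RAG-Retrieval code):

import torch
import torch.nn.functional as F

def lsr_loss(retriever_scores: torch.Tensor, lm_scores: torch.Tensor,
             gamma: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """retriever_scores, lm_scores: [batch, topk] raw scores per document."""
    log_p_retriever = F.log_softmax(retriever_scores / gamma, dim=-1)
    q_lm = F.softmax(lm_scores / beta, dim=-1)
    # KL(Q_LM || P_retriever); F.kl_div takes log-probs as input, probs as target.
    return F.kl_div(log_p_retriever, q_lm, reduction="batchmean")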

(image: performance comparison before and after fine-tuning)

This shows that the dataset-building process in this PR is correct and useful.