NLPJCL / RAG-Retrieval

Unify Efficient Fine-tuning of RAG Retrieval, including Embedding, ColBERT, ReRanker.

Add LM-supervised tuning using KL loss #14

Closed: BUAADreamer closed this 5 months ago

BUAADreamer commented 5 months ago

Main Features

TODO

Tests

I fine-tuned the e5-base-v2 retriever used in FlashRAG, rebuilt the index, and then used the same code to evaluate the new retriever. Performance before and after fine-tuning the retriever is below:

| Method | NQ EM Score | NQ F1 Score |
| --- | --- | --- |
| REPLUG | 31.36 | 41.53 |
| + finetune | 36.65 | 46.78 |

This shows that the dataset-building process in this PR is useful and correct, and that the training process is at least broadly working.
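For context, here is a minimal sketch of how one such LM-supervised training record could be assembled, REPLUG-style: each retrieved passage is scored by how much it helps the frozen reader LM produce the gold answer, and those scores are kept alongside the query and passages so the retriever can later be distilled toward them. The field names (`query`, `pos`, `neg`, `scores`) and the `build_lmsft_record` helper are illustrative assumptions, not necessarily the exact schema of `lmsft.jsonl`.

```python
import json

def build_lmsft_record(query, passages, lm_scores, top_k=1):
    """Assemble one LM-supervised training record (hypothetical schema).

    passages  : list of retrieved passage strings
    lm_scores : log-likelihood of the gold answer given each passage,
                produced by the frozen reader LM (higher = more helpful)
    """
    # Rank passages by how much they help the LM answer the question.
    ranked = sorted(zip(passages, lm_scores), key=lambda x: x[1], reverse=True)
    record = {
        "query": query,
        "pos": [p for p, _ in ranked[:top_k]],   # best passage(s) as positives
        "neg": [p for p, _ in ranked[top_k:]],   # the rest as hard negatives
        "scores": [s for _, s in ranked],        # LM scores kept for KL distillation
    }
    return json.dumps(record, ensure_ascii=False)
```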

Finetuning command:

```bash
cd rag-retrieval/train/embedding

CUDA_VISIBLE_DEVICES="0" python3 train_embedding.py \
    --model_name_or_path "intfloat/e5-base-v2" \
    --dataset "../../../../FlashRAG/build/methods/lmsft.jsonl" \
    --output_dir "./output/lmsft_example" \
    --batch_size 64 \
    --lr 2e-5 \
    --epochs 5 \
    --save_on_epoch_end 1 \
    --gradient_accumulation_steps 1 \
    --log_with 'wandb' \
    --warmup_proportion 0.1 \
    --neg_nums 15 \
    --temperature 0.01 \
    --query_max_len 128 \
    --passage_max_len 512
```
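For reference, a minimal sketch of the KL-based objective this kind of LM-supervised tuning typically uses (REPLUG-style LSR, assuming cosine similarity and the `--temperature` value above): the retriever's softmax distribution over a query's candidate passages is pushed toward the distribution induced by the frozen LM's scores. The function name, tensor shapes, and KL direction are assumptions for illustration, not necessarily the exact loss added in this PR.

```python
import torch
import torch.nn.functional as F

def lm_supervised_kl_loss(query_emb, passage_embs, lm_scores, temperature=0.01):
    """KL(LM passage distribution || retriever distribution) for one query.

    query_emb    : (d,)   embedding of the query
    passage_embs : (n, d) embeddings of its candidate passages (1 pos + negs)
    lm_scores    : (n,)   LM log-likelihoods of the gold answer given each passage
    """
    # Retriever distribution: cosine similarities sharpened by the temperature.
    sims = F.cosine_similarity(query_emb.unsqueeze(0), passage_embs, dim=-1)
    retriever_log_probs = F.log_softmax(sims / temperature, dim=-1)

    # Target distribution from the frozen LM's scores (detached so only the
    # retriever receives gradients).
    lm_probs = F.softmax(lm_scores.detach(), dim=-1)

    # KL divergence pulls the retriever toward the LM's passage preferences;
    # summing over passages gives the KL for this single query.
    return F.kl_div(retriever_log_probs, lm_probs, reduction="sum")
```

In a training loop this per-query loss would be averaged over the batch; the positive and the `--neg_nums` negatives from each record form the candidate set.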