## Main Features
- LM-supervised tuning code under `train/embedding`
- Pair KL loss
- `pair_score`-style data loading and processing (see the loader sketch after this list)
- A complete new training pipeline, while keeping the original ones intact
- Specific usage of `pair_score` data can be found in `train/embedding/README.md`
- Formatted some code for better readability
- Changed `json` to `jsonl` for standardization
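As a rough illustration of the `pair_score`-style data flow: each jsonl line carries one scored query-passage pair. The field names below are hypothetical, not this PR's actual schema; the authoritative format is documented in `train/embedding/README.md`.

```python
import json

# Hypothetical record layout, for illustration only; see
# train/embedding/README.md for the real pair_score schema:
#   {"query": "...", "content": "...", "score": 0.83}

def load_pair_score(path):
    """Read pair_score-style jsonl: one scored (query, passage) pair per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```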
## TODO
- [x] Reproduce REPLUG-style fine-tuning using the supervision signal from LLaMA3-8B-Instruct and obtain a retriever model (see the loss sketch after this list).
- [x] Use FlashRAG to verify whether the fine-tuned retriever model improves RAG performance in the REPLUG pipeline.
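For context on how the LM signal supervises the retriever: REPLUG minimizes the KL divergence between the retriever's similarity distribution over candidate passages and the LM's likelihood distribution over the same passages. A minimal PyTorch sketch of a pair KL loss in that style (function and parameter names are illustrative, not this PR's actual code):

```python
import torch
import torch.nn.functional as F

def pair_kl_loss(retriever_scores: torch.Tensor,
                 lm_scores: torch.Tensor,
                 tau: float = 0.05,
                 beta: float = 1.0) -> torch.Tensor:
    """KL divergence between the retriever's similarity distribution and
    the LM's likelihood distribution over one query's candidate passages.

    retriever_scores: (batch, n_docs) query-passage similarities
    lm_scores:        (batch, n_docs) LM log-likelihoods of the gold answer
                      conditioned on each passage (the supervision signal)
    """
    # Retriever distribution P_R(d | q), in log space as kl_div expects
    log_p_retriever = F.log_softmax(retriever_scores / tau, dim=-1)
    # LM target distribution Q_LM(d | q)
    q_lm = F.softmax(lm_scores / beta, dim=-1)
    # KL(Q_LM || P_R), averaged over the queries in the batch
    return F.kl_div(log_p_retriever, q_lm, reduction="batchmean")
```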
## Tests
I fine-tuned the e5-v2-base model used in FlashRAG, rebuilt the index, and then used the same code to test the new retriever. Performance before and after fine-tuning the retriever is shown below:
| Method | NQ EM Score | NQ F1 Score |
| --- | --- | --- |
| REPLUG | 31.36 | 41.53 |
| + finetune | 36.65 | 46.78 |
This shows that the dataset-building process in this PR is correct and useful, and that the training process is at least broadly correct.
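For reproducibility, a rough sketch of the FlashRAG evaluation step, following FlashRAG's quickstart API. The config file name and its contents are assumptions, and the exact pipeline class used for the REPLUG reproduction may differ:

```python
from flashrag.config import Config
from flashrag.utils import get_dataset
from flashrag.pipeline import SequentialPipeline

# "my_config.yaml" is hypothetical: it should point at the NQ dataset,
# the rebuilt index, and the fine-tuned e5 checkpoint as the retriever.
config = Config("my_config.yaml")
test_data = get_dataset(config)["test"]

# SequentialPipeline is FlashRAG's basic retrieve-then-generate pipeline;
# the table above was produced with FlashRAG's REPLUG method instead.
pipeline = SequentialPipeline(config)
output_dataset = pipeline.run(test_data, do_eval=True)  # reports EM / F1
```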
## Finetuning command
```bash
cd rag-retrieval/train/embedding
```