Open jiacheng-ye opened 2 years ago
I've figured out solutions to the questions above. With the default parameters in the codebase, I got 26.15 with BM25.
However, EPR performs even worse (22.9) after training the BERT-based retriever. I ran EPR with:
python run.py dataset=break dpr_epochs=120 gpus=1 partition=NLP
I'm not sure where it went wrong :(
Waiting for your help and thanks in advance.
Hey, this might be related to the fact that you are using a single GPU; the DPR setup benefits greatly from a large batch size. The 31.9% LFEM result in the paper was obtained with 4 GPUs.
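To illustrate the batch-size point: assuming the retriever is trained with DPR-style in-batch negatives, each question is contrasted against the positives of every other question in the same batch, so a batch of size B supplies B - 1 negatives per question on top of any hard negatives. The sketch below is a plain-Python illustration of that loss, not the code in DPR/train_dense_encoder.py; the function names `in_batch_nll` and `negatives_per_question` are made up for this example.

```python
import math


def in_batch_nll(q_vecs, p_vecs):
    """Mean negative log-likelihood with in-batch negatives.

    q_vecs[i] is a question embedding and p_vecs[i] its positive passage;
    every other p_vecs[j] (j != i) in the batch acts as a negative.
    Illustrative only: names and structure are assumptions, not the repo's API.
    """
    losses = []
    for i, q in enumerate(q_vecs):
        # Dot-product similarity of question i against every passage in the batch.
        scores = [sum(a * b for a, b in zip(q, p)) for p in p_vecs]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability assigned to the positive passage.
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


def negatives_per_question(batch_size):
    """Number of in-batch negatives each question sees (hard negatives excluded)."""
    return batch_size - 1
```

With uninformative embeddings (all scores equal) this loss is exactly log(B), so a larger batch makes the contrastive task harder and the training signal richer. Whether adding GPUs actually enlarges the batch depends on how the training script shards it, which is not confirmed in this thread.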
Hi,
Here is the full list of commands:
#!/bin/bash
#SBATCH --job-name=epr_mtop-null_v4
#SBATCH --output=outputs/epr_mtop-null_v4/out.txt
#SBATCH --error=outputs/epr_mtop-null_v4/out.txt
#SBATCH --partition=NLP
#SBATCH --time=12000
#SBATCH --quotatype=reserved
#SBATCH --gres=gpu:2
srun python find_bm25.py output_path=$PWD/data/bm25_mtop-null_a_train.json \
dataset_split=train setup_type=a task_name=mtop +ds_size=null L=50 \
hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
srun accelerate launch --num_processes 2 --main_process_port 24821 \
scorer.py example_file=$PWD/data/bm25_mtop-null_a_train.json \
setup_type=qa \
output_file=$PWD/data/bm25_mtop-null_a_train_scoredqa.json \
batch_size=8 +task_name=mtop +dataset_reader.ds_size=null \
hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
srun python DPR/train_dense_encoder.py train_datasets=[epr_dataset] \
train=biencoder_local \
output_dir=$PWD/experiments/epr_mtop-null_a_train \
datasets.epr_dataset.file=$PWD/data/bm25_mtop-null_a_train_scoredqa.json \
datasets.epr_dataset.setup_type=qa datasets.epr_dataset.hard_neg=true \
datasets.epr_dataset.task_name=mtop datasets.epr_dataset.top_k=5 \
+gradient_accumulation_steps=1 train.batch_size=120 \
train.num_train_epochs=30 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
srun python DPR/generate_dense_embeddings.py \
model_file=$PWD/experiments/epr_mtop-null_a_train/dpr_biencoder.29 \
ctx_src=dpr_epr shard_id=0 num_shards=1 \
out_file=$PWD/experiments/epr_mtop-null_a_train/dpr_enc_index \
ctx_sources.dpr_epr.setup_type=qa \
ctx_sources.dpr_epr.task_name=mtop +ctx_sources.dpr_epr.ds_size=null \
hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
srun python DPR/dense_retriever.py \
model_file=$PWD/experiments/epr_mtop-null_a_train/dpr_biencoder.29 \
qa_dataset=qa_epr ctx_datatsets=[dpr_epr] \
datasets.qa_epr.dataset_split=validation \
encoded_ctx_files=["$PWD/experiments/epr_mtop-null_a_train/dpr_enc_index_*"] \
out_file=$PWD/data/validation_epr_mtop-null_a_train_prompts.json \
ctx_sources.dpr_epr.setup_type=qa \
ctx_sources.dpr_epr.task_name=mtop datasets.qa_epr.task_name=mtop \
hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
srun accelerate launch --num_processes 2 --main_process_port 24821 \
inference.py \
prompt_file=$PWD/data/validation_epr_mtop-null_a_train_prompts.json \
task_name=mtop \
output_file=$PWD/data/validation_epr_mtop-null_a_train_prede.json \
batch_size=10 max_length=1950 \
hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
On the mtop dataset, the number of training examples is 95961. The training loss is around 0.07 after 30 epochs (avg loss per batch is 0.071158).
As I'm using A100 80G GPUs, I only use two of them, since that is sufficient for a batch size of 120. In the end, I got 25.19 on break and 50.87 on mtop. Any advice would be helpful 😂
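For reference, the numbers above translate into a rough count of optimizer updates. This is a back-of-the-envelope sketch that assumes train.batch_size=120 is the global batch size, that each of the 95961 examples yields one training instance, and that the last partial batch is kept; none of these are confirmed by the training script, and `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Approximate number of optimizer updates for a training run.

    Assumes one training instance per example and that the final,
    smaller batch of each epoch is not dropped (both are assumptions).
    """
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


print(optimizer_steps(95961, 120, 30))   # 24000
print(optimizer_steps(95961, 120, 120))  # 96000
```

Under these assumptions the 30-epoch run performs about a quarter of the updates of a 120-epoch run.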
I think dpr_epochs=120 is the correct hyperparameter; the contrastive learning objective improves greatly with more compute. I think the default of dpr_epochs=30 was for when I needed to run a large number of experiments. To recreate our results, 120 epochs are necessary, I think.
I got 49.17 after training for 120 epochs on mtop; it's still weird... 😂
I will run some tests of my own and try to make sense of this thing. I'll keep you updated!
Hi Ohad, do you have any updates? :)
Nice work! Does anyone know where the environment requirements file for EPR is?
Hi Ohad, thanks for your awesome work! I have several questions about using the code:
1) How can I directly perform BM25 retrieval and few-shot inference on the validation set (26.0, as shown in Table 3)?
2) How do I evaluate the results given the predictions?