OhadRubin / EPR


How to evaluate results after prediction? #3

Open · jiacheng-ye opened this issue 2 years ago

jiacheng-ye commented 2 years ago

Hi Ohad, thanks for your awesome work! I have two questions about using the code:

1. How do I directly perform BM25 retrieval and few-shot inference on the validation set (the 26.0 shown in Table 3)?
2. How do I evaluate results given the predictions?
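
To be concrete about what I mean by 1): something like the sketch below, just done through the repo's configs. This uses the `rank_bm25` package with made-up data; the repo's actual entry point is `find_bm25.py`.

```python
# Conceptual sketch only -- not the repo's find_bm25.py, which has its own
# dataset readers and Hydra config. Data and tokenization are illustrative.
from rank_bm25 import BM25Okapi

train_questions = [
    "list all flights from boston to denver",
    "what flights leave pittsburgh after noon",
]  # stand-ins for the real training pool
query = "show me flights from boston"  # stand-in for a validation question

bm25 = BM25Okapi([q.split() for q in train_questions])
# Keep the L highest-scoring training examples as few-shot prompt candidates
# (L=50 matches the find_bm25.py invocation later in this thread).
top_l = bm25.get_top_n(query.split(), train_questions, n=50)
```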

jiacheng-ye commented 2 years ago

I've figured out solutions to the questions above. With the default parameters in the codebase, I got 26.15 with BM25.
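
In case it helps others, evaluation seems to come down to comparing each prediction with its reference. A minimal exact-match sketch; the field names are my assumption about the `inference.py` output, and note that break is scored with the normalized LF-EM metric in the paper rather than raw EM:

```python
# Minimal exact-match evaluation sketch. Assumes the inference output is a
# JSON list of records, each holding a model prediction and a gold target;
# the "prediction"/"target" field names are assumptions, not the repo's schema.
import json

def exact_match(path: str) -> float:
    with open(path) as f:
        records = json.load(f)
    hits = sum(r["prediction"].strip() == r["target"].strip() for r in records)
    return 100.0 * hits / len(records)

print(exact_match("data/validation_epr_mtop-null_a_train_prede.json"))
```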

However, EPR performs even worse (22.9) after training the BERT-based retriever. I ran EPR with `python run.py dataset=break dpr_epochs=120 gpus=1 partition=NLP`. I'm not sure where it went wrong :( Waiting for your help, and thanks in advance.

OhadRubin commented 2 years ago

Hey, this might be related to the fact that you are using a single GPU; the DPR setup benefits greatly from a large batch size. The 31.9% LF-EM result from the paper uses 4 GPUs.
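
The sensitivity comes from in-batch negatives: every question in a batch is contrasted against all other passages in the same batch, so batch size B gives each question B - 1 negatives. Roughly (an illustrative sketch with random stand-in embeddings, not our actual trainer):

```python
# Why batch size matters for DPR-style contrastive training: each question is
# scored against every passage in the batch, so batch size B supplies B - 1
# in-batch negatives per question. Random tensors stand in for encoder outputs.
import torch
import torch.nn.functional as F

B, d = 120, 768                    # batch size and embedding dim (illustrative)
q = torch.randn(B, d)              # question embeddings from the question encoder
p = torch.randn(B, d)              # embeddings of each question's positive example

scores = q @ p.T                   # (B, B): diagonal = positives, off-diagonal = negatives
labels = torch.arange(B)           # each question's positive sits at its own index
loss = F.cross_entropy(scores, labels)   # NLL of the positive vs. B - 1 negatives
```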

jiacheng-ye commented 2 years ago

Hi,

Here is the full list of commands:

```bash
#!/bin/bash
#SBATCH --job-name=epr_mtop-null_v4
#SBATCH --output=outputs/epr_mtop-null_v4/out.txt
#SBATCH --error=outputs/epr_mtop-null_v4/out.txt
#SBATCH --partition=NLP
#SBATCH --time=12000
#SBATCH --quotatype=reserved
#SBATCH --gres=gpu:2

# 1) BM25 retrieval over the training set to collect candidate examples
srun python find_bm25.py output_path=$PWD/data/bm25_mtop-null_a_train.json \
     dataset_split=train setup_type=a task_name=mtop +ds_size=null L=50 \
     hydra.run.dir=$PWD/outputs/epr_mtop-null_v4

# 2) Score the BM25 candidates with the LM
srun accelerate launch --num_processes 2 --main_process_port 24821 \
     scorer.py example_file=$PWD/data/bm25_mtop-null_a_train.json \
     setup_type=qa \
     output_file=$PWD/data/bm25_mtop-null_a_train_scoredqa.json \
     batch_size=8 +task_name=mtop +dataset_reader.ds_size=null \
     hydra.run.dir=$PWD/outputs/epr_mtop-null_v4

# 3) Train the dense retriever on the scored candidates
srun python DPR/train_dense_encoder.py train_datasets=[epr_dataset] \
     train=biencoder_local \
     output_dir=$PWD/experiments/epr_mtop-null_a_train \
     datasets.epr_dataset.file=$PWD/data/bm25_mtop-null_a_train_scoredqa.json \
     datasets.epr_dataset.setup_type=qa datasets.epr_dataset.hard_neg=true \
     datasets.epr_dataset.task_name=mtop datasets.epr_dataset.top_k=5 \
     +gradient_accumulation_steps=1 train.batch_size=120 \
     train.num_train_epochs=30 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4

# 4) Encode the training set with the trained retriever
srun python DPR/generate_dense_embeddings.py \
     model_file=$PWD/experiments/epr_mtop-null_a_train/dpr_biencoder.29 \
     ctx_src=dpr_epr shard_id=0 num_shards=1 \
     out_file=$PWD/experiments/epr_mtop-null_a_train/dpr_enc_index \
     ctx_sources.dpr_epr.setup_type=qa \
     ctx_sources.dpr_epr.task_name=mtop +ctx_sources.dpr_epr.ds_size=null \
     hydra.run.dir=$PWD/outputs/epr_mtop-null_v4

# 5) Retrieve prompt examples for the validation set
srun python DPR/dense_retriever.py \
     model_file=$PWD/experiments/epr_mtop-null_a_train/dpr_biencoder.29 \
     qa_dataset=qa_epr ctx_datatsets=[dpr_epr] \
     datasets.qa_epr.dataset_split=validation \
     encoded_ctx_files=["$PWD/experiments/epr_mtop-null_a_train/dpr_enc_index_*"] \
     out_file=$PWD/data/validation_epr_mtop-null_a_train_prompts.json \
     ctx_sources.dpr_epr.setup_type=qa \
     ctx_sources.dpr_epr.task_name=mtop datasets.qa_epr.task_name=mtop \
     hydra.run.dir=$PWD/outputs/epr_mtop-null_v4

# 6) Few-shot inference with the retrieved prompts
srun accelerate launch --num_processes 2 --main_process_port 24821 \
     inference.py \
     prompt_file=$PWD/data/validation_epr_mtop-null_a_train_prompts.json \
     task_name=mtop \
     output_file=$PWD/data/validation_epr_mtop-null_a_train_prede.json \
     batch_size=10 max_length=1950 \
     hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
```

On the mtop dataset, the number of training examples is 95,961. The training loss is around 0.07 after 30 epochs (avg loss per batch: 0.071158).

As I'm using 80GB A100s, I only use two GPUs, which is sufficient for a batch size of 120. In the end, I got 25.19 on break and 50.87 on mtop. Any advice would be helpful 😂
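
For completeness, my understanding of the last step: `inference.py` packs the retrieved examples into a few-shot prompt under the `max_length=1950` token budget, presumably leaving headroom for generation within the LM's context window. A rough sketch of that packing; the field names, whitespace tokenizer, and best-example-last ordering are my assumptions, not the repo's confirmed behavior:

```python
# Illustrative sketch of few-shot prompt packing under a token budget.
# Not the repo's inference.py; record fields and the whitespace "tokenizer"
# are stand-ins, and the ordering convention is an assumption.
def build_prompt(retrieved, test_question, max_tokens=1950):
    n_tokens = lambda s: len(s.split())      # stand-in for a real tokenizer
    budget = max_tokens - n_tokens(test_question)
    prompt = ""
    for ex in retrieved:                     # assumed sorted best-first
        shot = f"{ex['question']}\t{ex['logical_form']}\n"
        cost = n_tokens(shot)
        if cost > budget:
            break
        prompt = shot + prompt               # prepend, so the best shot sits nearest the test question
        budget -= cost
    return prompt + test_question
```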

OhadRubin commented 2 years ago

I think `dpr_epochs=120` is the correct hyperparameter; the contrastive learning objective improves greatly with more compute. The default of `dpr_epochs=30` was from when I needed to run a large number of experiments. To recreate our results, 120 epochs are necessary, I think.

jiacheng-ye commented 2 years ago

I got 49.17 after training for 120 epochs on mtop; it's still weird... 😂

OhadRubin commented 2 years ago

I will run some tests of my own and try to make sense of this thing. I'll keep you updated!

jiacheng-ye commented 2 years ago

Hi Ohad, do you have any updates? :)

RobertMarton commented 2 years ago

Nice work! Does anyone know where to find the environment requirements file for EPR?