facebookresearch / GENRE

Autoregressive Entity Retrieval

Failure to Reproduce the Dev Score of GENRE Document Retrieval #90


ma787639046 commented 2 years ago

Hi, I was trying to reproduce the page-level document retrieval results of GENRE, but my dev scores are significantly lower than those of the model you provided, fairseq_wikipage_retrieval.

Here are my details for training:

Training set: Following Section 4.1 of the paper, I mix and shuffle the BLINK & 8 KILT jsonl training files into a single file, then use the scripts convert_kilt_to_fairseq.py & preprocess_fairseq.sh to process it.

Dev set: I concatenate all 11 KILT dev jsonl files into a single jsonl file, then apply the same processing described above.
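For reference, here is a minimal sketch of how I build the mixed training file (the file names and the seed are illustrative, not taken from the repo scripts):

```python
import random

# Illustrative input list: blink-train-kilt.jsonl plus the 8 KILT train
# jsonl files from Section 4.1 (actual names follow the KILT downloads).
train_files = [
    "blink-train-kilt.jsonl",
    "fever-train-kilt.jsonl",
    # ... the remaining KILT train files
]

# Mix all training lines into one list and shuffle them.
lines = []
for path in train_files:
    with open(path, encoding="utf-8") as f:
        lines.extend(f)

random.seed(42)  # fixed seed so the shuffle is reproducible
random.shuffle(lines)

with open("mixed-train-kilt.jsonl", "w", encoding="utf-8") as f:
    f.writelines(lines)
```

The dev set is built the same way, just concatenated without shuffling.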

Training Hyperparameters: I use the script train.sh for training, setting keep-best-checkpoints=1 to save the checkpoint that performs best on the dev set.

Following Appendix A.3, I note that 128 GPUs were used with max-tokens=1024 and update-freq=1. I train on 16 GPUs, so I set max-tokens=8192 to keep the total max tokens per update at 128*1024.
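As a quick sanity check on the batch size (assuming fairseq's effective tokens per update is GPUs * max-tokens * update-freq):

```python
# Paper setup (Appendix A.3): 128 GPUs, max-tokens=1024, update-freq=1.
paper_tokens_per_update = 128 * 1024 * 1  # 131072

# My setup: 16 GPUs, max-tokens=8192, update-freq=1.
my_tokens_per_update = 16 * 8192 * 1      # 131072

assert my_tokens_per_update == paper_tokens_per_update
```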

Here are the dev results on KILT for the provided model fairseq_wikipage_retrieval and my own reproduced model.

| model_name | fever | aidayago2 | wn | cweb | trex | structured_zeroshot | nq | hotpotqa | triviaqa | eli5 | wow |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| genre_fairseq_wikipage_retrieval (provided) | 0.846907 | 0.927467 | 0.876914 | 0.705305 | 0.7968 | 0.948443 | 0.642228 | 0.518214 | 0.71114 | 0.134705 | 0.563196 |
| My reproduced model | 0.826217 | 0.927048 | 0.874264 | 0.713342 | 0.716 | 0.864125 | 0.576665 | 0.399821 | 0.701064 | 0.13935 | 0.570727 |

The results for T-REx, structured_zeroshot, NQ, and HotpotQA are lower than those of the model you provided. Could you help me figure out what is going wrong?

Thank you very much. @nicola-decao

nicola-decao commented 2 years ago

Training seems correct. Are you using constrained search during the evaluation?

nicola-decao commented 2 years ago

Also, with BLINK you do not need to use convert_kilt_to_fairseq.

ma787639046 commented 2 years ago

Thanks for your quick response.

1) I use constrained search during the evaluation (a sketch of the setup follows after this list). The trie is the downloaded kilt_titles_trie_dict.pkl, and I run evaluate_kilt_dataset.py with beam=10, max_len_a=384, max_len_b=15.

2) I obtained the BLINK training set in JSON Lines format from blink-train-kilt.jsonl. It appears to be structured the same way as the other KILT datasets, so I simply concatenated blink-train-kilt.jsonl with the other 8 KILT train jsonl files mentioned in the paper into a single file and shuffled it with Python's random.shuffle(). For the development set, I concatenated all 11 KILT dev jsonl files into one file. I then processed both files with convert_kilt_to_fairseq.py & preprocess_fairseq.sh.
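For completeness, my evaluation essentially follows the constrained beam search shown in the repo README (the checkpoint path below is illustrative):

```python
import pickle

from genre.fairseq_model import GENRE
from genre.trie import Trie

# Prefix trie of Wikipedia page titles used to constrain decoding.
with open("kilt_titles_trie_dict.pkl", "rb") as f:
    trie = Trie.load_from_dict(pickle.load(f))

# Path to my fine-tuned checkpoint (illustrative).
model = GENRE.from_pretrained("models/my_wikipage_retrieval").eval()

# Beam search restricted to sequences that are valid page titles.
model.sample(
    sentences=["Einstein was a German physicist."],
    beam=10,
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
```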

Am I doing this the right way?

Thanks again!

ma787639046 commented 2 years ago

@nicola-decao

nicola-decao commented 2 years ago

Yes, you are doing it correctly then. I am not sure what is going wrong. Are you sure you are training with the same batch size and number of steps as reported in the paper?

ma787639046 commented 2 years ago

Yes, I reran the whole fine-tuning process on 8 V100 GPUs (torch 1.6.0 + CUDA 10.1). I directly used the training script train.sh, with max-tokens per GPU set to 1024, update-freq set to 128, and max-update set to 200000, which should match the hyperparameters reported in Appendix A.3. I get the following results.

| model_name | FEV | AY2 | WnWi | WnCw | T-REx | zsRE | NQ | HoPo | TQA | ELI5 | WoW | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| genre_fairseq_wikipage_retrieval (provided) | 0.84681 | 0.92747 | 0.87691 | 0.7053 | 0.7968 | 0.94844 | 0.64258 | 0.51821 | 0.71114 | 0.1347 | 0.5632 | 0.69742 |
| My reproduced model | 0.84203 | 0.92559 | 0.88516 | 0.71048 | 0.7288 | 0.86198 | 0.60416 | 0.40625 | 0.69938 | 0.13603 | 0.58481 | 0.67133 |

T-REx, zsRE, NQ, HoPo, and TQA are still lower than expected.

nicola-decao commented 2 years ago

That is weird, but I do not know how to help. I no longer work at Facebook/Meta, so I cannot re-run experiments or check the original code that was launched. Note: I ran on more GPUs.