Open ma787639046 opened 2 years ago
Training seems correct. Are you using constrained search during the evaluation?
Also, with BLINK you do not need to use convert_kilt_to_fairseq.
Thanks for your quick response.
1) I use constrained search during the evaluation. The trie is the downloaded kilt_titles_trie_dict.pkl, and I run evaluate_kilt_dataset.py with beam=10, max_len_a=384, max_len_b=15 (see the sketch after this list).
2) I get the BLINK training set in JSON Lines format from blink-train-kilt.jsonl. It appears to be structured the same way as the other KILT datasets, so I simply concatenate blink-train-kilt.jsonl with the 8 other KILT training jsonl files mentioned in the paper into a single file, then shuffle this JSON Lines file with Python's random.shuffle(). For the development set, I concatenate all 11 KILT dev jsonl files into one file. I then process both files with the scripts convert_kilt_to_fairseq.py and preprocess_fairseq.sh.
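For concreteness, the constrained decoding in 1) looks roughly like this, a minimal sketch following the usage in the GENRE README; the local paths, the example sentence, and passing beam=10 through sample() are my own assumptions, not taken from the evaluation script:

```python
import pickle

from genre.fairseq_model import GENRE
from genre.trie import Trie

# Prefix trie over all Wikipedia page titles (kilt_titles_trie_dict.pkl).
with open("data/kilt_titles_trie_dict.pkl", "rb") as f:
    trie = Trie.load_from_dict(pickle.load(f))

# Path is an assumption: point it at the fine-tuned checkpoint directory.
model = GENRE.from_pretrained("models/fairseq_wikipage_retrieval").eval()

# Constrained beam search: at every step only tokens that can still complete
# a valid page title are allowed. beam=10 mirrors my evaluation setting
# (assuming sample() forwards it to the underlying fairseq generator).
predictions = model.sample(
    ["In 1921, Einstein received the Nobel Prize in Physics."],
    beam=10,
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
print(predictions)
```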
Am I doing these steps the right way?
Thanks again!
@nicola-decao
Yes, you are doing it correctly then. I am not sure what is going wrong. Are you sure you are training with the same batch size and number of steps as reported in the paper?
Yes, I reran the whole fine-tuning process on 8 V100 GPUs with torch 1.6.0 + CUDA 10.1. I directly use the training script train.sh with max-tokens per GPU set to 1024, update-freq set to 128, and max-update set to 200000, which should match the hyperparameters reported in Appendix A.3. I get the following results.
model_name | FEV | AY2 | WnWi | WnCw | T-REx | zsRE | NQ | HoPo | TQA | ELI5 | WoW | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|
genre_fairseq_wikipage_retrieval (provided) | 0.84681 | 0.92747 | 0.87691 | 0.7053 | 0.7968 | 0.94844 | 0.64258 | 0.51821 | 0.71114 | 0.1347 | 0.5632 | 0.69742 |
My reproduced model | 0.84203 | 0.92559 | 0.88516 | 0.71048 | 0.7288 | 0.86198 | 0.60416 | 0.40625 | 0.69938 | 0.13603 | 0.58481 | 0.67133 |
The scores for T-REx, zsRE, NQ, HoPo, and TQA are still lower than expected.
That is weird, but I do not know how to help. I do not work at Facebook/Meta anymore, so I cannot re-run experiments or check the original code that was launched. Note: I ran on more GPUs.
Hi, I am trying to reproduce the page-level document retrieval results of GENRE, but my dev scores are significantly lower than those of the model you provided (fairseq_wikipage_retrieval).
Here are the details of my training setup:
Training set: Following Section 4.1 of the paper, I mix and shuffle the BLINK and the 8 KILT jsonl training files into a single file, then process it with the scripts convert_kilt_to_fairseq.py and preprocess_fairseq.sh (see the sketch after this list).
Dev set: I concatenate all 11 KILT dev jsonl files into a single jsonl file and process it the same way.
Training hyperparameters: I use the script train.sh for training and set keep-best-checkpoints=1 to save the model that performs best on the dev set.
Following Appendix A.3, 128 GPUs were used with max-tokens=1024 and update-freq=1. I train on 16 GPUs, so I set max-tokens=8192 to keep the total max tokens per update at 128 * 1024 = 16 * 8192.
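For concreteness, the mixing and shuffling of the training file is roughly the following minimal sketch; the exact jsonl file names and paths are assumptions about my local copies of the BLINK/KILT dumps:

```python
import random

# Assumed local file names: blink-train-kilt.jsonl from BLINK plus the 8 KILT
# training dumps used in the paper; adjust paths to your local copies.
train_files = [
    "blink-train-kilt.jsonl",
    "fever-train-kilt.jsonl",
    "aidayago2-train-kilt.jsonl",
    # ... the remaining KILT training jsonl files
]

lines = []
for path in train_files:
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:  # guard against missing trailing newlines / blank lines
                lines.append(line + "\n")

# Shuffle so minibatches are not dominated by a single task.
random.seed(42)
random.shuffle(lines)

with open("train.jsonl", "w", encoding="utf-8") as f:
    f.writelines(lines)

# train.jsonl (and a dev.jsonl built by simply concatenating the 11 KILT dev
# files) then go through convert_kilt_to_fairseq.py and preprocess_fairseq.sh
# from the GENRE repo.
```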
Here are the KILT dev results of the model you provided (fairseq_wikipage_retrieval) and of my reproduced model.
The results for T-REx, structured_zeroshot (zsRE), NQ, and HotpotQA are lower than those of the model you provided. Could you help me figure out what might be going wrong?
Thank you very much. @nicola-decao