facebookresearch / multihop_dense_retrieval

Multi-hop dense retrieval for question answering

Results Regarding Table 3 #25

Open xuwh15 opened 2 years ago

xuwh15 commented 2 years ago

Hello, thanks for sharing the code. I was trying to reproduce the HotpotQA results with your code and ran into some problems. @xwhan

I ran retrieval evaluation using `models/q_encoder.pt` and got R@2 = 65.5, which is quite close to Table 1 in the paper.

For retriever training, I used a batch size of 8 on 8 GB GPUs and trained for 5 epochs with this command:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python scripts/train_mhop.py \
    --do_train \
    --prefix new_train \
    --predict_batch_size 1 \
    --model_name roberta-base \
    --train_batch_size 8 \
    --learning_rate 2e-5 \
    --fp16 \
    --train_file data/hotpot/hotpot_train_with_neg_v0.json \
    --predict_file data/hotpot/hotpot_dev_with_neg_v0.json \
    --seed 16 \
    --eval-period 500 \
    --max_c_len 300 \
    --max_q_len 70 \
    --max_q_sp_len 350 \
    --shared-encoder \
    --warmup-ratio 0.1
```

The batch training loss appeared to have converged, and the evaluation mrr_1 and mrr_2 were both above 95. However, when I used this trained model for retrieval evaluation, I got R@2 = 23.6 instead of the 63.7 reported in the Table 3 ablation study.

Would it be possible for you to provide more details on retriever training for HotpotQA? For instance, how many epochs did you train until convergence, and what were the mrr_1 and mrr_2 scores at that point? Would you also be able to release the checkpoint from the retriever training phase? Thank you again.
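For reference, this is the minimal sketch I am using to sanity-check my R@2 numbers. I am assuming R@2 counts a question as a hit only if both gold supporting-passage titles appear in the top-ranked 2-hop chain; the field names (`candidate_chain`, `sp`) are my own, not necessarily the repo's exact output schema:

```python
# Sanity-check sketch for R@2 on HotpotQA-style retrieval output.
# Assumption: each JSONL line has the titles of the top-ranked 2-passage
# chain ("candidate_chain") and the gold supporting titles ("sp").
import json

def recall_at_2(pred_file: str) -> float:
    hits, total = 0, 0
    with open(pred_file) as f:
        for line in f:
            ex = json.loads(line)
            retrieved = set(ex["candidate_chain"][:2])
            gold = set(ex["sp"])
            # A hit only if BOTH gold passages are covered by the top chain.
            hits += int(gold.issubset(retrieved))
            total += 1
    return 100.0 * hits / total

# Example: print(recall_at_2("dev_retrieval_top2.jsonl"))
```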
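One more note on the metrics, in case it helps anyone comparing numbers: if I understand the training loop correctly, mrr_1 and mrr_2 are computed against in-batch candidates rather than the full corpus, which would explain why they can be near 95 while full-corpus R@2 stays low. A minimal sketch of in-batch MRR, with all names mine rather than the repo's:

```python
# Sketch of in-batch MRR: each question is scored against the positive
# passages of every question in the batch; the gold passage sits on the
# diagonal of the score matrix.
import torch

def in_batch_mrr(q_emb: torch.Tensor, c_emb: torch.Tensor) -> float:
    # q_emb: [B, d] question embeddings; c_emb: [B, d] passage embeddings,
    # where c_emb[i] is the gold passage for q_emb[i].
    scores = q_emb @ c_emb.t()                                  # [B, B]
    gold = scores.diag().unsqueeze(1)                           # [B, 1]
    ranks = (scores >= gold).sum(dim=1)                         # 1-based rank
    return (1.0 / ranks.float()).mean().item()
```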

canghongjian commented 2 years ago

Hello, I have run into the same problem: the recall metric from my retrained model is also very low. Have you found a solution? @xuwh15 Thank you.