WenzhengZhang / EntQA

PyTorch implementation of the EntQA paper
MIT License

Cannot reproduce the results #13

Closed mk2x15 closed 1 year ago

mk2x15 commented 1 year ago

Hi, Wenzheng,

Thanks for your great work! According to Table 1 and Table 3 in your paper, the test F1 and val F1 should be 85.8 and 87.5, respectively. But I got lower results as follows using the trained reader and reader inputs provided in this repo:

Test results: {"pred_total": 4760, "gold_total": 4485, "strong_correct_num": 3902} | test recall 0.8700 | test precision 0.8197 | test F1 0.8441

Val results: {"pred_total": 5110, "gold_total": 4791, "strong_correct_num": 4323} | val recall 0.9023 | val precision 0.846 | val F1 0.8732
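For reference, the local scores above follow directly from the reported counts. Here is a minimal sketch of the micro precision/recall/F1 computation (the function name `prf1` is mine, not from the repo):

```python
def prf1(strong_correct, pred_total, gold_total):
    """Micro precision/recall/F1 from strong-match counts."""
    precision = strong_correct / pred_total
    recall = strong_correct / gold_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Val counts from the reader outputs above:
p, r, f = prf1(4323, 5110, 4791)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.846 0.9023 0.8732
```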

May I know if the above results are reasonable, and how I can reproduce the results in your paper? Thanks!

WenzhengZhang commented 1 year ago

Hi, your results make sense. In Table 1 we report the GERBIL test F1 score rather than the local evaluation score. We have also noticed that GERBIL evaluation gives slightly different numbers from local evaluation; since GERBIL is a black-box tool, I cannot tell you why that happens.

mk2x15 commented 1 year ago

Thanks for your prompt reply. I have another question about the running time. I train the reader model on three Quadro RTX 8000 GPUs with 48GB memory each, but it takes more than 6 hours to finish a single epoch, and inference with the reader model is also very slow, taking more than 3 hours even with multiple GPUs. May I know if this is normal, or is EntQA less efficient than other systems? Thanks!

WenzhengZhang commented 1 year ago

I use A100 GPUs with 40GB memory rather than RTX 8000 for all my experiments. You can check section 3.1 of the paper to get a sense of the running time for my setting. I don't know which other systems you are comparing EntQA with, but I don't agree that EntQA is less efficient. We compared EntQA with GENRE in inference time on the AIDA validation set (each using 1 GPU on the same machine): GENRE takes 1 hour and 10 minutes, excluding 31 minutes to first build a prefix tree. For EntQA, the runtime is linear in the number of candidate entities K.

We can obtain a significant speedup at a minor cost in performance by decreasing the number of candidate entities, which may be another useful feature of the model in controlling the speed-performance tradeoff.
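The speed knob above can be sketched as follows: the reader scores each of the top-K retrieved candidates, so trimming the candidate list before the reader directly reduces inference cost. The function and variable names below are illustrative stand-ins, not EntQA's actual API:

```python
# Hypothetical sketch: keep only the top-k retriever candidates before the
# reader stage. Reader cost is roughly linear in k, so halving k roughly
# halves reader inference time, at a minor cost in final F1.
def top_k_candidates(candidates, scores, k):
    """Return the k candidates with the highest retriever scores."""
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [cand for _, cand in ranked[:k]]

cands = ["e1", "e2", "e3", "e4", "e5"]
scores = [0.9, 0.1, 0.7, 0.4, 0.8]
print(top_k_candidates(cands, scores, k=3))  # ['e1', 'e5', 'e3']
```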