allenai / qasper-led-baseline

The calculation of evidence-F1 using TF-IDF baseline #16

Closed JerrryNie closed 2 years ago

JerrryNie commented 2 years ago

Hi, a query can have more than one related piece of evidence. But as shown in https://github.com/allenai/qasper-led-baseline/blob/afd0fb96bf78ce8cd8157639c6f6a6995e4f9089/scripts/evidence_retrieval_heuristic_baselines.py#L45-L47 and https://github.com/allenai/qasper-led-baseline/blob/afd0fb96bf78ce8cd8157639c6f6a6995e4f9089/scripts/evidence_retrieval_heuristic_baselines.py#L12-L23 the script selects only the paragraph most similar to the query when calculating Evidence-F1. I think this may lower the Evidence-F1 of the TF-IDF method.

pdasigi commented 2 years ago

@JerrryNie You are right that selecting only the top-1 TF-IDF result might underestimate the baseline's performance. There isn't a good way to determine how many of the top-k results one should select though. One option is to set k to be the average number of evidence chunks in the training data, and that is 1.6, which is not very different from the current setting. Do you have better ideas?
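
To make the trade-off concrete, here is a minimal sketch of how the choice of k affects a set-overlap F1. It assumes Evidence-F1 is computed as the F1 between the set of predicted paragraphs and the set of gold evidence paragraphs; the repository's evaluator may differ in details. The helper names (`evidence_f1`, `top_k`), the paragraph IDs, and the similarity scores are all hypothetical and stand in for real TF-IDF scores.

```python
from typing import List, Sequence


def evidence_f1(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Set-overlap F1 between predicted and gold evidence paragraphs (assumed definition)."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        return 1.0  # nothing to find and nothing predicted
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def top_k(paragraphs: Sequence[str], scores: Sequence[float], k: int) -> List[str]:
    """Return the k paragraphs with the highest similarity scores."""
    ranked = sorted(range(len(paragraphs)), key=lambda i: scores[i], reverse=True)
    return [paragraphs[i] for i in ranked[:k]]


# Hypothetical paragraph IDs and made-up similarity scores for one question.
paragraphs = ["p0", "p1", "p2", "p3"]
scores = [0.9, 0.1, 0.7, 0.3]

# Question with two gold evidence paragraphs: top-1 caps recall at 0.5.
two_gold = ["p0", "p2"]
f1_two_gold_k1 = evidence_f1(top_k(paragraphs, scores, 1), two_gold)  # about 0.67
f1_two_gold_k2 = evidence_f1(top_k(paragraphs, scores, 2), two_gold)  # 1.0

# Question with one gold evidence paragraph: top-2 halves precision.
one_gold = ["p0"]
f1_one_gold_k1 = evidence_f1(top_k(paragraphs, scores, 1), one_gold)  # 1.0
f1_one_gold_k2 = evidence_f1(top_k(paragraphs, scores, 2), one_gold)  # about 0.67
```

In this toy case a larger k helps the question with two gold paragraphs and hurts the one with a single gold paragraph, which is why a fixed k near the reported average of 1.6 is unlikely to change the aggregate score much in either direction.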