Closed anthonywchen closed 3 years ago
Can you please show us an example of where your call fails?
It still runs, so it technically isn't broken. It's just that the numbers returned by the evaluation script don't match the paper results. Here is a simplified version of my script for NQ
# Do retrieval
retriever = DPR_connector.DPR.from_default_config(name='dpr')
log_directory = utils.create_logdir_with_timestamp('tmp')
logger = None
logger = utils.init_logging(log_directory, 'dpr', logger)
# I created a config just for NQ
config = json.load(open('kilt/configs/nq_data.json'))
retrieval.run(config, retriever, model_name='dpr', logger=logger, debug=False, output_folder='tmp')
# Evaluate retrieval
guess = 'nq/nq-dev-kilt.jsonl'
gold = 'data/nq-dev-kilt.jsonl'
rank_keys = ['wikipedia_id']
ks = [1, 10, 50]
eval_retrieval.evaluate(gold, guess, ks, rank_keys)
Do this using the current script returns a very low R-Precision but changing line 30 like I mentioned above makes eval_retrieval.evaluate
return numbers that match the paper.
Hi! I wanted to follow up and say that the evaluation script works correctly for the DRQA_TFIDF retriever because the pages are sorted from highest score to lowest score.
But it didn't for the above example because it was with the output of the DPR retriever and because the pages here are sorted from lowest to highest. Standardizing the order of retrieved pages should address this. Also happy to submit a pull request if it would help!
Hey @anthonywchen, thanks a lot for spotting this! :) If you can create a pull request with a fix it would be awesome!
Hello! I believe that there is an issue with
kilt/eval_retrieval.py
. In the code, when specifying thek
value, the code grabs the k least similar passages instead of the k most similar passages.For example, on NQ validation set, the R-Precision value using the current evaluation script is 10.43.
When I flip the list of retrieved passages in line 30
for provenance in output["provenance"][::-1]:
I get a R-Precision value of 53.68, which aligns with paper results.