facebookresearch / KILT

Library for Knowledge Intensive Language Tasks
MIT License

Issue with eval_retrieval.py #23

Closed anthonywchen closed 3 years ago

anthonywchen commented 3 years ago

Hello! I believe there is an issue with kilt/eval_retrieval.py: when a k value is specified, the script grabs the k least similar passages instead of the k most similar ones.

For example, on the NQ validation set, the R-Precision using the current evaluation script is 10.43.

When I flip the list of retrieved passages on line 30, i.e. `for provenance in output["provenance"][::-1]:`, I get an R-Precision of 53.68, which aligns with the paper's results.
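
To illustrate the failure mode, here is a minimal sketch; the `provenance` entries and their `score`/`wikipedia_id` values are made up for illustration, not taken from the repo. If a retriever writes its provenance lowest-score-first, truncating to the first k entries keeps the k least similar passages, while reversing first recovers the top ones:

```python
# Minimal sketch of the failure mode; the entries below are hypothetical.
provenance = [
    {"wikipedia_id": "111", "score": 0.1},  # worst match first ...
    {"wikipedia_id": "222", "score": 0.5},
    {"wikipedia_id": "333", "score": 0.9},  # ... best match last
]

k = 1
top_k_buggy = provenance[:k]        # keeps wikipedia_id "111", the worst match
top_k_fixed = provenance[::-1][:k]  # keeps wikipedia_id "333", the best match
```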

nicola-decao commented 3 years ago

Can you please show us an example of where your call fails?

anthonywchen commented 3 years ago

It still runs, so it technically isn't broken; it's just that the numbers returned by the evaluation script don't match the paper's results. Here is a simplified version of my script for NQ:

```python
import json

# Imports assumed from the KILT repo layout.
from kilt import retrieval, eval_retrieval
from kilt import kilt_utils as utils
from kilt.retrievers import DPR_connector

# Do retrieval with the default DPR configuration
retriever = DPR_connector.DPR.from_default_config(name='dpr')
log_directory = utils.create_logdir_with_timestamp('tmp')
logger = utils.init_logging(log_directory, 'dpr', None)
# I created a config just for NQ
config = json.load(open('kilt/configs/nq_data.json'))
retrieval.run(config, retriever, model_name='dpr', logger=logger, debug=False, output_folder='tmp')

# Evaluate retrieval against the gold NQ dev set
guess = 'nq/nq-dev-kilt.jsonl'
gold = 'data/nq-dev-kilt.jsonl'
rank_keys = ['wikipedia_id']
ks = [1, 10, 50]
eval_retrieval.evaluate(gold, guess, ks, rank_keys)
```

Doing this with the current script returns a very low R-Precision, but changing line 30 as I mentioned above makes `eval_retrieval.evaluate` return numbers that match the paper.

anthonywchen commented 3 years ago

Hi! I wanted to follow up and say that the evaluation script works correctly for the DRQA_TFIDF retriever, because its pages are sorted from highest score to lowest.

It fails for the example above because the DPR retriever's output is sorted from lowest to highest. Standardizing the order of retrieved pages should address this; see the sketch below. I'm also happy to submit a pull request if it would help!
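
One possible standardization, as a hedged sketch: sort every provenance list by score, highest first, before the evaluation truncates to k. The `standardize_provenance` helper and the `score` field it relies on are hypothetical, not part of the repo:

```python
# Hypothetical helper (not in the repo): force highest-score-first order
# so that provenance[:k] always keeps the k most similar passages,
# regardless of which retriever produced the output.
def standardize_provenance(output):
    output["provenance"] = sorted(
        output["provenance"],
        key=lambda p: float(p.get("score", 0.0)),
        reverse=True,
    )
    return output
```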

fabiopetroni commented 3 years ago

Hey @anthonywchen, thanks a lot for spotting this! :) If you can create a pull request with a fix, that would be awesome!