fani-lab / RePair

Extensible and Configurable Toolkit for Query Refinement Gold Standard Generation Using Transformers

Update on pytrec_eval usage #8

Closed yogeswarl closed 1 year ago

yogeswarl commented 1 year ago

Dear @hosseinfani, from the looks of it, pytrec_eval is not giving the right results. Even Negar's work uses trec_eval, I believe, and those results have a map score of 1, which is precisely what I am looking for. Hence, I will rewrite my metrics calculation to use trec_eval instead of pytrec_eval.

Below I post the results I observed on the toy dataset in support of my argument.

pytrec_eval: predicted_queries00.metrics.txt

trec_eval: pq00.txt

hosseinfani commented 1 year ago

@yogeswarl I believe there is no problem with pytrec_eval. Probably, you are providing the input incorrectly. pytrec_eval is just a Cython wrapper around trec_eval, so there shouldn't be inconsistencies.

We can use both, but preferably we should use pytrec_eval, as it does not depend on installing a third-party library.
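For reference, a minimal sketch of the input format pytrec_eval expects: nested dicts keyed by string ids, with graded relevance for the qrels and retrieval scores for the run. The extra docid and all scores below are made up for illustration.

```python
import pytrec_eval

# qrels: qid -> docid -> graded relevance (int), mirroring the qrels file columns
qrels = {'1008998': {'3651609': 1},
         '122093':  {'1785921': 1}}

# run: qid -> docid -> retrieval score (float), mirroring a TREC run file
run = {'1008998': {'3651609': 12.3, '7654321': 5.1},
       '122093':  {'1785921': 9.8}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map', 'recip_rank'})
per_query = evaluator.evaluate(run)   # qid -> {'map': ..., 'recip_rank': ...}

for qid, metrics in per_query.items():
    print(qid, metrics['map'])
```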

yogeswarl commented 1 year ago

@hosseinfani, these are edge cases that I believe require your attention and a hint toward the solution.

Throughout the MS MARCO qrels, we can see that there are qrels that are repeated.

Instance 1

1008998 0   3651609 1
1008998 0   3651609 1

Instance 2

122093  0   1785921 1
122093  0   1785921 1

So when I do the predictions, would you want me to run pytrec_eval with the duplicates, or remove them and run without them?

Also, would you like me to remove the queries that returned NaN from the predicted outputs? Removing those queries from the qrels as well would be redundant, since every predicted-queries file will have a different set of queries that were not predicted.

Please advise!

Thank you in advance.

hosseinfani commented 1 year ago

@yogeswarl Duplicate qrels should only affect metrics with a cutoff threshold k, e.g., success_10, since the top-k with and without duplicates might differ. I think the difference should be very minor. Do not remove the duplicates when evaluating, though. At IR (search) time, I enable duplicate removal in lucene.searcher here.

No, do not remove the NaNs. Assume they are changes to a query; removing them adds complexity to the code.

yogeswarl commented 1 year ago

Understood. The predicted_queries (PQ) files available in the channel are without the NaNs.

hosseinfani commented 1 year ago

@yogeswarl

Please do the following: (A)

  1. we need toy.msmarco.passage/t5.base.gc.query.doc
  2. we need toy.msmarco.passage/t5.base.gc.query.docs (the concat version)
  3. we need toy.msmarco.passage/t5.base.gc.doc.query
  4. we need toy.msmarco.passage/t5.base.gc.docs.query (the concat version)

Based on the naming convention, you know what to do, right?

(B) Before you run the t5 on the entire dataset, you can do the last step of the pipeline, that is, keeping those refined queries that make the IR metric better, similar to this code in the ReQue project. I called it aggregation (agg for short):

https://github.com/fani-lab/ReQue/blob/c8695fa540d5df057bfc61c74fdd1252e3b49033/qe/main.py#L139

Make sure the code works with no bugs on the toy results.
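A minimal sketch of that aggregation idea, assuming per-query map scores are already available for the original query and for each predicted refinement (the function and variable names are illustrative, not the ReQue code):

```python
def aggregate(original_map, refined):
    """Keep only refinements that beat the original query's map.

    original_map: map score of the original query
    refined: list of (refined_query, map_score) pairs
    Returns the surviving pairs sorted by map, best first.
    """
    better = [(q, m) for q, m in refined if m > original_map]
    return sorted(better, key=lambda pair: pair[1], reverse=True)

# made-up scores for illustration
print(aggregate(0.5, [('apple company', 0.7), ('apple juice', 0.3), ('apple iPhone', 0.6)]))
# [('apple company', 0.7), ('apple iPhone', 0.6)]
```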

(C) Then you're good to go with the entire dataset.

Let me know if you have any questions

yogeswarl commented 1 year ago

Okay @hosseinfani. Will update you with the results!

yogeswarl commented 1 year ago

@hosseinfani, I am also thinking of adding trec_eval as a git submodule to our project, so that people can simply add one extra command when cloning rather than downloading it on their own. Please suggest.

hosseinfani commented 1 year ago

@yogeswarl good idea.

yogeswarl commented 1 year ago

Okay, will do it with my next push to the repo.

hosseinfani commented 1 year ago

@yogeswarl I know that you use the git command line to push. Please use other tools like Sourcetree. Do not push all your changes; select only the lines and files that are necessary. Also, do not overwrite my changes unless it's necessary.

yogeswarl commented 1 year ago

@hosseinfani, during the searching part, when the searcher encounters a NaN, the code terminates. I am replacing the doc ids with 0, but there is already a doc id 0. (error screenshot) Any suggestions to tackle this issue?

hosseinfani commented 1 year ago

@yogeswarl This means that row.query is NaN, i.e., there is no query. Check whether the query is missing; if it is, don't call search() and instead return an empty list (no docids).
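A sketch of that guard, assuming a pandas row and a generic searcher object; the search(query, k) call is a placeholder, not a specific pyserini signature:

```python
import pandas as pd

def retrieve(row, searcher, k=100):
    # If the refined query is NaN or empty (t5 produced nothing), skip retrieval
    # and return an empty list of docids instead of crashing the searcher.
    if pd.isna(row.query) or not str(row.query).strip():
        return []
    return searcher.search(row.query, k)  # placeholder search(query, k) call
```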

yogeswarl commented 1 year ago

Thank you!! I will try that.

yogeswarl commented 1 year ago

@hosseinfani, the error persists with the map functionality. I got the chance to load Negar's datasets. They use unique queries, i.e., they use the same qrels file, but their queries file contains each query only once, with no duplicates, so each query is run only once. In such scenarios we will have to ignore the duplicate queries found in our training, so our only way to get results is by running the docs.query setting. This is my discovery. I have run the docs.query part of my predictions through the search and eval methods and will update you once I have the results.

hosseinfani commented 1 year ago

@yogeswarl When are you available today in the lab? I'm not sure I understood the issue.

yogeswarl commented 1 year ago

I will be there in the evening after 5.

yogeswarl commented 1 year ago

@hosseinfani, after a long and tiresome search into what is wrong, I figured out that MS MARCO passage retrieval uses MRR as the evaluation metric. On further research into how Anserini tackles this issue, you can see that the metric they calculate is MRR, and digging further I found that they use the set function.

This amounts to dropping duplicates in the ranked file and the relevance judgment file.

Also, with your suggestion, I was able to debug trec_eval and found that it uses a string comparison to check whether the first and second lines have the same qid and pid; if so, it fails.

Then the funniest thing happened (see the attached screenshot).

The first command runs trec_eval, and the same files are passed on to the Anserini script written in Python.

That script led to the discovery of removing duplicates.
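For illustration, a hedged sketch of MRR@10 with duplicate docids collapsed, in the spirit of the set-based script described above (the data structures are simplified assumptions, not the Anserini code):

```python
def mrr_at_10(qrels, run):
    """qrels: qid -> set of relevant docids; run: qid -> ranked list of docids."""
    total, n = 0.0, 0
    for qid, ranked in run.items():
        relevant = qrels.get(qid, set())
        seen, rank = set(), 0
        for docid in ranked:
            if docid in seen:      # drop duplicate docids before counting ranks
                continue
            seen.add(docid)
            rank += 1
            if rank > 10:
                break
            if docid in relevant:
                total += 1.0 / rank
                break
        n += 1
    return total / n if n else 0.0

# toy example: the only relevant doc sits at rank 2 after deduplication -> 0.5
print(mrr_at_10({'q1': {'d2'}}, {'q1': ['d1', 'd1', 'd2']}))
```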

Please advise!

Regards, Yogeswar

yogeswarl commented 1 year ago

One small update: I created a new qrels file by simply loading it and dropping the duplicates using pd.drop_duplicates()!

When passed to trec_eval, this file worked like a charm. Hence, we should not be concerned about tampering with the qrels file.
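A minimal sketch of that cleanup, assuming the standard 4-column TREC qrels layout; the file names are placeholders:

```python
import pandas as pd

# qid, iteration, docid, relevance -- the standard TREC qrels columns
qrels = pd.read_csv('qrels.train.tsv', sep='\t', header=None,
                    names=['qid', 'it', 'docid', 'rel'])
qrels = qrels.drop_duplicates()  # collapses repeated rows such as (1008998, 0, 3651609, 1)
qrels.to_csv('qrels.dedup.tsv', sep='\t', index=False, header=False)
```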

yogeswarl commented 1 year ago

@hosseinfani, when I create the aggregate function, should I make one enormous file that holds all the predicted queries scoring higher than the original query, or individual files for every predicted-queries file, each holding the queries with a higher metric than the original query?

hosseinfani commented 1 year ago

@yogeswarl Yes, one single file including all better changes for each query, sorted by the value of the metric. You could do it like this:

qid, order, query, map
1, -1, apple, 0.5 ==> original query
1, 1, apple company, 0.7 ==> first best change
1, 2, apple iPhone, 0.6 ==> second best change
1, 3, apple fruit, 0.55
2, -1, python, 0.4 ==> original query
2, 1, python programming, 0.7
2, 2, python interpreter, 0.65

Put the name of the metric in the filename.
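A small sketch of writing the aggregate file in that layout, with the metric name in the filename (the rows simply reproduce the example above; the output path is illustrative):

```python
import pandas as pd

metric = 'map'  # the IR metric the changes are ranked by
rows = [        # (qid, order, query, score); order -1 marks the original query
    (1, -1, 'apple', 0.5),
    (1, 1, 'apple company', 0.7),
    (1, 2, 'apple iPhone', 0.6),
    (1, 3, 'apple fruit', 0.55),
    (2, -1, 'python', 0.4),
    (2, 1, 'python programming', 0.7),
    (2, 2, 'python interpreter', 0.65),
]
agg = pd.DataFrame(rows, columns=['qid', 'order', 'query', metric])
agg.to_csv(f'agg.{metric}.csv', index=False)
```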