AmenRa / ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
https://amenra.github.io/ranx
MIT License

Problem with r-precision #2

Closed: osf9018 closed this issue 2 years ago

osf9018 commented 2 years ago

Hi,

I tested your code and found that it was easy to use and integrate. Moreover, the results I got are fully consistent with those I previously obtained with a personal implementation of trec_eval, and the computation of the measures is fast. This is clearly an interesting piece of software, and presenting it at the ECIR 2022 demo session is a good thing.

I only had a problem with the R-precision measure. The main issue is that if you replace "ndcg@5" with "r-precision" in the 4th cell of the overview.ipynb notebook, you get:

```
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_28676/2318072837.py in <module>
      1 # Compute NDCG@5
----> 2 evaluate(qrels, run, "r-precision")

/vol/data/ferret/tools-distrib/_research_code/rank_eval/rank_eval/meta_functions.py in evaluate(qrels, run, metrics, return_mean, threads, save_results_in_run)
    149     for m, scores in metric_scores_dict.items():
    150         for i, q_id in enumerate(run.get_query_ids()):
--> 151             run.scores[m][q_id] = scores[i]
    152     # Prepare output -----------------------------------------------------------
    153     if return_mean:

TypeError: 'numpy.float64' object does not support item assignment
```

I first detected the problem while integrating your code into my own tools and got the same error there. Looking at the part of meta_functions.py where the problem arises:

```
143 if type(run) == Run and save_results_in_run:
144     for m, scores in metric_scores_dict.items():
145         if m not in ["r_precision", "r-precision"]:
146             run.mean_scores[m] = np.mean(scores)
147         else:
148             run.scores[m] = np.mean(scores)
149     for m, scores in metric_scores_dict.items():
150         for i, q_id in enumerate(run.get_query_ids()):
151             run.scores[m][q_id] = scores[i]
```

I saw your recent update of this part of the code, but there is still a problem: for R-precision, the mean of the scores is stored in run.scores and not in run.mean_scores. As a consequence, using run.scores to also store the score of each query fails when both the return_mean and save_results_in_run flags are set to True. More globally, I am not sure I understand why you treat R-precision differently from the other measures when computing the mean score.
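For illustration, here is one way that branch could be written so that per-query scores and mean scores never clash, treating R-precision like every other metric (just a sketch reusing the names from the snippet above, not necessarily the fix you applied):

```python
if type(run) == Run and save_results_in_run:
    for m, scores in metric_scores_dict.items():
        # Per-query scores go into run.scores[m], keyed by query id...
        run.scores[m] = {
            q_id: scores[i] for i, q_id in enumerate(run.get_query_ids())
        }
        # ...and the mean goes into run.mean_scores[m], for every metric,
        # including R-precision (no special case needed).
        run.mean_scores[m] = np.mean(scores)
```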

Thank you in advance for your efforts in fixing the issue.

Olivier

AmenRa commented 2 years ago

Hi Olivier,

Thanks for your interest in rank_eval and the kind words.

Yesterday, when I made the last commit, I noticed something was off in that part of the code! I'm going to address the problem in the next few days and get back to you.

Thanks for your feedback.

Have a good one,

Elias

AmenRa commented 2 years ago

@osf9018 the issue is now fixed.

Thanks again for your feedback!

Closing.

osf9018 commented 2 years ago
Hi Elias,

Thanks for the fix.

Until recently, I focused mainly on recall, MAP and R-precision, but since one of my students is using hits@k in the context of question answering, I am also considering including it in the measures I use. If I am not mistaken, you define hits as "the number of relevant documents retrieved". It is not so easy to find a reference definition for this measure, but I have the feeling that it is most likely defined as the fraction of queries for which at least one relevant document is found among their top-k retrieved documents. At the query level, it is then a binary measure: 0 if no relevant document is found among the top-k retrieved documents and 1 if at least one relevant document is found.

I agree that in Information Retrieval we can speak about the number of hits to refer to the number of relevant documents among the documents retrieved for a query, but I have the feeling that hits@k generally refers to a measure with values in the range [0, 1]. Did you define your hits measure with a specific reference in mind?
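For illustration, the two readings could be sketched per query as follows (hypothetical helper functions, not taken from your library; here relevant is the set of relevant document ids for one query and retrieved is its ranked result list):

```python
def hits_count(relevant, retrieved, k):
    # Count reading: number of relevant documents among the top-k results.
    return len(set(retrieved[:k]) & set(relevant))


def hits_at_k_binary(relevant, retrieved, k):
    # Binary reading (common in question answering): 1 if at least one
    # relevant document appears in the top-k results, 0 otherwise.
    return int(hits_count(relevant, retrieved, k) > 0)
```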

Best regards,

Olivier


AmenRa commented 2 years ago

Sorry Olivier, I do not understand what ***@***.*** means. Could you clarify, please?

osf9018 commented 2 years ago
Hi Elias,

I am not sure I understand what you don't understand, since I don't see any ***@***.*** in my message :-) You mean hits@k? More globally, my message was just about the definition of the hits measure.

Olivier


AmenRa commented 2 years ago

Sorry Olivier, I am confused: I see ***@***.*** in your messages on four different browsers on two devices.

I think I probably called that measure Hits because it is the (mean) number of relevant documents retrieved for each query. It is an integer value for each query, not a boolean one. I am sure I saw it in some paper, but I do not recall which one at the moment.

It is a sub-function of other metrics, so I decided to expose it in case someone wants to use it, but maybe I should hide it, as it is not that useful anyway.
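Concretely, the per-query sub-function could be sketched like this (a hypothetical sketch consistent with the examples below, not necessarily the library's actual implementation; qrels holds the relevant document ids for one query and run the ranked list retrieved for it):

```python
def hits(qrels, run, k):
    # Number of relevant documents among the top-k retrieved documents
    # for a single query.
    return sum(1 for doc_id in run[:k] if doc_id in qrels)
```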

For example, precision is something like:

```python
def precision(qrels, run, k):
    # If k is 0 use the number of retrieved documents
    k = k if k != 0 else len(run)

    return hits(qrels, run, k) / k
```

while recall is something like:

```python
def recall(qrels, run, k):
    # If k is 0 use the number of retrieved documents
    # In this case k is used just to avoid useless computations
    # as we divide the number of retrieved relevant documents (hits)
    # by the number of relevant documents later
    k = k if k != 0 else len(run)

    return hits(qrels, run, k) / len(qrels)
```

You can use it for analysis purposes if you want, but I suggest you stick to the other metrics for scientific evaluation and comparison.

osf9018 commented 2 years ago
Hi Elias,

Funny. I guess it is because my message passed through a GitHub email address and the @ character is interpreted as something special in this context.


I see your point, and it is what I had already understood from your code. Perhaps it would be less confusing to change the name of this metric in https://github.com/AmenRa/rank_eval/blob/76b6e241b4c8a860e72c305d95204e5bc04d20bf/rank_eval/meta_functions.py#L29 to something like n_hits:

if metric == "hits": return hits

--> if metric == "n_hits": return hits

This is only a suggestion to avoid misunderstanding from people who know "hits at k" as a metric with the definition I mentioned, but it is not essential.

Best regards,

Olivier

AmenRa commented 2 years ago

Hi Olivier,

Thanks again for your feedback. I will take your suggestion into consideration.

I also want to let you know that my tool is going to change its name soon because of naming similarities with other tools. The new name is ranx. You can already install the library with `pip install ranx`.

Best regards,

Elias