AmenRa / ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
https://amenra.github.io/ranx
MIT License

Zero-scored documents #51

Closed · diegoceccarelli closed this issue 11 months ago

diegoceccarelli commented 1 year ago

@mponza found a bug when computing recall and promised to document it next week; adding a reminder here for him. It is something related to multiple documents having a score of zero.

mponza commented 1 year ago

Describe the bug

If a model assigns zero scores to a set of documents, you can still get Recall@k = 1. I am not sure whether this behaviour is expected or not.

To reproduce

from ranx import Qrels, Run, evaluate

# Both relevant documents receive a score of zero
qrels_dict = {"q_1": {"d_0": 1, "d_1": 1}}
run_dict = {"q_1": {"d_0": 0, "d_1": 0}}

qrels = Qrels(qrels_dict)
run = Run(run_dict)

evaluate(qrels, run, "recall@2")
# 1.0

evaluate(qrels, run, "recall@0")
# 1.0

from ranx import Qrels, Run, evaluate

# Only d_0 has a non-zero score; the zero-scored d_1 is still counted as retrieved
qrels_dict = {"q_1": {"d_0": 1, "d_1": 1}}
run_dict = {"q_1": {"d_0": 1, "d_1": 0}}

qrels = Qrels(qrels_dict)
run = Run(run_dict)

evaluate(qrels, run, "recall@2")
# 1.0

evaluate(qrels, run, "recall@0")
# 1.0

Scikit-learn's recall_score returns different values on this kind of input.


from sklearn.metrics import recall_score

recall_score([1, 1], [0, 0])
# 0.0

recall_score([1, 1], [1, 0])
# 0.5
AmenRa commented 1 year ago

I can reproduce the issue. It is not intended. However, I think you should not have zero-scored documents in your run in the first place.

When I started implementing ranx, I assumed there would not be zero-scored documents in a run because those are usually not returned by retrieval systems (to the best of my knowledge).

I can fix this pretty easily, but the check could be somewhat costly if done at every evaluation, and it would be completely unnecessary in most cases. It could also break normalized runs, depending on the chosen normalization scheme.

I could add an optional flag to discard zero-scored documents when loading a run.

What do you think would be the correct way of handling this?
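
For concreteness, a minimal sketch of what such filtering could look like, written here as user-side preprocessing on the plain run dict (the helper name and the strict "score > 0" cut-off are illustrative, not an existing ranx option):

from ranx import Qrels, Run, evaluate

def drop_zero_scored(run_dict):
    # Hypothetical helper: remove zero-scored documents from each query's
    # results before building the Run, mirroring the proposed flag.
    return {q: {d: s for d, s in docs.items() if s > 0} for q, docs in run_dict.items()}

qrels = Qrels({"q_1": {"d_0": 1, "d_1": 1}})
run = Run(drop_zero_scored({"q_1": {"d_0": 1, "d_1": 0}}))

evaluate(qrels, run, "recall@2")
# 0.5 once the zero-scored d_1 is no longer counted as retrieved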

mponza commented 1 year ago

Thank you for the prompt response - appreciated.

I like the flag solution: it gives the user the flexibility to choose what to do in such "zero-scored" cases, and with the right default you would not break any use cases relying on the existing code.

Should this flag be a boolean, or would it be better to have a float like run_rel_lvl (similar to rel_lvl in the metrics) with which the user can decide the minimum relevance score to be considered for the runs? Do you think it's possible/reasonable to add such a parameter at the metric level too (in addition to the existing qrels, run, k, and rel_lvl)?
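
As a rough sketch of the idea (again as user-side preprocessing, since no such parameter exists in ranx today; the name run_rel_lvl and its semantics are only the proposal above):

def filter_run(run_dict, run_rel_lvl=None):
    # Hypothetical threshold modelled on the proposed run_rel_lvl parameter:
    # None keeps the current behaviour (no filtering); a number keeps only
    # documents whose score is strictly greater than the threshold.
    if run_rel_lvl is None:
        return run_dict
    return {
        q: {d: s for d, s in docs.items() if s > run_rel_lvl}
        for q, docs in run_dict.items()
    }

filter_run({"q_1": {"d_0": 0.8, "d_1": 0.0, "d_2": -0.3}}, run_rel_lvl=0.0)
# {"q_1": {"d_0": 0.8}}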

diegoceccarelli commented 1 year ago

How come it is costly? Are you sure it's not premature optimization? ;) My two cents: if a run contains all zeros, it would be nice to throw an exception or print a warning. Maybe the user has a bug in their code and is not aware it is producing all zeros.

mponza commented 1 year ago

It's also possible the user is aware their code can produce all zeroes for a given query (e.g., a method that classifies documents with 0/1 labels) and that such behaviour is intended.

Logging a good warning message can be an option, but maybe it's worth supporting the possibility of having empty runs? The existing library does not seem to allow it:


from ranx import Qrels, Run, evaluate

# No documents returned for the only query
qrels_dict = {"q_1": {"d_0": 1, "d_1": 1}}
run_dict = {"q_1": {}}

qrels = Qrels(qrels_dict)
run = Run(run_dict)

evaluate(qrels, run, "recall@2")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_2673/2663587829.py in <cell line: 8>()
      6 
      7 qrels = Qrels(qrels_dict)
----> 8 run = Run(run_dict)
      9 
     10 evaluate(qrels, run, "recall@2")

/layers/com.bloomberg.ds.buildpacks.pip/requirements/lib/python3.9/site-packages/ranx/data_structures/run.py in __init__(self, run, name)
     59             # Doc IDs
     60             doc_ids = [list(doc.keys()) for doc in run.values()]
---> 61             max_len = max(len(y) for x in doc_ids for y in x)
     62             dtype = f"<U{max_len}"
     63             doc_ids = TypedList([np.array(x, dtype=dtype) for x in doc_ids])

ValueError: max() arg is an empty sequence
AmenRa commented 1 year ago

Honestly, I feel we are entering corner cases and leaving ranking evaluation territory here.

The issue with zero-scored documents is very debatable. Neural rankers can output negative scores, for example. Scores are not used by ranking metrics themselves: they are just a proxy for the ordering. Raising a warning because of the scale used for the scores could be very annoying for people with use cases different from yours.
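
To illustrate, a quick sketch: two runs with very different score scales but the same ordering yield the same metric value (the scores here are arbitrary):

from ranx import Qrels, Run, evaluate

qrels = Qrels({"q_1": {"d_0": 1}})

# Same ordering (d_0 above d_1), very different score scales
run_a = Run({"q_1": {"d_0": 100.0, "d_1": 50.0}})
run_b = Run({"q_1": {"d_0": 0.9, "d_1": -0.3}})  # negative score, e.g. from a neural ranker

evaluate(qrels, run_a, "recall@1")
# 1.0
evaluate(qrels, run_b, "recall@1")
# 1.0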

Also, using a ranking evaluation library for evaluating classification methods does not sound correct. Scikit-learn's recall_score is a classification metric, not a ranking one: it has a pos_label parameter and, with the default parameter configuration, does not work if y_pred contains scores other than 0 and 1. Scikit-learn's recall_score is not meant for ranking evaluation, just as ranx is not meant for classification evaluation.
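
For comparison, a quick check of scikit-learn's behaviour (the exact error wording may vary across versions):

from sklearn.metrics import recall_score

# recall_score expects class labels: by default y_pred is read as predicted
# classes with pos_label=1, not as ranking scores.
recall_score([1, 1], [1, 0])
# 0.5

# Real-valued scores are rejected as a mix of binary and continuous targets
recall_score([1, 1], [0.9, 0.2])
# ValueError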

The last code snippet is another corner case. The library works correctly if at least one query has retrieved documents. However, I agree that an informative warning message would be better for the user.

Honestly, I am unsure what I should do. From my perspective, it is the users' responsibility to properly format their data for the library they are using. I can add functions to help you format your data. However, I disagree with automatically raising warnings if the score scale does not meet specific criteria: there is no single use case.

mponza commented 1 year ago

I think leaving users the responsibility of formatting their data and deciding which documents to keep or remove can be the way to go, assuming they have the possibility of specifying that no documents were returned for a given query without getting an exception (that is also what you meant in the last part of your message, right?).

For the case where a user's runs contain zero-scored documents, do you think it's worth documenting this behaviour explicitly in the README and motivating it with your points above, to avoid future misuse?

AmenRa commented 1 year ago

What I meant is that your last snippet is a bit extreme; that's why I did not bother adding an exception before. Empty results for individual queries are already allowed, but entirely empty runs aren't. For example, this works just fine:

run = Run({"q_1": {}, "q_2": {"d_0": 1}})
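
And evaluating such a run also works; a small sketch (the expected value assumes recall is macro-averaged over queries):

from ranx import Qrels, Run, evaluate

qrels = Qrels({"q_1": {"d_0": 1, "d_1": 1}, "q_2": {"d_0": 1}})
run = Run({"q_1": {}, "q_2": {"d_0": 1}})  # q_1 retrieved nothing

evaluate(qrels, run, "recall@2")
# 0.5 (0.0 for q_1, 1.0 for q_2)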

I will update the documentation to make it clearer what a Run is. Unfortunately, the README is already cluttered, but I'll see if I can make it fit. The Qrels and Run documentation is also outdated. Sorry about that!

mponza commented 12 months ago

Sounds good, looking forward to reading the updated documentation then! Maybe you can use the Wiki? Not sure.

Thank you for the insightful discussion and prompt support on this issue.

AmenRa commented 11 months ago

Hi, I updated the documentation and added a FAQ section. Let me know if you have any comments, and thanks again for bringing this up!