AmenRa / ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
https://amenra.github.io/ranx
MIT License

[BUG] Precision calculation incorrect? #31

Closed kaleko closed 1 year ago

kaleko commented 1 year ago

Describe the bug

In the example below, I would expect run1 to have a precision of 1.0, and I would expect both run2 and run3 to have a precision of 0.75, since 3 out of 4 queries return a relevant document. Instead, run2 returns 0.5 and run3 returns 0.25. Either there is a bug in handling empty query results, or I have a naive misunderstanding of precision. Also, run2 and run3 are similar, just with different queries returning empty results. Please correct me if I'm wrong!

To Reproduce

Steps to reproduce the behavior:

from ranx import Qrels, Run, evaluate

qrels_dict = {
    "q_1": {"doc_a": 1},
    "q_2": {"doc_b": 1, "doc_c": 1, "doc_d": 1},
    "q_3": {"doc_e": 1},
    "q_4": {"doc_f": 1},
}

run_dict_1 = {
    "q_1": {"doc_a": 1.0},
    "q_2": {"doc_d": 1.0},
    "q_3": {"doc_e": 1.0},
    "q_4": {"doc_f": 1.0},
}

run_dict_2 = {
    "q_1": {"doc_a": 1.0},
    "q_2": {"doc_d": 1.0},
    "q_3": {},
    "q_4": {"doc_f": 1.0},
}

run_dict_3 = {
    "q_1": {"doc_a": 1.0},
    "q_2": {},
    "q_3": {"doc_e": 1.0},
    "q_4": {"doc_f": 1.0},
}

qrels = Qrels(qrels_dict)
run1 = Run(run_dict_1)
run2 = Run(run_dict_2)
run3 = Run(run_dict_3)

print(evaluate(qrels, run1, ["precision"]))
print(evaluate(qrels, run2, ["precision"]))
print(evaluate(qrels, run3, ["precision"]))

1.0
0.5
0.25
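As a sanity check, the expected values from the report (1.0 for run1, 0.75 for run2 and run3, treating an empty result list as a per-query precision of 0) can be verified by hand. A minimal sketch in plain Python; `mean_precision` is a hypothetical helper written for this check, not part of the ranx API:

```python
# Hand-check of the expected mean precision values, independent of ranx.
# Convention assumed here: a query with an empty result list scores 0.0.
def mean_precision(qrels, run):
    scores = []
    for q, retrieved in run.items():
        if not retrieved:
            scores.append(0.0)  # empty result list -> 0 by convention
            continue
        relevant = set(qrels[q])
        hits = sum(doc in relevant for doc in retrieved)
        scores.append(hits / len(retrieved))
    return sum(scores) / len(scores)

qrels_dict = {
    "q_1": {"doc_a": 1},
    "q_2": {"doc_b": 1, "doc_c": 1, "doc_d": 1},
    "q_3": {"doc_e": 1},
    "q_4": {"doc_f": 1},
}
run_dict_1 = {"q_1": {"doc_a": 1.0}, "q_2": {"doc_d": 1.0},
              "q_3": {"doc_e": 1.0}, "q_4": {"doc_f": 1.0}}
run_dict_2 = {"q_1": {"doc_a": 1.0}, "q_2": {"doc_d": 1.0},
              "q_3": {}, "q_4": {"doc_f": 1.0}}
run_dict_3 = {"q_1": {"doc_a": 1.0}, "q_2": {},
              "q_3": {"doc_e": 1.0}, "q_4": {"doc_f": 1.0}}

print(mean_precision(qrels_dict, run_dict_1))  # 1.0
print(mean_precision(qrels_dict, run_dict_2))  # 0.75
print(mean_precision(qrels_dict, run_dict_3))  # 0.75
```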

kaleko commented 1 year ago

It seems that if there is an empty query result in the run_dict, every query after it will always have a precision of 0.

AmenRa commented 1 year ago

Hi @kaleko,

Thank you very much for the bug report and for providing a working example! numba was not raising a ZeroDivisionError, so I did not spot this issue before. I fixed it in v.0.3.4. Now it works as intended.

Please, consider giving ranx a star if you like it!

kaleko commented 1 year ago

@AmenRa I now see that in the above example, the outputs are: run1 → precision 1.0, run2 → precision 0.75, run3 → precision 0.75.

It's good to see runs 2 and 3 have the same precision, the result of fixing your ZeroDivisionError issue.

However, I question whether the actual precision calculation is correct. According to this comment, precision is the "proportion of retrieved documents that are relevant": https://github.com/AmenRa/ranx/blob/e21eb0879cdd881958915d5e27c839759f9d5801/ranx/metrics/precision.py#L40

In all three runs above, every document which was retrieved was relevant. Shouldn't the precision be 1.0 for all runs?

AmenRa commented 1 year ago

Usually, a system does not return documents whose relevance score is zero. That's why you could end up with empty result lists, as in your example. However, this is probably "a convention" because 1) you cannot meaningfully order the documents if they all have the same relevance score (so the system's output would be kind of random), and 2) if the system returns the entire collection every time it is queried, it will have severe efficiency issues.

Moreover, if you cast Information Retrieval as a binary classification problem, the returned documents are the data points the model judged positive, and the non-returned ones are the negatives. If no document is returned for a query, the model judged every document as negative (non-relevant to the query).
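Under that framing, precision is TP / (TP + FP), and a query with no returned documents yields 0/0. A toy sketch (the `precision` helper is hypothetical, written only to illustrate why a convention is needed):

```python
# Precision as TP / (TP + FP) in the binary-classification framing.
# With zero predicted positives (no documents returned), the ratio
# is 0/0: undefined, so some convention has to decide the score.
def precision(tp, fp):
    predicted_positives = tp + fp
    if predicted_positives == 0:
        return None  # undefined; a library must pick a convention
    return tp / predicted_positives

print(precision(3, 1))  # 0.75: 3 of 4 returned documents are relevant
print(precision(0, 0))  # None: no documents returned at all
```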

I think returning no documents for one or more queries is a corner case. Taken to the extreme, a system that never returns any documents would, following the last line of your comment, have an average Precision of 1.0, which does not sound right to me.

Makes sense / do we agree?
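The "extreme" case above can be made concrete by scoring a system that never returns anything under both conventions. A hedged sketch, not ranx code; the `empty_value` parameter of this hypothetical helper selects how an empty result list is scored:

```python
# Score a run that returns nothing for any query under two conventions
# for empty result lists: count each as 1.0, or count each as 0.0.
def mean_precision(qrels, run, empty_value):
    scores = []
    for q, retrieved in run.items():
        if not retrieved:
            scores.append(empty_value)
            continue
        relevant = set(qrels[q])
        hits = sum(doc in relevant for doc in retrieved)
        scores.append(hits / len(retrieved))
    return sum(scores) / len(scores)

qrels = {"q_1": {"doc_a": 1}, "q_2": {"doc_b": 1}}
lazy_run = {q: {} for q in qrels}  # never retrieves a single document

print(mean_precision(qrels, lazy_run, empty_value=1.0))  # 1.0, "perfect"
print(mean_precision(qrels, lazy_run, empty_value=0.0))  # 0.0
```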

kaleko commented 1 year ago

I guess I agree. It sounds like a convention.

For example, if I google "awefoihawoefihawoefihw" and zero results come back, did my query have 100% precision or 0% precision? I would argue 100%, but I can see both sides.

Thanks for the clarification.

AmenRa commented 1 year ago

If you find a theoretically sound explanation, please post it here.