AmenRa / ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
https://amenra.github.io/ranx
MIT License

Problems with MAP #22

Closed Perenz closed 1 year ago

Perenz commented 2 years ago

As I understand it, relevance judgments with a score of 0 are ignored when evaluating MAP@k. In my case, though, I'm seeing some odd behaviour.

I'm working on a balanced dataset with binary relevance, and I define the qrels to include both relevant (1) and non-relevant (0) documents. While NDCG@10 is about 0.7, MAP@10 is extremely low, at about 0.10.

Could this be because the model performs poorly beyond the very first documents, or am I doing something wrong in the evaluation?

qrels = Qrels.from_df(
    df=test_loaded_pdf,
    q_id_col="user_id",
    doc_id_col="run_session_id",
    score_col="target_binary",
)

run = Run.from_df(
    df=test_loaded_pdf,
    q_id_col="user_id",
    doc_id_col="run_session_id",
    score_col="predictions",
)

evaluate(qrels, run, ["map@10", "mrr", "ndcg@10"])

Note that the predictions column in test_loaded_pdf is not a binary relevance label; it is a float relevance score.

AmenRa commented 2 years ago

Hi Stefano,

Almost all the metrics ignore qrels with zero scores, including NDCG. So the difference you get is not because of that.

However, I think your results are entirely possible if you have many relevance judgments for each query. Note that the Average Precision denominator is equal to the total number of relevant documents for the query, regardless of the cut-off. Conversely, the NDCG@k normalizer (the ideal DCG) is computed over the top k positions only, so it is bounded by the cut-off and does not keep growing with the total number of relevant documents.
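To make the contrast concrete, here is a sketch of the usual definitions for binary relevance. The symbols are assumptions introduced for illustration: R is the total number of relevant documents for the query, rel_i is 1 if the document at rank i is relevant and 0 otherwise, and P(i) is the precision at rank i. These appear consistent with the numbers in the toy example, but they are not copied from the ranx source.

```latex
\mathrm{AP@}k = \frac{1}{R} \sum_{i=1}^{k} P(i)\,\mathrm{rel}_i
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}

\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i+1)}
\qquad
\mathrm{IDCG@}k = \sum_{i=1}^{\min(k,\,R)} \frac{1}{\log_2(i+1)}
```

With a large R and a small k, the sum in AP@k has at most k nonzero terms but is still divided by R, so AP@k cannot exceed k/R. IDCG@k, by contrast, stops growing once R reaches k.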

Here is a toy example:

from ranx import Qrels, Run, evaluate

# 100 relevant docs
qrels = Qrels({
    "q1": {f"d{i}": 1 for i in range(100)}
})

# Only one relevant doc (d1) is returned, and it is ranked first
run = Run({
    "q1": {"d1": 1000, **{f"dd{i}": i for i in range(99)}}
})

evaluate(qrels, run, ["map", "map@10", "ndcg@10", "ndcg@100"])

>>>
{
    "map": 0.01,
    "map@10": 0.01,
    "ndcg@10": 0.22009176629808017,
    "ndcg@100": 0.047758523260819974,
}
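The toy numbers can also be reproduced by hand. The following is a minimal pure-Python sketch, assuming binary relevance, a linear gain, and a log2 discount; the helper names `average_precision_at_k` and `ndcg_at_k` are hypothetical and are not part of the ranx API.

```python
import math


def average_precision_at_k(ranked_ids, relevant_ids, k=None):
    """AP over the top k results, divided by the TOTAL number of relevant docs."""
    top = ranked_ids if k is None else ranked_ids[:k]
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(top, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k for binary relevance; the ideal ranking is truncated at k."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Same setup as the toy example: 100 relevant docs, only d1 is retrieved.
relevant = {f"d{i}" for i in range(100)}
# Ranking by descending score: d1 (1000), then dd98, dd97, ..., dd0.
ranked = ["d1"] + [f"dd{i}" for i in reversed(range(99))]

results = {
    "map": average_precision_at_k(ranked, relevant),
    "map@10": average_precision_at_k(ranked, relevant, k=10),
    "ndcg@10": ndcg_at_k(ranked, relevant, k=10),
    "ndcg@100": ndcg_at_k(ranked, relevant, k=100),
}
print(results)
```

Under these assumptions the values should line up with the output above: a perfect first hit contributes 1/100 to AP regardless of the cut-off, while NDCG@10 only compares against an ideal top 10.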

In my experience, MAP is commonly used with cut-offs larger than 10 (usually 100) or with no cut-off at all.

Best,

Elias

AmenRa commented 1 year ago

Closing for inactivity. Feel free to re-open if needed.