Fix `ranking_metrics_at_k()`

benfred / implicit

Fast Python Collaborative Filtering for Implicit Feedback Datasets

https://benfred.github.io/implicit/

MIT License

3.57k stars, 612 forks

Fix `ranking_metrics_at_k()` #591

Closed: ita9naiwa closed this 1 year ago

ita9naiwa commented 2 years ago

This PR resolves a few issues:

412: "precision" in ranking_metrics_at_k is actually "recall"

I guess it's fine to update precision and recall, since this library had a major breaking update (0.5.0).
545: ranking_metric_at_k raises ValueError if K > num_items

This PR also adds MRR and Precision as new metrics.

ita9naiwa commented 2 years ago

Hi @benfred, can you check and review this PR? It resolves inaccurate NDCG and MRR values from the ranking_metrics_at_k function.

ita9naiwa commented 2 years ago

Changes:

The Precision and Recall metrics have been switched.
Fixed the MAP metric so that it follows the corrected precision.
Added MRR, since it is one of the most widely used metrics in the RS community (e.g., RecSys Challenge 2022).

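from implicit.als import AlternatingLeastSquares
from implicit.evaluation import ranking_metrics_at_k, train_test_split

# `ratings` is assumed to be a user-item sparse matrix loaded beforehand
tr, te = train_test_split(ratings, random_state=1541)
model = AlternatingLeastSquares(random_state=1541, factors=30, iterations=10)
model.fit(tr)
ranking_metrics_at_k(model, tr, te, K=100)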

As is:

{'precision': 0.3349958296821056,
 'map': 0.12534890653797998,
 'ndcg': 0.2686550155007732,
 'auc': 0.6093577862786992}

To be:

{'precision': 0.07930221607727832,
 'recall': 0.3349958296820349,
 'map': 0.06293165699220135,
 'ndcg': 0.2686550155007732,
 'auc': 0.6093577862785867,
 'mrr': 0.5348017396151994}

I guess that the definition of MAP should follow precision.
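For readers unfamiliar with MRR, the following is a minimal sketch of the usual definition: for each user, take the reciprocal of the rank of the first relevant item in the top K (0 if there is none), then average over users. The helper names `reciprocal_rank` and `mean_reciprocal_rank` and the toy item ids are hypothetical; this is not the code in this PR.

```python
def reciprocal_rank(recommended, liked, k):
    """1 / rank of the first liked item in the top-k list, or 0.0 if none is found."""
    for rank, item in enumerate(recommended[:k], start=1):
        if item in liked:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(all_recommended, all_liked, k):
    """Average the per-user reciprocal ranks (illustrative helper, not implicit's API)."""
    scores = [
        reciprocal_rank(recommended, liked, k)
        for recommended, liked in zip(all_recommended, all_liked)
    ]
    return sum(scores) / len(scores)


# Toy data: three users, three recommendations each.
all_recommended = [[10, 3, 7], [5, 1, 2], [8, 9, 4]]
all_liked = [{3}, {5, 2}, {6}]

# User 1: first hit at rank 2 -> 0.5; user 2: rank 1 -> 1.0; user 3: no hit -> 0.0
mrr = mean_reciprocal_rank(all_recommended, all_liked, k=3)
print(mrr)  # 0.5
```

Because only the first hit counts, MRR can be high even when precision is low, which is consistent with the values reported above.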

thomasjungblut commented 2 years ago

@benfred any ETA on getting this in and released? I was debugging a model yesterday that had weird evaluation results and came to the same conclusion as @ita9naiwa.

malonsocortes commented 2 years ago

Hi @ita9naiwa. I was checking the code for your fix to ranking_metrics_at_k and I'm not sure about the way you define the denominator of Precision. You're using the number of the user's liked items in the test set, but shouldn't it be K, the number of recommended items? K covers True Positives + False Positives, which is what I have normally seen in the definitions of precision I have read. Correct me if I'm wrong, I'd appreciate your opinion on the issue. Thanks!
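To make the question concrete, here is a minimal sketch of the textbook precision@K, where the denominator is K (true positives plus false positives among the recommendations). The helper `precision_at_k` and the toy item ids are hypothetical and are not taken from the implicit code base.

```python
def precision_at_k(recommended, liked, k):
    """Fraction of the top-k recommended items that the user liked (hits / k)."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in liked)
    return hits / k


# Toy data: five recommendations, three liked items in the test set.
recommended = [10, 3, 7, 42, 5]
liked = {3, 5, 99}

# Items 3 and 5 are hits, so 2 hits / 5 recommended items.
p_at_5 = precision_at_k(recommended, liked, k=5)
print(p_at_5)  # 0.4
```

Dividing the same 2 hits by the 3 liked items instead would give 2/3, which measures how many of the liked items were found rather than how many recommendations were correct.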

Blo0dR0gue commented 1 year ago

And the divisor for Recall is also wrong. It should always be likes.size(), not K when K is smaller. Dividing by K in that case only inflates the score and does not return the true recall value. Or am I wrong?
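A small sketch of the difference being described, assuming the capped variant divides by min(K, number of liked items). Both helpers (`recall_at_k`, `capped_recall_at_k`) and the toy data are hypothetical illustrations, not the library's implementation.

```python
def count_hits(recommended, liked, k):
    """Number of liked items that appear in the top-k recommendations."""
    return len(set(recommended[:k]) & liked)


def recall_at_k(recommended, liked, k):
    """Standard recall@K: hits divided by the total number of liked items."""
    return count_hits(recommended, liked, k) / len(liked)


def capped_recall_at_k(recommended, liked, k):
    """Variant under discussion: the divisor is capped at k when k is smaller."""
    return count_hits(recommended, liked, k) / min(k, len(liked))


# Toy data: the user liked four items, but only two are recommended (k=2).
recommended = [3, 10, 7]
liked = {3, 5, 99, 42}

true_recall = recall_at_k(recommended, liked, k=2)           # 1 hit / 4 liked
capped_recall = capped_recall_at_k(recommended, liked, k=2)  # 1 hit / min(2, 4)
print(true_recall, capped_recall)  # 0.25 0.5
```

With the cap, a perfect top-K list scores 1.0 even when K is smaller than the number of liked items, which is one reason some implementations use it; the uncapped version reports the share of all liked items actually retrieved.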
