AstraZeneca / rexmex

A general purpose recommender metrics library for fair evaluation.
https://rexmex.readthedocs.io/

Fix the division in the hits@k metric to use the size of the relevant_items #60

Closed sbonner0 closed 1 year ago

sbonner0 commented 1 year ago

This PR fixes the division in the hits@k metric to use the length of the relevant_items rather than the number of recommendations. I believe this metric should capture what fraction of the relevant_items are in the first k elements of the recommendation.

Given the example shown below, the current code would return ~0.43 (3/7), but I believe it should be 1, as all of the relevant items appear within the first 7 recommendations.


from rexmex.metrics.ranking import hits_at_k

relevant_items = [1, 2, 3]
recommendation = [2, 5, 4, 1, 3, 0, 6]
hits_at_k(relevant_items, recommendation, k=7)

After the change in this PR, the correct result is returned.
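
For context, here is a minimal standalone sketch of the divisor swap (an illustration only, not the library's actual implementation):

def hits_at_k_sketch(relevant_items, recommendation, k=10, divide_by_relevant=True):
    # Count the relevant items that appear in the top-k recommendations.
    hits = len(set(relevant_items) & set(recommendation[:k]))
    # Old behaviour (roughly): divide by the number of recommendations considered.
    # This PR: divide by the number of relevant items.
    denominator = len(relevant_items) if divide_by_relevant else len(recommendation[:k])
    return hits / denominator

relevant_items = [1, 2, 3]
recommendation = [2, 5, 4, 1, 3, 0, 6]
print(hits_at_k_sketch(relevant_items, recommendation, k=7, divide_by_relevant=False))  # 3/7 ~= 0.43 (old)
print(hits_at_k_sketch(relevant_items, recommendation, k=7))                            # 3/3 = 1.0 (this PR)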

kajocina commented 1 year ago

Hey, I (hopefully) fixed the CI pipeline. Could you pull main, rebase your branch onto it, and push again? Then your PR should pass the build check.

sbonner0 commented 1 year ago

Hey @kajocina - Thanks! I have done this but now I am failing one of the unit tests for hits@k.

This has surfaced an issue with using the length of the relevant_items for the division: if k < len(relevant_items), the metric can never reach 1, so I am not sure it makes much sense either.

Using the example from the unit tests:


import numpy as np
from rexmex.metrics.ranking import hits_at_k

relevant_items = np.array([1, 2, 3, 4, 5, 6])
recommendation = np.array([1, 7, 3, 4, 5, 2, 10, 8, 9, 6])
hits_at_k(relevant_items, recommendation, k=1)

Using the length of the relevant_items for the division, the score here would be 1/6 (~0.17), since only one of the relevant items can ever occupy the first position in the list of recommendations. Not sure what to do about this...
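
To make that cap concrete, here is a small standalone computation (it mirrors the divisor this PR proposes but does not call the library code):

import numpy as np

relevant_items = np.array([1, 2, 3, 4, 5, 6])
recommendation = np.array([1, 7, 3, 4, 5, 2, 10, 8, 9, 6])  # a relevant item already sits in position 1

k = 1
hits = np.isin(recommendation[:k], relevant_items).sum()
print(hits / len(relevant_items))  # 1/6 ~= 0.17 -- the score cannot reach 1 while k < len(relevant_items)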

kajocina commented 1 year ago

I'll patch the test to the right values; I think the test was only passing with the previous divisor :) looks like a bug

sbonner0 commented 1 year ago

Thanks @kajocina - still, even if we patch the values I am not sure what the right answer is when k < len(relevant_items).

Even if k=1 and a relevant item is first in the list of recommendations... the returned value would be 1/len(relevant_items). I am not sure if this is correct or not. Maybe we should add a note to the docstring that the metric is a bit ill-defined when k < len(relevant_items).
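
One possible wording for that note, purely as a sketch (the signature mirrors how hits_at_k is called in this thread; the text is a suggestion, not the final docstring):

def hits_at_k(relevant_items, recommendation, k):
    """
    Calculate the fraction of relevant items that appear in the top-k recommendations.

    Note:
        The score is divided by len(relevant_items), so when k < len(relevant_items)
        the maximum attainable value is k / len(relevant_items) rather than 1.
        Choose k >= len(relevant_items) if a perfect recommendation should score 1.
    """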

kajocina commented 1 year ago

I think this is expected behaviour for this metric; see, for example: https://docs.ampligraph.org/en/2.0.0/generated/ampligraph.evaluation.hits_at_n_score.html. In their case they have a set of four positive triples/items and their rankings:

import numpy as np
from ampligraph.evaluation import hits_at_n_score

rankings = np.array([1, 12, 6, 2])
hits_at_n_score(rankings, n=3)
# 0.5

And they correctly get 0.5 for n (i.e. k) = 3, because out of the 4 positive triples only two made it into the top 3 ranks (the triples ranked 1 and 2, obviously).
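
For reference, that rank-based flavour of hits@n boils down to the fraction of positives ranked within the top n; a minimal reproduction of the numbers above (not AmpliGraph's actual code):

import numpy as np

rankings = np.array([1, 12, 6, 2])  # rank of each positive triple among the candidates
n = 3
print(np.mean(rankings <= n))  # 0.5 -- two of the four positives rank within the top 3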

The question then is generally whether k=1 is a reasonable choice for this metric. We could add info about this as you suggest :)

You can rebase on main again and repeat; hopefully the tests pass this time. I can add the docstring fix :)

kajocina commented 1 year ago

Thx, LGTM ;)

sbonner0 commented 1 year ago

Thanks @kajocina !