dkaterenchuk / ranking_measures

RankDCG: ranking/ordering evaluation measure
MIT License

Use RankDCG with prediction scores #3

Closed thisisjl closed 6 years ago

thisisjl commented 6 years ago

Hi @dkaterenchuk, thanks for sharing this code. I am new to recommender systems, and I would like to use RankDCG and the rest of the measures in the module to evaluate my recommender system.

At the moment, I am evaluating it by using this implementation of precision at k, and I am comparing the output I get with your implementation (find_precision_k). However, I get different results.

The function I linked is used to evaluate a model that predicts ratings. Its inputs are the actual ratings and the predicted ratings:

import numpy as np

def my_precision_at_k(truth, predictions, k=10):
    # indices of the k items with the highest predicted scores
    top_k = np.argsort(-predictions)[:k]
    # truth is a scipy sparse row: .indices gives the non-zero (rated) item indices
    labels = truth.indices
    precision = float(len(set(top_k) & set(labels))) / float(k)
    return precision
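
For reference, this is roughly how I call it (the numbers here are made up for illustration): truth is one row of a scipy sparse rating matrix and predictions is a dense array of scores for the same items:

import numpy as np
from scipy.sparse import csr_matrix

truth = csr_matrix([[5, 0, 3, 0, 4, 0]])                 # ratings; .indices -> [0, 2, 4]
predictions = np.array([0.9, 0.1, 0.4, 0.2, 0.8, 0.3])   # predicted scores for the same 6 items

print(my_precision_at_k(truth, predictions, k=3))        # top-3 by score is [0, 4, 2] -> 1.0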

I assumed that the inputs to your find_precision_k() are the ranked ratings and the ranked predicted values. However, when I do that, I do not get the same results as with the method above:

top_k = np.argsort(-prediction)[:k]
reference = np.argsort(-truth)
score = find_precision_k(reference, top_k, k=k)

Could you tell me what I am doing wrong? What should I input instead? Thank you

dkaterenchuk commented 6 years ago

Hi @thisisjl,

I like your use of numpy functions in my_precision_at_k(). It is quite clever. What is your input to the function? One thing to keep in mind is that precision is designed for binary relevance evaluation (it is sometimes used with multi-class problems, but it might not be the best option there). It looks like my_precision_at_k() evaluates the ordering of indices produced by np.argsort, while find_precision_k() works on class labels. This could explain the difference between the two functions. Take a look at the example below.

Ex: given a list a = [0, 0, 1, 1, 0, 1, 0, 1] and k = 3, the ordering from np.argsort(-a) is [2, 3, 5, 7, 0, 1, 4, 6] and np.argsort(-a)[:k] is [2, 3, 5]. If your algorithm recommends [2, 3, 7], which is still a valid prediction (item 7 is also relevant), the precision will be only 66.7%. This case extends to multi-class evaluation as well. This is why find_precision_k() uses class labels.
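
To put the example in code (the [2, 3, 7] recommendation is hypothetical, and the label-based score below illustrates the idea rather than the exact arithmetic of find_precision_k()):

import numpy as np

a = np.array([0, 0, 1, 1, 0, 1, 0, 1])
k = 3

truth_top_k = np.argsort(-a)[:k]   # [2, 3, 5]
recommended = [2, 3, 7]            # item 7 is also relevant, since a[7] == 1

# Index-based precision penalizes the perfectly valid pick of item 7:
index_precision = len(set(recommended) & set(truth_top_k)) / k   # 2/3, about 0.667

# Label-based precision credits every relevant item among the recommendations:
label_precision = sum(a[i] for i in recommended) / k             # 3/3 = 1.0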

I hope it helps!

thisisjl commented 6 years ago

Hi @dkaterenchuk, thank you very much for your reply.

Note: I made a mistake in my first comment; I did not link the source of the precision function properly. I am using this implementation of precision at k.

The input to my_precision_at_k() is:

As I understand from your reply, in order to use find_precision_k(), I need to process the variables truth and prediction as I did before:

reference = np.argsort(-truth)
hypothesis = np.argsort(-prediction)
score = find_precision_k(reference, hypothesis, k=k)

Does that make sense? Thanks!

dkaterenchuk commented 6 years ago

Hi @thisisjl,

You can do that, but a better way would be to use the class labels sorted in the expected order. Here is code that does it:

import numpy as np
# find_precision_k is the function from this repository

truth = np.array([1, 0, 1, 1, 0])
prediction = np.array([0, 1, 1, 1, 1])  # not very accurate predictions
k = 3  # any cutoff; k is not defined in this snippet, so pick one

true_order = np.argsort(-truth)  # item indices sorted by true relevance

reference = [truth[x] for x in true_order]
hypothesis = [prediction[x] for x in true_order]

score = find_precision_k(reference, hypothesis, k=k)
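
With these inputs, the intermediate values are:

print(true_order.tolist())           # [0, 2, 3, 1, 4]: items with truth == 1 come first
print([int(v) for v in reference])   # [1, 1, 1, 0, 0]: class labels in the ideal order
print([int(v) for v in hypothesis])  # [0, 1, 1, 1, 1]: predictions re-indexed the same way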

Let me know if it helps or if you have any follow-up questions!

thisisjl commented 6 years ago

Hi @dkaterenchuk, thanks for your answer and sorry for my late reply (I had to focus on something else).

I think that in my case, since I am working with scores, I will have to define the predicted class by comparing each predicted score to a threshold. In the case of a ranking, however, the predicted classes will all be 1 for the top items. Following your example, I would have, for instance:

reference = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
hypothesis = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Here 1 indicates that an item is relevant and 0 that it is not. However, find_rankdcg returns 0.0 with this input, so I am not sure this is the correct approach.
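
As a sketch of what I mean by thresholding (the scores and threshold here are made up):

import numpy as np

scores = np.array([0.91, 0.80, 0.77, 0.65, 0.40, 0.31, 0.22, 0.15, 0.10, 0.05])
threshold = 0.3

# Binary relevance from scores: 1 if the score clears the threshold, else 0
print((scores >= threshold).astype(int).tolist())   # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

# Whereas marking the top-k ranked items as relevant labels everything in the top k as 1
k = 10
top_k_classes = np.zeros_like(scores, dtype=int)
top_k_classes[np.argsort(-scores)[:k]] = 1
print(top_k_classes.tolist())                       # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]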

In this gist I have written an example of what I am doing for reference.

I would really appreciate it if you could give me feedback on it so I can use your metrics.

Thank you

dkaterenchuk commented 6 years ago

Hi @thisisjl,

Looking at your case, where the items are either relevant or not (0/1 classes), the F1 score would be a better measure. The reason you get 0 from RankDCG is that when your hypothesis assigns 1 to every item, the items' order remains the same; in other words, the ordering of your hypothesis is identical to the reference. You should use a measure that considers binary classes and false positives. Use RankDCG only when you have n items with k classes (where k > 2) and an algorithm needs to place the items (by giving each a score) in the correct order with respect to each other. One example of such a problem would be predicting how many connections the users of a social network have.
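
For the binary example above, an F1 computation could look like this (using scikit-learn here, which is just one possible choice and not part of this repository):

from sklearn.metrics import f1_score

reference  = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # true binary relevance
hypothesis = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   # everything in the top 10 predicted as relevant

# precision = 6/10, recall = 6/6, so F1 = 0.75; the 4 false positives are penalized
print(f1_score(reference, hypothesis))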

Hope it helps!

thisisjl commented 6 years ago

Thank you @dkaterenchuk :)