NicolasHug / Surprise

A Python scikit for building and analyzing recommender systems
http://surpriselib.com
BSD 3-Clause "New" or "Revised" License

Train test split per user ratings #335

Closed psymeonidis closed 4 years ago

psymeonidis commented 4 years ago

Dear Nicolas Hug,

First of all, I want to tell you that your library is interesting.

However, I just checked your library and it seems that the train/test split is not done at the level of the user. Do you support a train/test split that goes through every user's profile and hides a percentage of their ratings? For example, for a 4-fold cross-validation evaluation protocol, we would go to every user, hide 25% of their ratings, and try to predict the hidden ones using the remaining 75% as training data. The protocol should run four times, so that each time a different random subset of item ratings is placed in the train and test sets (75% and 25%) per user.
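
To illustrate the protocol I have in mind, here is a rough sketch. per_user_split is just an illustrative name, not an existing function, and it assumes that data.raw_ratings and the construct_trainset/construct_testset helpers behave as in your FAQ examples:

import random
from collections import defaultdict

from surprise import Dataset, SVD


def per_user_split(raw_ratings, test_fraction=0.25, seed=0):
    '''Hide test_fraction of each user's ratings for testing (illustrative only).'''
    rng = random.Random(seed)

    # Group the raw (user, item, rating, ...) tuples by user id.
    by_user = defaultdict(list)
    for rating in raw_ratings:
        by_user[rating[0]].append(rating)

    train, test = [], []
    for ratings in by_user.values():
        rng.shuffle(ratings)
        n_test = int(len(ratings) * test_fraction)
        test.extend(ratings[:n_test])      # hidden ratings of this user
        train.extend(ratings[n_test:])     # remaining ratings used for training
    return train, test


data = Dataset.load_builtin('ml-100k')
train_raw, test_raw = per_user_split(data.raw_ratings, test_fraction=0.25)

trainset = data.construct_trainset(train_raw)
testset = data.construct_testset(test_raw)

algo = SVD()
algo.fit(trainset)
predictions = algo.test(testset)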

A second comment: I tried to verify the results of MMLite and Librec against your library for the item recommendation task, and I failed. With your library, using the SVD method on ML-100k with 5-fold cross-validation, I get a precision of 92% for the top-5 recommended items. This is impossible! Please note that wRMF and the BPR algorithm attain around 40% precision in the aforementioned libraries for the item recommendation task (also called item ranking), as can be seen at the following link:

https://www.librec.net/release/v1.3/example.html

I have already verified those results with another Python-based implementation of my own. What is different in your case? Why do we get such huge precision values?

For the computation of precision, I used the following code from your library:

from collections import defaultdict
from surprise import Dataset
from surprise import SVD
from surprise.model_selection import KFold

def precision_recall_at_k(predictions, k=10, threshold=3.5):
    '''Return precision and recall at k metrics for each user.'''

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1

        # Recall@K: Proportion of relevant items that are recommended
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1

    return precisions, recalls

data = Dataset.load_builtin('ml-100k')
kf = KFold(n_splits=5)
algo = SVD()

for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

    # Precision and recall can then be averaged over all users
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print(sum(rec for rec in recalls.values()) / len(recalls))

best,

Dr. Panagiotis Symeonidis

NicolasHug commented 4 years ago

The train/test split is done on a rating basis, irrespective of users or items. In expectation, that still means that x% of each user's ratings are hidden. This is the standard evaluation procedure as far as I know.
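
As a concrete illustration, a minimal sketch using the train_test_split helper from surprise.model_selection (one possible way among others):

from surprise import Dataset, SVD, accuracy
from surprise.model_selection import train_test_split

data = Dataset.load_builtin('ml-100k')

# 25% of the ratings, taken across all users, are held out for testing.
trainset, testset = train_test_split(data, test_size=0.25, random_state=0)

algo = SVD()
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)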

For the rest, sorry, I don't know how other libraries compute precision and recall. They might differ.

psymeonidis commented 4 years ago

Sorry, but I have been working on recommender systems for 15 years now, and there is no paper reporting 92% precision on the MovieLens 100K dataset for top-5 recommended items with 5-fold cross-validation. So I would really like to help improve your metric. I searched the internet a bit and found the following article, "Recall and Precision at k for Recommender Systems":

https://medium.com/@m_n_malaeb/recall-and-precision-at-k-for-recommender-systems-618483226c54

written by Maher Malaeb. The article explains that whenever the system recommends an item that has not been rated by the user, that item is skipped and does not increase the denominator of the precision metric. It says: "As a start we will ignore all the ratings where the actual value is not known. Values with no known true rating cannot be used."

I strongly believe that this is the reason the system produces such high precision (92%). The system recommends items the users have not rated at all, and instead of counting them as wrong recommendations, it simply skips them.

However, if the user has not interacted with an item and the system recommends that item to him, we cannot exclude it from the precision metric just by saying that we do not know whether the user likes it or not.

For example, suppose Algorithm 1, for the top-10 recommended items, recommends 9 items that the user has not rated at all and only 1 item that the user has rated positively (i.e., above the threshold); it then has 100% precision. Another algorithm, Algorithm 2, recommends 10 items that the user has rated, 6 of which were rated positively, so Algorithm 2 attains 60% precision. This is not fair! Algorithm 1 had 9 false positives and we do not count them.

The same penalty should also be applied to algorithms that provide no recommendations at all, or fewer recommended items than requested. However, the aforementioned article also writes: "If there are no items recommended. i.e. number of recommended items at k is zero, we cannot compute precision at k since we cannot divide by zero. In that case we set precision at k to 1. This makes sense because in that case we do not have any recommended item that is not relevant." This is also not fair when comparing two algorithms, one of which can provide recommendations while the other cannot recommend all of the items that are requested.
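
To make the contrast concrete, here is a small toy calculation of the Algorithm 1 scenario above. The item names are made up, and it only illustrates the two conventions, not your exact code path:

# Toy Algorithm 1 scenario: 10 recommended items, 1 rated positively, 9 unrated.
RELEVANT = {'i1'}                      # test-set items rated >= threshold
RATED = {'i1'}                         # test-set items with any known rating
TOP_K = ['i1', 'i2', 'i3', 'i4', 'i5',
         'i6', 'i7', 'i8', 'i9', 'i10']

hits = sum(item in RELEVANT for item in TOP_K)

# Convention A (skip unknown): only recommendations with a known rating count.
known = sum(item in RATED for item in TOP_K)
precision_skip_unknown = hits / known if known else 1.0    # 1/1  = 100%

# Convention B (count as miss): every recommended item counts.
precision_count_miss = hits / len(TOP_K)                   # 1/10 = 10%

print(precision_skip_unknown, precision_count_miss)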

What do you think?

NicolasHug commented 4 years ago

I'm not sure I follow. If we don't know the true rating of a user-item pair, we cannot use this information to compute the precision or the recall; see the definition used here: https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-compute-precision-k-and-recall-k

How can you compute the set of relevant items if you don't know the true ratings?

If you look at the code, you will notice that, by convention, users that have no recommended items are assigned a precision of 1. That's the reason it's so high. It's a convention I chose, but it might not be standard.
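
If you prefer the other convention, only one line of the snippet needs to change, e.g. something like:

# Possible variant of the FAQ snippet: users with no recommended items
# (n_rec_k == 0) contribute a precision of 0 instead of 1.
precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0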


In any case, I strongly advocate for people using tools that they understand. This is the reason I try to document Surprise as much as possible. Copy/pasting code blindly isn't a reasonable option. The code for computing precision is literally a few lines, and I strongly encourage anyone using it to actually read what it does.

psymeonidis commented 4 years ago

Sorry, but you have confused the rating prediction task (RMSE, MAE, etc.) with the item recommendation task (precision, nDCG, etc.). These are two completely different problems. And by the way, trying to solve the second problem by solving the first one in advance is not a good choice: ranking items by their highest predicted ratings to produce recommendations is trivial and has been shown over the past decade to be very ineffective. That is why methods focused purely on ranking (e.g., Bayesian Personalised Ranking, BPR) have been considered the state of the art for the item recommendation task in recent years.

In the rating prediction task you need the true rating value in order to compute the difference with the predicted value (otherwise you skip this item from the evaluation). In the item recommendation task, you do not need the user to have actually rated the item: if you recommend an item to a user and you see in the test set that the user has not rated it, then this recommendation is wrong. In your case, as I understand it, you prefer to skip this item from the evaluation. This is not fair.

Please note that if there is a real rating by the user in the test set for the recommended item, then this rating should of course be above a user-defined threshold to be counted as a hit. This is something your library does consider.

NicolasHug commented 4 years ago

Sorry, but you have confused the rating prediction task (RMSE, MAE, etc.) with the item recommendation task (precision, nDCG, etc.).

No, I'm not confusing anything.

These are two completely different problems. And by the way, trying to solve the second problem by solving the first one in advance is not a good choice: ranking items by their highest predicted ratings to produce recommendations is trivial and has been shown over the past decade to be very ineffective.

You are 100% correct. This is precisely why the only built-in metrics supported by Surprise are rating prediction metrics like RMSE, MAE, etc. The precision/recall computation is only a short snippet in the FAQ that serves as a simple example. This example, which you have been using, is carefully documented, and it is clear from there that we are dealing with explicit ratings and that we use rating prediction as an intermediate step to compute precision@k (whose definition is, again, given in the example). I'd be more than happy to consider a PR of yours adding e.g. "Please note that in practice, computing rank-based metrics via explicit rating prediction is not recommended."

Surprise is a rating prediction library, as is clear to anyone who has spent more than 10 seconds reading the docs. It doesn't pretend to be anything else.


About the definition of precision: the general definition of precision is simply the percentage of recommended items that truly turn out to be relevant (quoting the Recommender Systems Textbook by Aggarwal). From there, one has to decide what counts as a "recommended" item, what counts as a "relevant" item (typically via a rating threshold), and how to handle edge cases such as users with no recommendations.

The choices made in the Surprise snippet are explicitly detailed.

I hope this clears up the confusion.

Dr Hug.

NicolasHug commented 4 years ago

I believe the concerns have been addressed, closing the issue.