ODemidenko closed this issue 6 years ago.
So basically that's the same as using Pearson, but instead of centering with the row average (for row-row similarity) and with the column average (for column-column), we center with the column average for row-row similarity and with the row average for column-column?
A possible approach would be to add an "adjusted cosine" similarity measure.
Sure
So I would rather suggest incorporating the mean-adjustment into the prediction algorithm itself.
No, we should have the choice between the similarity measures, regardless of the choice of algorithm. If a user wants to use an "inappropriate" measure for whatever reason, she should be able to do it.
I am not sure we understood each other correctly. I agree that we need to provide the ability to combine different algorithms and similarity measures. What I am proposing is to provide an algorithm which computes mean-centered similarities and does so efficiently (computing the means only once, both for the similarity computations and for later use in predictions).
My proposal: introduce a "mean-centered" param for the similarity measures, and use this param only within the KNNWithMeans algo, by passing mean-centered values to the compute_similarities method. That way we compute the mean values only once and keep fitting the algorithm computationally efficient.
So basically that's the same as using Pearson, but instead of centering with the row average (for row-row similarity) and with the column average (for column-column), we center with the column average for row-row similarity and with the row average for column-column?
The answer is "yes".
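To make the distinction concrete, here is a minimal sketch on a toy dense ratings matrix. The function names and matrix are illustrative only (not Surprise's internals): Pearson centers each item's ratings by that item's own (column) mean, while adjusted cosine centers every rating by the rating user's (row) mean, i.e. the "orthogonal" mean discussed above.

```python
import numpy as np

# A toy dense ratings matrix (rows = users, columns = items).
R = np.array([
    [5.0, 3.0, 4.0],
    [4.0, 2.0, 5.0],
    [1.0, 5.0, 2.0],
])

def pearson_item_sim(R, i, j):
    """Pearson between items i and j: center each item's ratings by that
    item's own (column) mean before taking the cosine."""
    x = R[:, i] - R[:, i].mean()
    y = R[:, j] - R[:, j].mean()
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

def adjusted_cosine_item_sim(R, i, j):
    """Adjusted cosine between items i and j: center every rating by the
    rating *user's* (row) mean instead."""
    C = R - R.mean(axis=1, keepdims=True)
    x, y = C[:, i], C[:, j]
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(pearson_item_sim(R, 0, 2))          # centered by item means
print(adjusted_cosine_item_sim(R, 0, 2))  # centered by user means
```

On this toy matrix the two measures give clearly different values for the same item pair, which is the whole point of the distinction.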
Sorry, I thought about it some more, and I was wrong here:
My proposal: introduce a "mean-centered" param for the similarity measures, and use this param only within the KNNWithMeans algo, by passing mean-centered values to the compute_similarities method. That way we compute the mean values only once and keep fitting the algorithm computationally efficient.
Adjusted cosine uses different means from KNNWithMeans (they use orthogonal means, as you pointed out), so there is no need to pass data from the algo to the similarity measure, and they can be completely decoupled.
Yeah I agree.
Let's just implement the new similarity measure then. If some improvement can be made, there will always be room for it later.
Nicolas, there are a couple of other things with adjusted cosine:
Thus, I have a couple of proposals regarding the "adjusted cosine" implementation:
For the 1st bullet: actually, whole vectors (rather than only mutually rated items) can be used with the "usual" cosine as well. Example: "Recommender Systems: The Textbook", section 2.3.1.1, formula (2.6). So adjusted cosine is just a mean-centered adjustment over this version of cosine.
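The two cosine variants discussed here can be sketched side by side. This is an illustrative toy example, not Surprise code: the full-vector version treats missing ratings as zero (in the spirit of the textbook's formula (2.6)), while the common-ratings version restricts both vectors to the users who rated both items.

```python
import numpy as np

# Two item rating vectors over 5 users; np.nan marks "no rating".
a = np.array([5.0, np.nan, 3.0, 4.0, np.nan])
b = np.array([4.0, 2.0, np.nan, 5.0, 1.0])

def cosine_full(a, b):
    """Cosine over the whole vectors, treating missing ratings as zero."""
    a0, b0 = np.nan_to_num(a), np.nan_to_num(b)
    return a0.dot(b0) / (np.linalg.norm(a0) * np.linalg.norm(b0))

def cosine_common(a, b):
    """Cosine over commonly rated users only (the currently implemented
    behavior in the library, per this discussion)."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    ac, bc = a[mask], b[mask]
    return ac.dot(bc) / (np.linalg.norm(ac) * np.linalg.norm(bc))

print(cosine_full(a, b))
print(cosine_common(a, b))
```

Note how the full-vector value is pulled down by the zeros standing in for missing ratings, which is exactly the "arbitrary zero" concern raised below.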
BTW, I propose implementing all of this as separate scoring functions, rather than parameters:
Hi everyone and thank you for this useful project.
I have also encountered the same issues trying to implement item-item CF with adjusted cosine, using KNNBasic and passing the user-mean-centered ratings matrix as the train set instead of the original ratings matrix. The main blocker is that cosine similarity should use the complete vectors rather than only the elements of mutually rated items, as @ODemidenko says.
I see.
Basically, any similarity metric could use either the complete rating vectors or just the commonly rated items / common users, as currently implemented. I chose the latter because my intuition is that with the former we're comparing extremely sparse vectors, and choosing a value of zero for non-existing ratings is completely arbitrary.
That being said, the current version also has major flaws (basically, we're comparing similarities which do not have the same support, so that does not make sense either).
So if using the whole vectors is commonly done and efficient in practice, then there's no reason not to implement it. A good way to allow this new version would be to add a common_ratings_only key to the sim_options parameter, whose default would be True -- at least for now. Obviously, this should be available for all metrics.
However :) , this issue is about implementing adjusted cosine similarity. Using the whole vectors or only common ratings is an entirely different matter.
@NicolasHug As I myself need all kinds of cosines, I have already implemented all the alternatives, and would rather PR them all together.
But before committing, I need your agreement on the following design issue. I appreciate your preference for exposing common_ratings_only as a separate param, rather than providing a separate similarity function. Meanwhile, implementation-wise I don't see how to inject this param into the original cosine function without compromising code readability and performance. So I have implemented each alternative as a separate function, as this makes the code much more readable and run faster. There are two alternatives to handle the mismatch between the param-based approach we want to expose in the API and the actual separate functions:

1. Select the appropriate similarity function from sim_options based on the common_ratings_only value. This can be done within the compute_similarities function, which nicely decouples the user interface from the internal implementation. But such an approach raises another issue: the current doc is autogenerated from the similarities module docstrings, so the separate functions will spoil it.
2. Keep a single public cosine function with a common_ratings_only param and dispatch to the separate functions internally. Those functions won't have their own docstrings.

It seems that the 2nd approach is preferable, but I'm probably missing something. Could you confirm?
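A minimal sketch of the dispatch idea being discussed, with placeholder function bodies; all names and signatures here are illustrative, not Surprise's actual internals. A single param-based entry point routes on the common_ratings_only value, so the per-variant functions stay private (and docstring-free, in line with the 2nd approach).

```python
def _cosine_common_only(ratings):
    # placeholder for the existing implementation (commonly rated items only)
    return 'cosine/common-only'

def _cosine_full_vectors(ratings):
    # placeholder for the proposed full-vector implementation
    return 'cosine/full-vectors'

def compute_similarities(ratings, sim_options):
    """Param-based public interface; internal dispatch to private functions."""
    name = sim_options.get('name', 'msd')
    common_only = sim_options.get('common_ratings_only', True)
    if name == 'cosine':
        func = _cosine_common_only if common_only else _cosine_full_vectors
        return func(ratings)
    raise ValueError('unknown similarity name: ' + name)

# The user-facing side stays param-based, with common_ratings_only
# defaulting to True as proposed above:
print(compute_similarities(None, {'name': 'cosine'}))
print(compute_similarities(None, {'name': 'cosine',
                                  'common_ratings_only': False}))
```

The point of this shape is that the docs only ever describe the public function, while each variant keeps its own fast, readable implementation.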
Yes, there's no problem in creating different cosine functions (if it's really needed), as long as it's transparent for the user.
The second option is great.
But once again, to keep the commit history clean, the implementation of adjusted cosine and the common_ratings_only option should be done in completely separate PRs.
I added the common_ratings_only PR, and will add adjusted_cosine as soon as you accept the first one.
Regarding adding documentation: could you give me links to the exact notation used for the math formulas, and to a site where I can preview formulas to check that they render correctly? I briefly researched this and realized that there are too many alternatives, and I don't understand what the common ground between them is.
Hi,
@ODemidenko: I checked your implementation of cosine similarity with full vectors and it runs properly :) . However, my database is a bit large and it seems to be very slow (the other similarities from this project are also slow on my data). If you are interested, I'll just share what I did: I implemented a new cosine similarity using the cosine_similarity function from sklearn.metrics.pairwise, and it performs very well.
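The speedup described here can be sketched as follows. This is a guess at the approach, not the commenter's actual code: sklearn's cosine_similarity computes all item pairs in one vectorized call on the (sparse) ratings matrix, replacing the Python-level loop over pairs, with zeros standing in for missing ratings (i.e. the full-vector variant).

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# user x item ratings; zeros stand in for missing ratings.
R = csr_matrix(np.array([
    [5.0, 0.0, 3.0, 4.0],
    [4.0, 2.0, 0.0, 5.0],
    [0.0, 5.0, 2.0, 1.0],
]))

# Transpose so rows are items, then compare all item pairs at once.
item_sim = cosine_similarity(R.T)
print(item_sim.shape)
```

For adjusted cosine, one would mean-center the ratings per user before this call; note that centering destroys sparsity, so that step needs care on large data.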
@FrancescaCristo89 Thank you for the info!
I'll close this issue as it's been stagnating, and it actually concerns two different problems: supporting adjusted cosine, and computing similarities based on common ratings only (or not).
I've opened #163 and #164 as replacements (with references to this issue). Separating these two issues will hopefully make them clearer and easier to follow.
Nicolas
The current implementation of kNNWithMeans computes the similarity measure on the original ratings, and then uses the mean values to compute a prediction.
According to:
For the purpose of item-item similarity computation, we should use the "adjusted cosine" similarity: instead of taking the raw ratings, we subtract each user's average rating from their ratings before computing the item similarities.
How to address the problem: a possible approach would be to add an "adjusted cosine" similarity measure. But if we compute it independently of kNNWithMeans, we will compute the mean values twice, once for the similarity measure and once for the predictions, which seems computationally inefficient. So I would rather suggest incorporating the mean-adjustment into the prediction algorithm itself.
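The "compute the means only once" idea can be sketched like this. All names are illustrative, not Surprise's actual internals: fit() computes the user means a single time, uses the centered matrix for the adjusted-cosine item similarities, and predict() reuses those same means as the prediction baseline.

```python
import numpy as np

class ItemKNNAdjustedCosine:
    """Sketch: item-item kNN with adjusted cosine, user means computed once."""

    def fit(self, R):
        # R: dense user x item matrix with np.nan for missing ratings
        self.R = R
        self.user_means = np.nanmean(R, axis=1, keepdims=True)  # computed once
        C = np.nan_to_num(R - self.user_means)   # centered, missing -> 0
        norms = np.linalg.norm(C, axis=0)
        norms[norms == 0] = 1.0                  # guard items with no ratings
        self.sim = (C.T @ C) / np.outer(norms, norms)  # adjusted cosine
        return self

    def predict(self, u, i, k=2):
        mean_u = self.user_means[u, 0]           # reused, not recomputed
        rated = np.flatnonzero(~np.isnan(self.R[u]))
        rated = rated[rated != i]
        neighbors = rated[np.argsort(-self.sim[i, rated])][:k]
        w = self.sim[i, neighbors]
        if np.abs(w).sum() == 0:
            return float(mean_u)
        # weighted deviations from the user's mean
        # (negative similarities are not filtered, for brevity)
        dev = self.R[u, neighbors] - mean_u
        return float(mean_u + w @ dev / np.abs(w).sum())

R = np.array([
    [5.0, 3.0, 4.0],
    [4.0, 2.0, 5.0],
    [1.0, 5.0, 2.0],
])
model = ItemKNNAdjustedCosine().fit(R)
print(model.predict(0, 1))
```

This is only a dense-matrix sketch; a real implementation would work on the sparse rating structure, but it shows how a single pass over the means serves both the similarity computation and the prediction.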