ODemidenko closed this issue 6 years ago.
So basically that's the same as using Pearson, but instead of centering with the row average (for row-row similarity) and with the column average (for column-column), we center with the column average for row-row similarity and with the row average for column-column?
A possible approach would be to add an "adjusted cosine" similarity measure.
Sure
So I would rather suggest incorporating the mean-adjustment into the prediction algorithm itself.
No, we should have the choice between the similarity measures, regardless of the choice of algorithm. If a user wants to use an "inappropriate" measure for whatever reason, she should be able to do it.
I am not sure we understood each other correctly. I agree that we need to provide the ability to combine different algorithms and similarity measures. What I am proposing is to provide an algorithm which computes mean-centered similarities and does so efficiently (computing the means only once, both for the similarity computations and for later use in predictions).
My proposal: introduce a "mean-centered" param for the similarity measures, and use this param only within the KNNWithMeans algo, by passing mean-centered values to the compute_similarities method. That way we compute the mean values only once and keep fitting the algorithm computationally efficient.
So basically that's the same as using Pearson, but instead of centering with the row average (for row-row similarity) and with the column average (for column-column), we center with the column average for row-row similarity and with the row average for column-column?
The answer is "yes".
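To make the distinction concrete, here is a minimal sketch on a toy dense ratings matrix. The function names and matrix are illustrative only (not Surprise's internals): Pearson centers each item's ratings by that item's own (column) mean, while adjusted cosine centers every rating by the rating user's (row) mean, i.e. the "orthogonal" mean discussed above.

```python
import numpy as np

# A toy dense ratings matrix (rows = users, columns = items).
R = np.array([
    [5.0, 3.0, 4.0],
    [4.0, 2.0, 5.0],
    [1.0, 5.0, 2.0],
])

def pearson_item_sim(R, i, j):
    """Pearson between items i and j: center each item's ratings by that
    item's own (column) mean before taking the cosine."""
    x = R[:, i] - R[:, i].mean()
    y = R[:, j] - R[:, j].mean()
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

def adjusted_cosine_item_sim(R, i, j):
    """Adjusted cosine between items i and j: center every rating by the
    rating *user's* (row) mean instead."""
    C = R - R.mean(axis=1, keepdims=True)
    x, y = C[:, i], C[:, j]
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(pearson_item_sim(R, 0, 2))          # centered by item means
print(adjusted_cosine_item_sim(R, 0, 2))  # centered by user means
```

On this toy matrix the two measures give clearly different values for the same item pair, which is the whole point of the distinction.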
Sorry, I thought about it some more, and I was wrong here:
My proposal: introduce a "mean-centered" param for the similarity measures, and use this param only within the KNNWithMeans algo, by passing mean-centered values to the compute_similarities method. That way we compute the mean values only once and keep fitting the algorithm computationally efficient.
Adjusted cosine uses different means from KNNWithMeans (they use orthogonal means, as you pointed out), so there is no need to pass data from the algo to the similarity measure, and they can be completely decoupled.
Yeah I agree.
Let's just implement the new similarity measure then. If some improvement can be made, there will always be room for it later.
Nicolas, there are a couple of other things with adjusted cosine:
Thus, I have a couple of proposals regarding the "adjusted cosine" implementation:
For the 1st bullet: actually, whole vectors (rather than only mutually rated items) can be used with the "usual" cosine as well. Example: "Recommender Systems: The Textbook", section 2.3.1.1, formula (2.6). So adjusted cosine is just a mean-centered adjustment over this version of cosine.
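The two cosine variants discussed here can be sketched side by side. This is an illustrative toy example, not Surprise code: the full-vector version treats missing ratings as zero (in the spirit of the textbook's formula (2.6)), while the common-ratings version restricts both vectors to the users who rated both items.

```python
import numpy as np

# Two item rating vectors over 5 users; np.nan marks "no rating".
a = np.array([5.0, np.nan, 3.0, 4.0, np.nan])
b = np.array([4.0, 2.0, np.nan, 5.0, 1.0])

def cosine_full(a, b):
    """Cosine over the whole vectors, treating missing ratings as zero."""
    a0, b0 = np.nan_to_num(a), np.nan_to_num(b)
    return a0.dot(b0) / (np.linalg.norm(a0) * np.linalg.norm(b0))

def cosine_common(a, b):
    """Cosine over commonly rated users only (the currently implemented
    behavior in the library, per this discussion)."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    ac, bc = a[mask], b[mask]
    return ac.dot(bc) / (np.linalg.norm(ac) * np.linalg.norm(bc))

print(cosine_full(a, b))
print(cosine_common(a, b))
```

Note how the full-vector value is pulled down by the zeros standing in for missing ratings, which is exactly the "arbitrary zero" concern raised below.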
BTW, I propose implementing all of this as separate scoring functions, rather than parameters:
Hi everyone and thank you for this useful project.
I have also encountered the same issues trying to implement item-item CF with adjusted cosine, using KNNBasic and passing the user-mean-centered ratings matrix as the train set instead of the original ratings matrix. The main blocker is that cosine similarity should use the complete vectors rather than only the elements of mutually rated items, as @ODemidenko says.
I see.
Basically, any similarity metric could use either the complete rating vectors or just the commonly rated items / common users, as currently implemented. I chose the latter because my intuition is that with the former we're comparing extremely sparse vectors, and choosing a value of zero for non-existing ratings is completely arbitrary.
That being said, the current version also has major flaws (basically, we're comparing similarities which do not have the same support, so that does not make sense either).
So if using the whole vectors is commonly done and efficient in practice, then there's no reason not to implement it. A good way to allow this new version would be to add a common_ratings_only key to the sim_options parameter, whose default would be True -- at least for now. Obviously, this should be available for all metrics.
However :) , this issue is about implementing adjusted cosine similarity. Using the whole vectors or only common ratings is an entirely different matter.
@NicolasHug As I myself need all kinds of cosines, I have already implemented all the alternatives, and would rather PR them all together.
But before committing, I need your agreement on the following design issue. I appreciate your preference for exposing common_ratings_only as a separate param, rather than providing a separate similarity function. Meanwhile, implementation-wise I don't see how to inject this param into the original cosine function without compromising code readability and performance. So I have implemented each alternative as a separate function, as this makes the code much more readable and run faster. There are two alternatives to handle the mismatch between the param-based approach we want to expose in the API and the actual separate functions:

1. Select the appropriate similarity function from sim_options based on the common_ratings_only value. This can be done within the compute_similarities function, which nicely decouples the user interface from the internal implementation. But such an approach raises another issue: the current doc is autogenerated from the similarities module docstrings, so the separate functions will spoil it.
2. Keep a single public cosine function with a common_ratings_only param and dispatch to the separate functions internally. Those functions won't have their own docstrings.

It seems that the 2nd approach is preferable, but I'm probably missing something. Could you confirm?
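A minimal sketch of the dispatch idea being discussed, with placeholder function bodies; all names and signatures here are illustrative, not Surprise's actual internals. A single param-based entry point routes on the common_ratings_only value, so the per-variant functions stay private (and docstring-free, in line with the 2nd approach).

```python
def _cosine_common_only(ratings):
    # placeholder for the existing implementation (commonly rated items only)
    return 'cosine/common-only'

def _cosine_full_vectors(ratings):
    # placeholder for the proposed full-vector implementation
    return 'cosine/full-vectors'

def compute_similarities(ratings, sim_options):
    """Param-based public interface; internal dispatch to private functions."""
    name = sim_options.get('name', 'msd')
    common_only = sim_options.get('common_ratings_only', True)
    if name == 'cosine':
        func = _cosine_common_only if common_only else _cosine_full_vectors
        return func(ratings)
    raise ValueError('unknown similarity name: ' + name)

# The user-facing side stays param-based, with common_ratings_only
# defaulting to True as proposed above:
print(compute_similarities(None, {'name': 'cosine'}))
print(compute_similarities(None, {'name': 'cosine',
                                  'common_ratings_only': False}))
```

The point of this shape is that the docs only ever describe the public function, while each variant keeps its own fast, readable implementation.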
Yes, there's no problem in creating different cosine functions (if it's really needed), as long as it's transparent for the user.
The second option is great.
But once again, to keep the commit history clean, the implementation of adjusted cosine and the common_ratings_only option should be done in completely separate PRs.
I added the common_ratings_only PR, and will add adjusted_cosine as soon as you accept the first one.
Regarding adding documentation: could you give me links to the exact notation used for the math formulas, and to a site where I can preview formulas to check that they render correctly? I briefly researched this and realized that there are too many alternatives, and I don't understand what the common ground between them is.
Hi,
@ODemidenko: I checked your implementation of cosine similarity with full vectors and it runs properly :) . However, my database is a bit large and it seems to be very slow (the other similarities from this project are also slow on my data). If you are interested, I'll just share what I did: I implemented a new cosine similarity using the cosine_similarity function from sklearn.metrics.pairwise, and it performs very well.
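The speedup described here can be sketched as follows. This is a guess at the approach, not the commenter's actual code: sklearn's cosine_similarity computes all item pairs in one vectorized call on the (sparse) ratings matrix, replacing the Python-level loop over pairs, with zeros standing in for missing ratings (i.e. the full-vector variant).

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# user x item ratings; zeros stand in for missing ratings.
R = csr_matrix(np.array([
    [5.0, 0.0, 3.0, 4.0],
    [4.0, 2.0, 0.0, 5.0],
    [0.0, 5.0, 2.0, 1.0],
]))

# Transpose so rows are items, then compare all item pairs at once.
item_sim = cosine_similarity(R.T)
print(item_sim.shape)
```

For adjusted cosine, one would mean-center the ratings per user before this call; note that centering destroys sparsity, so that step needs care on large data.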
@FrancescaCristo89 Thank you for the info!
I'll close this issue as it's been stagnating, and it actually concerns two different problems: supporting adjusted cosine, and computing similarities based on common ratings only (or not).
I've opened #163 and #164 as replacements (with references to this issue). Separating these two issues will hopefully make them clearer and easier to follow.
Nicolas
The current implementation of kNNWithMeans computes the similarity measure on the original ratings, and then uses the mean values to compute a prediction.
According to:
For the purpose of item-item similarity computation, we should use the "adjusted cosine" similarity: instead of taking the raw ratings, we subtract each user's average rating from their ratings before computing the item similarities.
How to address the problem: a possible approach would be to add an "adjusted cosine" similarity measure. But if we compute it independently of kNNWithMeans, we will compute the mean values twice, once for the similarity measure and once for the predictions, which seems computationally inefficient. So I would rather suggest incorporating the mean-adjustment into the prediction algorithm itself.
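The "compute the means only once" idea can be sketched like this. All names are illustrative, not Surprise's actual internals: fit() computes the user means a single time, uses the centered matrix for the adjusted-cosine item similarities, and predict() reuses those same means as the prediction baseline.

```python
import numpy as np

class ItemKNNAdjustedCosine:
    """Sketch: item-item kNN with adjusted cosine, user means computed once."""

    def fit(self, R):
        # R: dense user x item matrix with np.nan for missing ratings
        self.R = R
        self.user_means = np.nanmean(R, axis=1, keepdims=True)  # computed once
        C = np.nan_to_num(R - self.user_means)   # centered, missing -> 0
        norms = np.linalg.norm(C, axis=0)
        norms[norms == 0] = 1.0                  # guard items with no ratings
        self.sim = (C.T @ C) / np.outer(norms, norms)  # adjusted cosine
        return self

    def predict(self, u, i, k=2):
        mean_u = self.user_means[u, 0]           # reused, not recomputed
        rated = np.flatnonzero(~np.isnan(self.R[u]))
        rated = rated[rated != i]
        neighbors = rated[np.argsort(-self.sim[i, rated])][:k]
        w = self.sim[i, neighbors]
        if np.abs(w).sum() == 0:
            return float(mean_u)
        # weighted deviations from the user's mean
        # (negative similarities are not filtered, for brevity)
        dev = self.R[u, neighbors] - mean_u
        return float(mean_u + w @ dev / np.abs(w).sum())

R = np.array([
    [5.0, 3.0, 4.0],
    [4.0, 2.0, 5.0],
    [1.0, 5.0, 2.0],
])
model = ItemKNNAdjustedCosine().fit(R)
print(model.predict(0, 1))
```

This is only a dense-matrix sketch; a real implementation would work on the sparse rating structure, but it shows how a single pass over the means serves both the similarity computation and the prediction.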