About Pearson similarity

gasevi / pyreclab

pyRecLab is a library for quickly testing and prototyping of traditional recommender system methods, such as User KNN, Item KNN and FunkSVD Collaborative Filtering. It is developed and maintained by Gabriel Sepúlveda and Vicente Domínguez, advised by Prof. Denis Parra, all of them in Computer Science Department at PUC Chile, IA Lab and SocVis Lab.

GNU General Public License v3.0

122 stars 28 forks source link

About Pearson similarity #19

Closed KunlinY closed 6 years ago

KunlinY commented 6 years ago

Hi! When I looked into the implementation of Pearson similarity in Similarity.h, I am confused by the following code: https://github.com/gasevi/pyreclab/blob/master/algorithms/Similarity.h#L92 ` for( ind = v1.begin() ; ind != end ; ++ind ) { double rv1 = *ind; double rv2 = v2.get( ind.index() );

     if( rv2 > 0 ) 
     {
        devp += ( rv1 - mean1 ) * ( rv2 - mean2 );
        dev1 += ( rv1 - mean1 ) * ( rv1 - mean1 ); // Why put this line here??
        dev2 += ( rv2 - mean2 ) * ( rv2 - mean2 );
     }
  }

If the index of rv1 is not in rv2, then the dev1 will not be increased. Is there something wrong?

gasevi commented 6 years ago

Hi @KunlinY ,

first of all, the variables devp, dev1 and dev2 can be directly interpreted from the Pearson similarity definition. devp is the numerator and sqrt( dev1 ) * sqrt( dev2 ) is the denominator. I am not sure why dev1 seems to be wrong for you. In relation to your final question, to calculate the Pearson similarity we only have to consider the terms that are not zero, i.e., the dimensions that have been rated by both users. Take a look to the paper "Collaborative Filtering Recommender Systems", J. Ben Schafer et al. on page 12 ( 302 ). It say: "The Pearson correlation coefficient is calculated by comparing ratings for all items rated by both the target user and the neighbor". So you can ask me, why rv2 and not rv1 ? ... just a design criteria. Can I ask you why you need so much information ?

Best,

GSV.

KunlinY commented 6 years ago

Thank for your quick reply! I got your meaning @gasevi. However, I would like to know why dev1 increases only when rv2 is not zero, i.e., when user2 rate item ind. I think the calculation of dev1 should be independent of the value of rv2 and the increase of dev1 might need to be moved out of the IF condition. In my case, the code should be changed to:

for( ind = v1.begin() ; ind != end ; ++ind )
{
     double rv1 = *ind;
     double rv2 = v2.get( ind.index() );
     dev1 += ( rv1 - mean1 ) * ( rv1 - mean1 );  // moved from IF block

     if( rv2 > 0 ) 
     {
        devp += ( rv1 - mean1 ) * ( rv2 - mean2 );
        // dev1 += ( rv1 - mean1 ) * ( rv1 - mean1 ); // Why put this line here??
        dev2 += ( rv2 - mean2 ) * ( rv2 - mean2 );
     }
  }

gasevi commented 6 years ago

Hi @KunlinY ,

I think you are not considering that the iterator of v1 only gets elements different to zero, so rv1 are always going to take a valid rating value. does it make sense for you now ?.

Best,

GSV.

KunlinY commented 6 years ago

True. But when rv2 is 0 but rv1 not, dev1 misses the increment. I know here omits zero value for efficiency and that is totally fine. However, in Pearson alg, all the value in v1 should be summed up, no matter whether rv2 is 0 or not.

gasevi commented 6 years ago

"But when rv2 is 0 but rv1 not, dev1 misses the increment" ... Right !, because you only calculate the value of deviations when BOTH values ( rv1 AND rv2 ) are different to zero.

"The Pearson correlation coefficient is calculated by comparing ratings for all items rated by BOTH the target user and the neighbor"

KunlinY commented 6 years ago

Turns out that I misunderstand the Pearson similarity:( Appreciate your detailed answer!!!

gasevi commented 6 years ago

This is a special case because zeros are not really zeros as a value, but they are a "not information" value, so you can't use them to calculate anything. That's why you need both values are different to zero at the same time in order the result make sense. I think this is not related to the Pearson definition, but it is related to how to apply it to sparse vectors.