Open deil87 opened 3 months ago
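I think the problem here is that cosine similarity cannot see any difference when the support vectors are 1D, i.e. when the two users have only one common item. Any two positive 1D vectors point in the same direction, so their cosine is always 1.
To highlight my point, I will use ratings of 5 vs 1 instead of the 3 and 2 from my example above.
a = [5], b = [1]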
(a·b) / (‖a‖ × ‖b‖) = 5 / (5.000000 × 1.000000) = 1.000000
Even with a second dimension holding equal values, we would get a slightly better result (the measure starts to notice some difference between the vectors):
a = [5, 1], b = [1, 1]
a·b = 5×1 + 1×1 = 6.
‖a‖ = √[5² + 1²] = 5.099020 and
‖b‖ = √[1² + 1²] = 1.414214
(a·b) / (‖a‖ × ‖b‖) = 6 / (5.099020 × 1.414214) = 0.832050.
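The arithmetic above can be reproduced with a few lines of plain Python. This is a minimal sketch of the textbook cosine formula, not Surprise's internal implementation; the helper name cosine is made up for illustration.

```python
import math

def cosine(a, b):
    """Textbook cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One common item: any two positive ratings point in the same direction.
print(round(cosine([5], [1]), 6))        # 1.0
# Two common items: the difference in the first rating becomes visible.
print(round(cosine([5, 1], [1, 1]), 6))  # 0.83205
```

With a single common item the ratings cancel out of the formula entirely, which is why 3 vs 2 and 5 vs 1 both give exactly 1.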
So I believe that min_support should always be >= 2 when cosine similarity is specified in the parameters.
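As a workaround on the user side, the threshold can be set explicitly in the similarity options. The snippet below is only a sketch of that configuration: the value 2 is the suggestion from this comment, not a library default, and the dictionary is meant to be passed as the sim_options argument of a KNN-style algorithm such as KNNBasic.

```python
# Similarity configuration with the threshold suggested above.
sim_options = {
    "name": "cosine",      # similarity measure under discussion
    "user_based": True,    # compare users, as in the issue description
    "min_support": 2,      # suggested: require at least 2 common items
}
```

With this setting, pairs of users that share only one item should get a similarity of 0 instead of a misleading 1.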
Description
I was trying to create a custom algo, and before moving on to something more complicated I wanted to make sure I understand how the current code works. I couldn't understand why the cosine similarity between two particular users is 1.0 even though they gave different ratings to their common item.
Steps/Code to Reproduce
Preparing the data (taking a subsample of 20K to make exploration/investigation faster):
Then I'm running my new custom algo:
Expected Results
In the printout (the one that is provided in the docs) I was looking for some neighbors with non-zero similarity and, for example, took this one:
Note that I also added logging of the current estimate function parameters to know which item we are predicting for (428 in this example).
So I see that the algo considered users 171 and 1123 to be similar. I decided to check it manually.
As "user_based": True, we are calculating the similarity between user 171 and the other users who have ratings for item 428. So I checked trainset_full.ir[428].
Output:
Then I decided to check the ratings of these two users, 171 and 1123, to see whether they have similar ratings for their common items. I found only one common item, and the ratings are different.
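A manual check like this can be sketched as a small helper. It assumes a mapping shaped like Surprise's trainset.ur (user inner id to a list of (item inner id, rating) pairs); the function name common_ratings and the toy data, including item id 42, are made up for illustration and are not taken from the actual dataset.

```python
def common_ratings(ur, u, v):
    """Return {item: (rating_by_u, rating_by_v)} for items rated by both users."""
    ratings_u = dict(ur[u])
    ratings_v = dict(ur[v])
    shared = ratings_u.keys() & ratings_v.keys()
    return {i: (ratings_u[i], ratings_v[i]) for i in sorted(shared)}

# Toy stand-in for trainset_full.ur; all ids and ratings here are hypothetical.
toy_ur = {
    171: [(42, 2.0), (7, 4.0)],
    1123: [(42, 3.0), (9, 5.0)],
}
print(common_ratings(toy_ur, 171, 1123))  # {42: (2.0, 3.0)}
```

The number of entries in the result is the support for that pair of users, which is what min_support is compared against.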
I didn't specify min_support, but it shouldn't matter: when the number of common items is below it, the similarity is 0. Since we got a non-zero similarity, the number of common items must be greater than or equal to the min_support value (the default is probably 1).
I would expect the similarity not to be equal to 1, as the ratings are not the same (2.0 vs 3.0).
Actual Results
The 3 nearest neighbours of user 171 are: user 1123 with sim 1.00
Versions
macOS-10.16-x86_64-i386-64bit Python 3.10.14 (main, May 6 2024, 14:47:20) [Clang 14.0.6 ] surprise 1.1.4