bogliosimone / similaripy

Fast Python KNN-similarity algorithms for collaborative filtering models in recommender systems and other applications.
MIT License

What is the sense of k in dot product? #1

Closed federicoparroni closed 5 years ago

federicoparroni commented 5 years ago

Reading the example in the readme:

import similaripy as sim
import scipy.sparse as sps

# create a random user-rating matrix (URM)
urm = sps.random(1000, 2000, density=0.025)

# train the model with 50 knn per item 
model = sim.cosine(urm.T, k=50)

# recommend 100 items to users 1, 14 and 8
user_recommendations = sim.dot_product(urm, model, target_rows=[1,14,8], k=100)

I have a question about the k parameter on the last line. What does it mean? Can I use dot_product as a standard dot product between two matrices? Could you clarify this, please?

bogliosimone commented 5 years ago

Hi, the parameter k is the number of nearest neighbors kept per row in the computation. In this case, for a simple dot product, the top k values are the k highest values in each row of the matrix product.

If you need the standard dot product with all elements in the result, you can set k to the row length of the result (its number of columns, urm.shape[1] in the example above). Keep in mind that in this case you get dense rows with a lot of zeros, so depending on the size and density of the dataset it could require a significant amount of memory, and you lose the advantage of using sparse matrices.
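To make the meaning of k concrete, here is a minimal sketch that computes a full sparse product with SciPy and then keeps only the k largest stored values in each row. The helper `top_k_per_row` is hypothetical and written only for illustration; it is not how similaripy is implemented, which (per the explanation above) selects the top k values during the computation rather than afterwards.

```python
import numpy as np
import scipy.sparse as sps


def top_k_per_row(matrix, k):
    """Keep only the k largest stored values in each row of a sparse matrix.

    Hypothetical helper for illustration; this is not similaripy's code.
    Only explicitly stored entries are ranked, implicit zeros are ignored.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    csr = sps.csr_matrix(matrix)
    rows, cols, vals = [], [], []
    for i in range(csr.shape[0]):
        start, end = csr.indptr[i], csr.indptr[i + 1]
        row_vals = csr.data[start:end]
        row_cols = csr.indices[start:end]
        if row_vals.size > k:
            # positions of the k largest values in this row (in no particular order)
            keep = np.argpartition(row_vals, -k)[-k:]
            row_vals, row_cols = row_vals[keep], row_cols[keep]
        rows.extend([i] * row_vals.size)
        cols.extend(row_cols.tolist())
        vals.extend(row_vals.tolist())
    return sps.csr_matrix((vals, (rows, cols)), shape=csr.shape)


# tiny toy matrices, chosen so that no row of the product contains ties
a = sps.csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 2.0]]))
b = sps.csr_matrix(np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 4.0], [3.0, 0.0, 2.0]]))

full = a @ b                     # standard sparse dot product, every entry kept
top2 = top_k_per_row(full, k=2)  # only the 2 largest values per row survive

print(full.toarray())
print(top2.toarray())
```

With k equal to the number of columns of the product, the helper returns the full product unchanged, which matches the suggestion above.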

federicoparroni commented 5 years ago

Yep, I was doing exactly what you suggested! So in this case, do you think I should simply use the dot function of CSR matrices? Would it be faster?

bogliosimone commented 5 years ago

Sure, you could use the dot product of scipy. As for performance, I think the scipy function is a little bit faster, because it doesn't need to check which of the top k values per row to keep during the computation (it simply keeps them all).

In general, Similaripy functions are useful in those cases in which you can't compute the full product/similarity matrix because it requires too much memory (or because you only need the top k values).
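As a rough illustration of the memory trade-off, the sketch below computes a full product with plain SciPy and compares the number of stored entries with the upper bound a top-k result would have. The matrix sizes, the random `item_sim` stand-in for a similarity model, and the value of `k` are all assumptions chosen for the example; no timing or memory figures are claimed.

```python
import scipy.sparse as sps

# random stand-ins: a user-rating matrix and an item-item similarity model
urm = sps.random(200, 300, density=0.05, format="csr")
item_sim = sps.random(300, 300, density=0.05, format="csr")

# full sparse product with SciPy; equivalent to urm.dot(item_sim)
scores = urm @ item_sim

n_rows, n_cols = scores.shape
fill = scores.nnz / (n_rows * n_cols)
print(f"full product: shape={scores.shape}, stored entries={scores.nnz}, fill={fill:.1%}")

# a top-k result stores at most k values per row, whatever the fill of the full product
k = 10
max_topk_entries = n_rows * k
print(f"top-{k} result: at most {max_topk_entries} stored entries")
```

Even when both inputs are very sparse, the product can be much denser than either of them, which is why bounding the result to k values per row helps on large datasets.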

federicoparroni commented 5 years ago

Thank you so much!!

bogliosimone commented 5 years ago

You are welcome :)

If you found my work useful, you could leave me a star, thanks :)

federicoparroni commented 5 years ago

I'm a student at Polimi and your work helped me so much for the Rec Sys course :)