Just read http://www.benfrederickson.com/approximate-nearest-neighbours-for-recommender-systems/

How are you running Annoy for maximum inner product? Annoy doesn't support it out of the box, so I wonder if that's the reason it doesn't perform well.
It is weird with Annoy - I'm not sure why it didn't work well here, given that for a cosine lookup (instead of inner product) on the same dataset I found it performed roughly the same as faiss.
The code to get annoy working with inner product search is here: https://github.com/benfred/implicit/blob/master/implicit/approximate_als.py . It just requires transforming the item factors so that each item has the same norm by adding an extra dimension (and then setting that extra dimension to 0 for users):
```python
import numpy


def augment_inner_product_matrix(factors):
    """ This function transforms a factor matrix such that an angular nearest neighbours search
    will return top related items of the inner product.

    This involves transforming each row by adding one extra dimension as suggested in the paper:
    "Speeding Up the Xbox Recommender System Using a Euclidean Transformation for Inner-Product
    Spaces" https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/XboxInnerProduct.pdf

    Basically this involves transforming each feature vector so that they all have the same norm,
    which means the cosine of this transformed vector is proportional to the dot product (if the
    other vector in the cosine has a 0 in the extra dimension). """
    norms = numpy.linalg.norm(factors, axis=1)
    max_norm = norms.max()
    # add an extra dimension so that the norm of each row is the same (max_norm)
    extra_dimension = numpy.sqrt(max_norm ** 2 - norms ** 2)
    return max_norm, numpy.append(factors, extra_dimension.reshape(norms.shape[0], 1), axis=1)
```
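A quick sanity check of the claim in the docstring (my own illustration, not code from implicit): every augmented row has norm max_norm, so the cosine between an augmented item and a zero-padded user vector is the original inner product divided by the constant ||user|| * max_norm, which leaves the ranking unchanged:

```python
import numpy

item_factors = numpy.random.rand(1000, 32)
user = numpy.random.rand(32)

max_norm, augmented = augment_inner_product_matrix(item_factors)
user_augmented = numpy.append(user, 0)

# the extra dimension contributes nothing to the dot product because the
# user vector has a 0 there, so this is just <user, item> rescaled
cosines = augmented.dot(user_augmented) / (numpy.linalg.norm(user) * max_norm)
inner_products = item_factors.dot(user)

# a constant positive rescaling doesn't change the ordering
assert (numpy.argsort(-cosines) == numpy.argsort(-inner_products)).all()
```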
After applying that transformation to the item factors matrix, I indexed the transformed matrix with annoy and then query it like https://github.com/benfred/implicit/blob/master/implicit/approximate_als.py#L248 .
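In outline, the indexing and query side looks something like this (a simplified sketch of that code, not the exact implementation; item_factors and user_factors are assumed to be numpy arrays of the latent factors):

```python
from annoy import AnnoyIndex
import numpy

max_norm, augmented = augment_inner_product_matrix(item_factors)

# build an angular index over the augmented item factors
index = AnnoyIndex(augmented.shape[1], 'angular')
for i, row in enumerate(augmented):
    index.add_item(i, row)
index.build(50)

# pad the user factors with a 0 so the extra dimension drops out of the dot
# product - the angular query then returns the top items by inner product
query = numpy.append(user_factors, 0)
ids = index.get_nns_by_vector(query, 10)
```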
The code to get this working with ann-benchmarks is a bit of a hack, and I don't think it will still work with your latest changes =( I just dumped out an npz file, containing the transformed data, in the format that ann-benchmarks used to use: https://github.com/benfred/bens-blog-code/blob/master/approximate_als/create_ann_benchmarks_data.py#L51 . Since you've changed to hdf5, I doubt this still works.
Totally unrelated, but I read your post this morning https://erikbern.com/2018/03/07/lessons-from-content-marketing-myself-aka-blogging-for-five-years.html - and totally agree with everything you said there.
One small thing though: even with more than 1K followers on feedly, you can still get the exact count in a tooltip by hovering over the count.
ok, let me think about that. are you doing the same thing for all the libraries, or is this in particular for annoy?
thanks for the feedly thing!
I'm doing the same thing for NMSLIB and Faiss in that post, to be consistent. I could have special-cased faiss (since it has an inner product index option), but it was easier to treat faiss the same as nmslib/annoy for the comparison with ann-benchmarks.
However, I'm using faiss's inner product index option in my implicit library for generating recommendations instead of doing the transform myself.
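For comparison, the faiss path needs no transform at all, since faiss can search by inner product directly. A minimal sketch (my own, using the flat exact index rather than whatever implicit actually configures):

```python
import faiss
import numpy

# faiss operates on float32 matrices
items = item_factors.astype(numpy.float32)

# IndexFlatIP does exact maximum inner product search
index = faiss.IndexFlatIP(items.shape[1])
index.add(items)

# query with the raw, unaugmented user factors
scores, ids = index.search(user_factors.astype(numpy.float32).reshape(1, -1), 10)
```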
ok, interesting, not sure why annoy performs so much worse than the other ones. if i have time i might try to reproduce your case (tempted to add it to ann-benchmarks actually)
guess we can close this for now
just rediscovered this thread because someone else brought up maximum inner product search for annoy in https://github.com/spotify/annoy/pull/303
random thought i had – wonder if it would make sense to add your dataset to ann-benchmarks?
@erikbern: I think it's a good idea to add to ann-benchmarks! Someone else was pinging me about reproducing my results in that post - but I accidentally deleted my fork of ann-benchmarks with the changes required to reproduce 🙁
I created a PR here: https://github.com/erikbern/ann-benchmarks/pull/91