benfred / bens-blog-code

code snippets from my blog
http://www.benfrederickson.com/blog

Maximum inner product for Annoy #10

Closed erikbern closed 6 years ago

erikbern commented 6 years ago

Just read http://www.benfrederickson.com/approximate-nearest-neighbours-for-recommender-systems/

How are you running Annoy for maximum inner product? Annoy doesn't support it out of the box, so I wonder if that's the reason it doesn't perform well

benfred commented 6 years ago

It is weird with Annoy - I'm not sure why it didn't work well here, given that for a cosine lookup (instead of inner product) on the same dataset I found it performed roughly the same as faiss.

The code to get annoy working with inner product search is here: https://github.com/benfred/implicit/blob/master/implicit/approximate_als.py . It just requires transforming the item factors so that each item has the same norm by adding an extra dimension (and then setting that extra dimension to 0 for users):

import numpy


def augment_inner_product_matrix(factors):
    """ This function transforms a factor matrix such that an angular nearest neighbours search
    will return the top related items by inner product.
    This involves transforming each row by adding one extra dimension as suggested in the paper:
    "Speeding Up the Xbox Recommender System Using a Euclidean Transformation for Inner-Product
    Spaces" https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/XboxInnerProduct.pdf
    Basically this involves transforming each feature vector so that they all have the same norm,
    which means the cosine of this transformed vector is proportional to the dot product (if the
    other vector in the cosine has a 0 in the extra dimension). """
    norms = numpy.linalg.norm(factors, axis=1)
    max_norm = norms.max()

    # add an extra dimension so that the norm of each row is the same
    # (max_norm)
    extra_dimension = numpy.sqrt(max_norm ** 2 - norms ** 2)
    return max_norm, numpy.append(factors, extra_dimension.reshape(norms.shape[0], 1), axis=1)

After applying that transformation to the item factors matrix, I indexed the transformed matrix with annoy and then query like https://github.com/benfred/implicit/blob/master/implicit/approximate_als.py#L248 .
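To see the transform end to end, here is a minimal sketch (not from the thread) that uses a brute-force cosine search as a stand-in for Annoy's angular index, and checks that the cosine ranking on the augmented vectors matches the exact inner product ranking on the original factors. The array names and sizes are illustrative:

```python
import numpy

rng = numpy.random.default_rng(42)
item_factors = rng.standard_normal((500, 16))
user = rng.standard_normal(16)

# augment: add an extra dimension so every row has the same norm (max_norm)
norms = numpy.linalg.norm(item_factors, axis=1)
max_norm = norms.max()
extra = numpy.sqrt(max_norm ** 2 - norms ** 2)
augmented = numpy.append(item_factors, extra.reshape(-1, 1), axis=1)

# query with the extra dimension set to 0, as described for user vectors
query = numpy.append(user, 0)

# brute-force cosine similarity (stand-in for Annoy's angular search)
cosine = augmented @ query / (
    numpy.linalg.norm(augmented, axis=1) * numpy.linalg.norm(query))
top_by_cosine = numpy.argsort(-cosine)[:10]

# exact maximum inner product on the original, untransformed factors
top_by_dot = numpy.argsort(-(item_factors @ user))[:10]

assert numpy.array_equal(top_by_cosine, top_by_dot)
```

Since every augmented row has norm max_norm, the cosine with the zero-padded query is just (item · user) / (max_norm · ‖query‖), which is monotone in the inner product, so the two rankings agree exactly.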

The code to get this working with ann-benchmarks is a bit of a hack, and I don't think it will still work with your latest changes =( I just dumped out a npz file in the format that ann-benchmarks used to use with the transformed data: https://github.com/benfred/bens-blog-code/blob/master/approximate_als/create_ann_benchmarks_data.py#L51 . I think you changed to hdf5 so I doubt this will still work.

benfred commented 6 years ago

Totally unrelated, but I read your post this morning https://erikbern.com/2018/03/07/lessons-from-content-marketing-myself-aka-blogging-for-five-years.html - and totally agree with everything you said there.

One small thing though, even with more than 1K followers on feedly you can still get the exact count in a tooltip by hovering over the count:

[screenshot: Feedly tooltip showing the exact follower count, 2018-03-09]

erikbern commented 6 years ago

ok, let me think about that. are you doing the same thing for all the libraries, or is this in particular for annoy?

thanks for the feedly thing!

benfred commented 6 years ago

I'm doing the same thing for NMSLIB and faiss in that post to be consistent. I could have special-cased faiss (since it has an inner product index option), but it was easier to treat faiss the same as NMSLIB/Annoy for that comparison with ann-benchmarks.

However, I'm using faiss's inner product index option in my implicit library for generating recommendations instead of doing the transform myself.

erikbern commented 6 years ago

ok, interesting, not sure why annoy performs so much worse than the other ones. if i have time i might try to reproduce your case (tempted to add it to ann-benchmarks actually)

erikbern commented 6 years ago

guess we can close this for now

erikbern commented 6 years ago

just rediscovered this thread because someone else brought up maximum inner product search for annoy in https://github.com/spotify/annoy/pull/303

random thought i had – wonder if it would make sense to add your dataset to ann-benchmarks?

benfred commented 6 years ago

@erikbern: I think it's a good idea to add it to ann-benchmarks! Someone else was pinging me about reproducing my results in that post - but I accidentally deleted my fork of ann-benchmarks with the changes required to reproduce 🙁

I created a PR here: https://github.com/erikbern/ann-benchmarks/pull/91