benfred / implicit

Fast Python Collaborative Filtering for Implicit Feedback Datasets
https://benfred.github.io/implicit/
MIT License

GPU Inference #394

Closed. tRosenflanz closed this issue 3 years ago

tRosenflanz commented 4 years ago

Currently, only training seems to work on the GPU. Inference should see even higher speed-ups since it involves a large number of matrix multiplications. We replaced the normal recommend_all with a simple TensorFlow implementation and saw speed-ups in the 40x+ range with exactly the same results as the native implementation. It would be nice if the library supported GPU inference by default.
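
For reference, a minimal sketch of that kind of batch inference in TensorFlow (not our exact code): score every item for every user with one matrix multiply, then take the top-K per user. The `recommend_all_tf` name and the factor arrays are illustrative; filtering of already-seen items is left out, and for very large user counts you would process users in chunks.

```python
import numpy as np
import tensorflow as tf

def recommend_all_tf(user_factors: np.ndarray, item_factors: np.ndarray, k: int = 10):
    users = tf.constant(user_factors, dtype=tf.float32)   # (n_users, n_factors)
    items = tf.constant(item_factors, dtype=tf.float32)   # (n_items, n_factors)
    # one big matrix multiply gives every user/item score
    scores = tf.linalg.matmul(users, items, transpose_b=True)
    # top-K items per user, sorted best-first
    top_scores, top_ids = tf.math.top_k(scores, k=k)
    return top_ids.numpy(), top_scores.numpy()
```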

benfred commented 3 years ago

Agreed - this would be awesome to add.

The last time I looked at this, running on the GPU massively sped up the calls that calculate the scores (a vector/matrix dot product for a single user, or a matrix multiply for multiple users) - but the time needed to calculate the top-K items from the resulting score vectors was so slow that there wasn't much of a win in overall performance. There have been performance improvements to cupy since I last looked at this back in 2017, and I think we should come back to it (https://github.com/cupy/cupy/pull/763).

To get a speed boost on inference with the existing code, you can use the FaissAlternatingLeastSquares model on the gpu (see https://www.benfrederickson.com/approximate-nearest-neighbours-for-recommender-systems/). On the lastfm example, running python examples/lastfm.py --model faiss_als --recommend takes only 3 minutes on my machine to recommend items for all 360K users, versus about an hour with the cpu inference code.
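
A hedged usage sketch of that approximate route (constructor arguments are from memory, the exact fit/recommend signatures have changed between implicit releases, and it assumes a GPU build of faiss is installed):

```python
import numpy as np
from implicit.approximate_als import FaissAlternatingLeastSquares
from scipy.sparse import random as sparse_random

# toy interaction matrix: 1,000 users x 5,000 items
user_items = sparse_random(1_000, 5_000, density=0.01, format="csr", dtype=np.float32)

model = FaissAlternatingLeastSquares(factors=64, use_gpu=True)
model.fit(user_items)

# approximate top-10 recommendations for user 0
ids, scores = model.recommend(0, user_items[0], N=10)
```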

benfred commented 3 years ago

I added some basic gpu inference routines in #406. This involves keeping the learned factors on the gpu as cupy arrays (instead of transferring them back to host memory immediately), and then using cupy for the inference routines. On my desktop the recommend/similar_items calls are about 10x faster than the cpu version using numpy. Most of the time is spent in cupy.argpartition, and I think there is some potential for improvement there.
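
Roughly, the approach looks like this (an illustrative sketch, not the actual code in the PR): the factors are transferred to the device once and kept there as cupy arrays, and recommend is a dot product followed by cp.argpartition.

```python
import cupy as cp

class GPUFactors:
    """Keep the learned factors resident on the GPU and score with cupy."""

    def __init__(self, user_factors, item_factors):
        # transfer once, then stay in device memory
        self.user_factors = cp.asarray(user_factors, dtype=cp.float32)
        self.item_factors = cp.asarray(item_factors, dtype=cp.float32)

    def recommend(self, userid, k=10):
        scores = self.item_factors @ self.user_factors[userid]  # (n_items,)
        ids = cp.argpartition(-scores, k)[:k]                    # top-k, unordered
        ids = ids[cp.argsort(-scores[ids])]                      # order the k best
        return cp.asnumpy(ids), cp.asnumpy(scores[ids])
```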

This change ended up being fairly large, since I split the gpu/cpu code entirely into separate models (basically every method of the gpu version needs specialization, and I thought it would be cleaner in the long run to keep these separate). I've created factory functions with the same API as the previous AlternatingLeastSquares/BayesianPersonalizedRanking classes, though, so nothing should break with this change.
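
So usage stays the same; a quick sketch of what I mean (use_gpu is assumed to pick the GPU-specialized model on a CUDA-enabled install):

```python
import implicit

# same entry point as before; the factory returns different classes
cpu_model = implicit.als.AlternatingLeastSquares(factors=64, use_gpu=False)
gpu_model = implicit.als.AlternatingLeastSquares(factors=64, use_gpu=True)
print(type(cpu_model), type(gpu_model))  # different classes, same interface
```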

tRosenflanz commented 3 years ago

This is great, thank you for adding this!

I had a quick look through the PR but didn't see a method for GPU recommend_all. Did I miss it and it is actually supported? It should be even faster than 10x for batches of multiple users.

benfred commented 3 years ago

Yeah - I think there is some future work here before we can totally close this out (and do a new release).

I'm wondering about the performance of cp.argpartition compared to the top-k support in TF. I ran a test using cupy to calculate recommendations for a batch of users: it was basically the same speed as calculating the recommendations one user at a time, and again was bottlenecked on the cp.argpartition call. I'm wondering how much faster the tensorflow top_k function is relative to this.
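
Something along these lines is what I mean (a rough micro-benchmark sketch, sizes made up, both libraries need GPU builds; not code from the repo):

```python
import time
import cupy as cp
import tensorflow as tf

batch, n_items, k = 256, 300_000, 10
scores_cp = cp.random.rand(batch, n_items, dtype=cp.float32)
scores_tf = tf.random.uniform((batch, n_items), dtype=tf.float32)

def timed(fn):
    fn()                                 # warm-up
    start = time.time()
    fn()
    return time.time() - start

def run_cupy():
    cp.argpartition(-scores_cp, k, axis=1)[:, :k]
    cp.cuda.Stream.null.synchronize()    # wait for the GPU work to finish

def run_tf():
    _, ids = tf.math.top_k(scores_tf, k=k)
    ids.numpy()                          # forces the GPU work to finish

print("cupy argpartition:", timed(run_cupy))
print("tf.math.top_k:    ", timed(run_tf))
```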

benfred commented 3 years ago

I ran a bunch of tests comparing the top-k performance of TensorFlow, ArrayFire, faiss, and cupy - and the clear loser in these tests was cupy. While cupy is competitive with tf/faiss at small batch sizes (say, querying for 1 row), it doesn't scale up to large batch sizes nearly as well, and at large batch sizes it can be 100x slower than faiss.

I've removed the cupy dependency in https://github.com/benfred/implicit/pull/458 and added some custom gpu code to do the top-k lookup. There is still more work to do to improve its performance - but it's already 20% faster than the cupy top-k code on the lastfm dataset when querying results 1 item at a time, and about 5x faster than cupy with a batch size of 100. This code is roughly the same as what TF does with large K values (> 100) - but we can be smarter about selecting items with small K values here.
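
To illustrate the small-K idea (a CPU analogy in Python, not the actual CUDA kernel): instead of partitioning or sorting the whole score vector, keep a bounded heap of the best K while scanning, which does far less work when K is small.

```python
import heapq

def top_k_small(scores, k):
    """Track only the k best (score, index) pairs while scanning the scores."""
    heap = []                                   # min-heap, never longer than k
    for idx, score in enumerate(scores):
        if len(heap) < k:
            heapq.heappush(heap, (score, idx))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, idx))
    return sorted(heap, reverse=True)           # best first
```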

benfred commented 3 years ago

There were also further speedups in #460. Currently on my system this code is about 40x faster than the CPU version when querying for a single item at a time.

This GPU inference code is included in v0.4.8

We're still missing the recommend_all functionality here (which will bring another major speedup). I'm tracking that in https://github.com/benfred/implicit/issues/477 now, and going to call this issue done.

benfred commented 2 years ago

Things have sped up further by using the blockSelect algorithm from faiss in #506, and we now have batch mode enabled in all the calls here.
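
A hedged sketch of what batch-mode usage looks like with the factory API (argument and return conventions have changed between releases, so check the docs for your version; assumes a CUDA-enabled install):

```python
import numpy as np
import implicit
from scipy.sparse import random as sparse_random

# toy interaction matrix: 10,000 users x 50,000 items
user_items = sparse_random(10_000, 50_000, density=0.001, format="csr", dtype=np.float32)

model = implicit.als.AlternatingLeastSquares(factors=64, use_gpu=True)
model.fit(user_items)

# batch mode: pass an array of user ids and get back (n_users, N) arrays
userids = np.arange(user_items.shape[0])
ids, scores = model.recommend(userids, user_items[userids], N=10)
```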