Significantly improve performance of SVD and SVDpp (cleaner)

ProfHercules commented 2 years ago

All changes are documented more clearly in the commits. Before starting to make changes, I ran all tests using Python 3.9, they all passed. After all changes, all the tests still pass.

Initially, on my M1 MacBook Air, examples/bench_mf.py produced the following:

Movielens 100k	RMSE	MAE	Time
SVD	0.936	0.738	0:00:12
SVD++	0.922	0.723	0:07:08
NMF	0.964	0.758	0:00:11

After all changes, the run time looks like this:

Movielens 100k	RMSE	MAE	Time
SVD	0.936	0.738	0:00:02
SVD++	0.922	0.723	0:00:32
NMF	0.964	0.758	0:00:11

Which is a 6x improvement in speed for SVD and a 13.38x improvement for SVDpp.

I also ran the same benchmark using the MovieLens 1M dataset, and got the following:

Movielens 1M	RMSE	MAE	Time
SVD	0.874	0.686	0:00:25
SVD++	0.862	0.672	0:10:37
NMF	0.917	0.725	0:01:49

The primary differences between this PR and #400 is

the commit history is cleaner
the code is changed as little as possible, to make changes clearer
memory usage is largely unchanged (previous implementation sort-of doubled memory usage 😅).
only the algorithms with code changes have differing performance.

Note: I documented the process here, feel free to give it a read!

NicolasHug commented 2 years ago

@ProfHercules Thank you so much for your dedication and for taking the time to submit a PR. I'm sorry I couldn't get to it sooner.

Since you submitted, I've made a bunch of modifications on the master branch. In master I have removed some old code relying on six, which is where a bunch of Python interaction were coming from when using range() in Cython. Also I enabled by default a bunch of compiler directive like the ones you were using here (bouncheck=False, etc.). I also updated the CI so that it can run small benchmarks when submitting a PR (it's the "Benchmark / build (3.9) (pull_request)" below).

Before changes on master: https://github.com/NicolasHug/Surprise/pull/422

Movielens 100k	RMSE	MAE	Time
SVD	0.936	0.738	0:00:30
SVD++	0.922	0.723	0:16:16
NMF	0.964	0.758	0:00:30

After changes on `master` https://github.com/NicolasHug/Surprise/pull/421	Movielens 100k	RMSE	MAE
SVD	0.936	0.738	0:00:07
SVD++	0.922	0.723	0:05:24
NMF	0.964	0.758	0:00:09

This PR:

Movielens 100k	RMSE	MAE	Time
SVD	0.936	0.738	0:00:05
SVD++ cache_ratings=False	0.922	0.723	0:01:54
SVD++ cache_ratings=True	0.921	0.722	0:01:38
NMF	0.965	0.759	0:00:08

I've also made a bunch of changes to this PR. Some of them are cosmetic, some of them are maybe more relevant:

removed the caching the of the sqrt(Iu_length) as I didn't notice any significant performance hit, and it saves of bit of memory
put back the range() calls since they don't generate interactions anymore (note: to match the Cython options from setup.py we need to use -X wraparound=False,boundscheck=False,cdivision=True,language_level=3 now).
I removed the use of C++: I started removing the map to just use a vector[vector[int]] but in the end a good old malloced int ** could work just as well here, so I went for that. Hopefully I didn't introduce any memory leak.
The original code in this PR requires duplicating n_ratings ints (the Ius), so it can be a bit expensive in terms of memory. So I implemented two version: cache_ratings=True which corresponds to your strategy, where all ratings are cached, and cache_ratings=False where we still retrieve Iu on the fly, but in a much more efficient data-structure and with fewer Python interations. Both are a lot faster than master, and cache_ratings=True is still a bit faster thancache_ratings=False - for a higher memory footprint.

On the benchmark above we observe similar speedups as you did: 6X for SVD and ~10X for SVDpp.

NicolasHug commented 2 years ago

Merging now, thanks a lot again for your PR @ProfHercules , this is a very nice improvement!

NicolasHug / Surprise

Significantly improve performance of SVD and SVDpp (cleaner) #401