NicolasHug / Surprise

A Python scikit for building and analyzing recommender systems
http://surpriselib.com
BSD 3-Clause "New" or "Revised" License

Significantly improve performance of SVD and SVDpp #400

Closed: ProfHercules closed this pull request 2 years ago

ProfHercules commented 2 years ago

I made some small changes, mostly to the matrix_factorization.pyx file, that drastically improve performance of the SVD and SVDpp algorithms. Most of the speed-up is evident in the SVDpp algorithm.

Most of the improvement comes from rewriting some loop constructs to avoid Python-level code as much as possible. This mostly involved changing `for _ in` loops to `while` constructs; please see matrix_factorization.pyx for more context.
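For readers unfamiliar with the file, here is a pure-Python sketch of the kind of SGD training loop involved, using the `while`-style epoch loop described above. All names here are hypothetical; the actual implementation in matrix_factorization.pyx is typed Cython code, where avoiding Python iteration is what produces the speed-up.

```python
import numpy as np

def sgd_svd(ratings, n_users, n_items, n_factors=10, n_epochs=20,
            lr=0.005, reg=0.02, seed=0):
    """Sketch of a biased-SVD SGD loop (illustrative only).

    ratings: list of (user, item, rating) triples with 0-based ids.
    """
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))  # global mean
    bu = np.zeros(n_users)                           # user biases
    bi = np.zeros(n_items)                           # item biases
    P = rng.normal(0, 0.1, (n_users, n_factors))     # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))     # item factors

    epoch = 0
    while epoch < n_epochs:  # while-style loop, as in the PR
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu = P[u].copy()  # keep the pre-update user factors
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
        epoch += 1
    return mu, bu, bi, P, Q

def predict(model, u, i):
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + P[u] @ Q[i]
```

In plain Python both loop styles cost the same; the point in Cython is that a `while` over a typed C integer compiles to a bare C loop with no Python iterator protocol involved.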

On my machine (M1 MacBook Air) the benchmarks produce the following results:

**Movielens 100k** (original vs. new times)

| Algorithm | RMSE | MAE | Time (original) | Time (new) | Speed-up |
| --- | --- | --- | --- | --- | --- |
| SVD | 0.936 | 0.738 | 0:00:11 | 0:00:02 | 5.5x |
| SVD++ | 0.922 | 0.723 | 0:06:54 | 0:00:31 | 13.35x |
| NMF | 0.964 | 0.758 | 0:00:11 | 0:00:04 | 2.75x |
| Slope One | 0.946 | 0.743 | 0:00:06 | 0:00:06 | |
| k-NN | 0.98 | 0.774 | 0:00:06 | 0:00:06 | |
| Centered k-NN | 0.951 | 0.749 | 0:00:06 | 0:00:06 | |
| k-NN Baseline | 0.931 | 0.733 | 0:00:07 | 0:00:07 | |
| Co-Clustering | 0.965 | 0.755 | 0:00:02 | 0:00:02 | |
| Baseline | 0.944 | 0.748 | 0:00:00 | 0:00:00 | |
| Random | 1.523 | 1.222 | 0:00:00 | 0:00:00 | |
I haven't spent the time to run the 1M benchmark for the original code, but for the updated code it produces:

**Movielens 1M** (new code only)

| Algorithm | RMSE | MAE | Time | 1M / 100k |
| --- | --- | --- | --- | --- |
| SVD | 0.874 | 0.686 | 0:00:24 | 12x |
| SVD++ | 0.862 | 0.672 | 0:10:39 | 20.61x |
| NMF | 0.917 | 0.725 | 0:00:44 | 11x |
| Slope One | 0.907 | 0.715 | 0:02:15 | |
| k-NN | 0.923 | 0.727 | 0:04:44 | |
| Centered k-NN | 0.929 | 0.738 | 0:04:48 | |
| k-NN Baseline | 0.895 | 0.706 | 0:04:54 | |
| Co-Clustering | 0.916 | 0.718 | 0:00:32 | |
| Baseline | 0.909 | 0.719 | 0:00:10 | |
| Random | 1.506 | 1.207 | 0:00:10 | |
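The 1M / 100k column is just the wall-clock time on 1M divided by the time on 100k (both with the new code). A quick check for the SVD++ ratio, assuming the h:mm:ss times from the tables:

```python
def seconds(t: str) -> int:
    """Convert an h:mm:ss benchmark time to whole seconds."""
    h, m, s = (int(part) for part in t.split(":"))
    return 3600 * h + 60 * m + s

# SVD++ with the new code: 1M dataset vs. 100k dataset
ratio = seconds("0:10:39") / seconds("0:00:31")
print(round(ratio, 2))  # → 20.61
```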

Using the above table, along with the original benchmarks, we can see how each algorithm's runtime scales when switching from the 100k dataset to the 1M dataset (the 1M / 100k column).

Honestly, I'm not exactly sure why NMF is faster... Both the original code and the new code were compiled from source before running the benchmarks.

The scaling of SVDpp might need to be investigated, since it grows worse with dataset size than the other factorization algorithms (20.61x vs. 12x for SVD and 11x for NMF). But considering a base speed-up of 13.35x, the reduced scaling might be a worthwhile tradeoff.

Note: I haven't written any new tests, since the algorithms are fundamentally unchanged. All tests that passed before my changes still pass after them (and run roughly 2x as fast).

ProfHercules commented 2 years ago

Closing for now, want to improve the commit history to make changes easier to follow. Will open a new pull request when done.