NicolasHug / Surprise

A Python scikit for building and analyzing recommender systems
http://surpriselib.com
BSD 3-Clause "New" or "Revised" License

Improve similarities performance #448

Open mihaiblidaru opened 1 year ago

mihaiblidaru commented 1 year ago

I'm trying to test multiple recommendation algorithms on a very large dataset (6M ratings, 69k users, 17k items), and I noticed that for the algorithms that compute similarities the fit time is very high.

I found some redundant operations that could be eliminated to speed up the similarity computation.

In similarities.pyx, when computing auxiliary values such as the freq and prods arrays, y_ratings is iterated over in full twice (the inner loop runs over every rating for every rating of the outer loop), resulting in N² operations for each y_ratings in yr:

    for y, y_ratings in yr.items():
        for xi, ri in y_ratings:
            for xj, rj in y_ratings:
                freq[xi, xj] += 1
                prods[xi, xj] += ri * rj
                sqi[xi, xj] += ri**2
                sqj[xi, xj] += rj**2            
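
For context, the second pass that turns these aggregates into similarity scores (sketched here roughly for the cosine case, simplified and with the Cython typing stripped; n_x, sim and min_sprt come from the surrounding function) already iterates only over the upper triangle and mirrors sim at the end:

    import numpy as np

    for xi in range(n_x):
        sim[xi, xi] = 1
        for xj in range(xi + 1, n_x):
            if freq[xi, xj] < min_sprt:
                sim[xi, xj] = 0  # not enough common ratings
            else:
                sim[xi, xj] = prods[xi, xj] / np.sqrt(sqi[xi, xj] * sqj[xi, xj])
            sim[xj, xi] = sim[xi, xj]  # sim itself is filled symmetrically

So, at least in the cosine case, only the upper triangle of freq, prods, sqi and sqj is actually read in that pass.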

We can take advantage of the fact that all auxiliary matrices are symmetric (M[x, y] == M[y, x]), so filling only the upper triangle (the entries above the diagonal) is enough to compute the similarities.

My change sorts each y_ratings list before filling the auxiliary arrays and restricts the inner loop over y_ratings so that only the upper triangle of the auxiliary matrices is filled. This reduces the number of operations for each element in yr from N² to N(N-1)/2:

    sorted_yr = {y: sorted(y_ratings, key=lambda x: x[0]) for y, y_ratings in yr.items()}

    for y, y_ratings in sorted_yr.items():
        for i, (xi, ri) in enumerate(y_ratings):
            # only pairs with xi < xj are visited; the lower triangle
            # is recovered later by symmetry
            for xj, rj in y_ratings[i + 1:]:
                freq[xi, xj] += 1
                prods[xi, xj] += ri * rj
                sqi[xi, xj] += ri**2
                sqj[xi, xj] += rj**2
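
As a quick sanity check that the upper-triangle fill plus mirroring gives the same aggregates as the full double loop, here is a small standalone sketch (not part of the patch; the toy yr dict, matrix size and helper names are made up, only prods is checked, and the diagonal is handled explicitly since the sliced loop skips it):

    import numpy as np

    # toy data: y -> list of (x inner id, rating) pairs, deliberately unsorted
    yr = {
        0: [(2, 4.0), (0, 3.0), (1, 5.0)],
        1: [(1, 2.0), (3, 4.0), (0, 1.0)],
    }
    n_x = 4

    def naive_prods(yr, n_x):
        # current implementation: every ordered pair (xi, xj) is visited
        prods = np.zeros((n_x, n_x))
        for y, y_ratings in yr.items():
            for xi, ri in y_ratings:
                for xj, rj in y_ratings:
                    prods[xi, xj] += ri * rj
        return prods

    def upper_triangle_prods(yr, n_x):
        # proposed implementation: sort, fill xi < xj only, then mirror
        prods = np.zeros((n_x, n_x))
        sorted_yr = {y: sorted(r, key=lambda t: t[0]) for y, r in yr.items()}
        for y, y_ratings in sorted_yr.items():
            for i, (xi, ri) in enumerate(y_ratings):
                prods[xi, xi] += ri * ri          # diagonal, once per rating
                for xj, rj in y_ratings[i + 1:]:  # strictly above the diagonal
                    prods[xi, xj] += ri * rj
        return prods + np.triu(prods, k=1).T      # copy upper triangle down

    assert np.allclose(naive_prods(yr, n_x), upper_triangle_prods(yr, n_x))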

The sorting adds some overhead (each list is sorted in O(N log N), which is quickly dominated by the N(N-1)/2 pair updates), so it might reduce performance on small datasets, but for larger datasets like movielens-1m the gains are very noticeable.

Here are some performance tests I did on my laptop (Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz). The times in the tables below are the average time it took to train a KNNBasic algorithm (just the `.fit` method call), averaged over 50 runs.

Dataset: movielens-1m

| Similarity | Before (seconds) | After (seconds) | After / Before (%) |
| --- | --- | --- | --- |
| cosine | 32.58 | 13.78 | 42.29 % |
| msd | 29.94 | 11.14 | 37.20 % |
| pearson | 38.21 | 16.98 | 44.43 % |
| pearson_baseline | 38.26 | 17.86 | 46.68 % |

Dataset: movielens-100k

| Similarity | Before (seconds) | After (seconds) | After / Before (%) |
| --- | --- | --- | --- |
| cosine | 0.46 | 0.20 | 43.47 % |
| msd | 0.28 | 0.15 | 53.57 % |
| pearson | 0.58 | 0.27 | 46.55 % |
| pearson_baseline | 0.66 | 0.42 | 63.63 % |
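
For reference, a rough sketch of how a measurement like this could be reproduced (the exact timing harness isn't included above; the similarity choice, user_based setting and number of runs below are only placeholders):

    import time

    from surprise import Dataset, KNNBasic

    data = Dataset.load_builtin("ml-1m")
    trainset = data.build_full_trainset()

    algo = KNNBasic(sim_options={"name": "cosine", "user_based": False})

    n_runs = 5  # the tables above average over 50 runs
    fit_times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        algo.fit(trainset)  # only the training step (similarity computation) is timed
        fit_times.append(time.perf_counter() - start)

    print("mean fit time: %.2f s" % (sum(fit_times) / len(fit_times)))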

Of course, all tests pass successfully.

mihaiblidaru commented 1 year ago

I just discovered that there are official benchmarks. I ran them using GitHub workflows, which was a terrible idea because the VMs they run on are very slow.

There's not much improvement when benchmarking the ml-100k dataset.

| Movielens 100k (before) | RMSE | MAE | Time |
| --- | --- | --- | --- |
| k-NN | 0.98 | 0.774 | 0:00:17 |
| Centered k-NN | 0.951 | 0.749 | 0:00:18 |
| k-NN Baseline | 0.931 | 0.733 | 0:00:21 |

| Movielens 100k (after) | RMSE | MAE | Time |
| --- | --- | --- | --- |
| k-NN | 0.98 | 0.774 | 0:00:16 |
| Centered k-NN | 0.951 | 0.749 | 0:00:17 |
| k-NN Baseline | 0.931 | 0.733 | 0:00:19 |

Using the ml-1m dataset, my implementation is slower for some unknown reason, which made me doubt how consistent the performance of the GitHub VMs really is.

| Movielens 1M (before) | RMSE | MAE | Time |
| --- | --- | --- | --- |
| k-NN | 0.923 | 0.727 | 0:11:51 |
| Centered k-NN | 0.929 | 0.738 | 0:12:03 |
| k-NN Baseline | 0.895 | 0.706 | 0:11:36 |

| Movielens 1M (after) | RMSE | MAE | Time |
| --- | --- | --- | --- |
| k-NN | 0.923 | 0.727 | 0:13:31 |
| Centered k-NN | 0.929 | 0.738 | 0:13:28 |
| k-NN Baseline | 0.895 | 0.706 | 0:13:40 |

I ran the benchmarks again on a c5.2xlarge AWS EC2 instance and there the results make more sense:

| Movielens 1M (before) | RMSE | MAE | Time |
| --- | --- | --- | --- |
| k-NN | 0.923 | 0.727 | 0:09:22 |

| Movielens 1M (after) | RMSE | MAE | Time |
| --- | --- | --- | --- |
| k-NN | 0.923 | 0.727 | 0:07:45 |

My guess is that the improvement looks smaller when running the full benchmark because most of the time is spent testing the algorithm and generating predictions rather than training it.
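
A quick way to check that split (not from this PR; the dataset, similarity and cv settings below are only illustrative) is to compare the fit_time and test_time that cross_validate reports:

    from statistics import mean

    from surprise import Dataset, KNNBasic
    from surprise.model_selection import cross_validate

    data = Dataset.load_builtin("ml-100k")  # swap in "ml-1m" for the larger benchmark
    algo = KNNBasic(sim_options={"name": "msd", "user_based": False})

    results = cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=False)

    # fit_time covers training (including the similarity computation),
    # test_time covers generating the predictions on the test folds
    print("mean fit time:  %.2f s" % mean(results["fit_time"]))
    print("mean test time: %.2f s" % mean(results["test_time"]))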