NicolasHug / Surprise

A Python scikit for building and analyzing recommender systems
http://surpriselib.com
BSD 3-Clause "New" or "Revised" License

Improve similarities performance #448

Open mihaiblidaru opened 2 years ago

mihaiblidaru commented 2 years ago

I'm testing multiple recommendation algorithms on a very large dataset (6M ratings, 69k users, 17k items), and I noticed that the fit time is very high for the algorithms that compute similarities.

I found some redundant operations that could be eliminated to speed up the similarity computation.

In similarities.pyx, when computing the auxiliary values (the freq, prods, sqi and sqj arrays), the nested loops iterate over y_ratings twice in full because y_ratings is not sorted, resulting in N² operations for each y_ratings list in yr:

    for y, y_ratings in yr.items():
        for xi, ri in y_ratings:
            for xj, rj in y_ratings:
                freq[xi, xj] += 1
                prods[xi, xj] += ri * rj
                sqi[xi, xj] += ri**2
                sqj[xi, xj] += rj**2            

We can take advantage of the fact that all auxiliary matrices are symmetric (M[x][y] == M[y][x]), so filling only the half above the diagonal is enough to compute the similarities.

My change sorts the y_ratings lists before filling the auxiliary arrays and changes the inner for loop over y_ratings so that only the upper half of each auxiliary matrix is filled. This reduces the number of operations for each element in yr from N² to N(N-1)/2.

    sorted_yr = {y: sorted(y_ratings, key=lambda x: x[0]) for y, y_ratings in yr.items()}

    for y, y_ratings in sorted_yr.items():
        for i, (xi, ri) in enumerate(y_ratings):
            for xj, rj in y_ratings[i + 1:]:
                # xi < xj here, so only the upper triangle is filled
                freq[xi, xj] += 1
                prods[xi, xj] += ri * rj
                sqi[xi, xj] += ri**2
                sqj[xi, xj] += rj**2
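
For context, the downstream similarity loops only ever need the xi < xj entries, since each value is mirrored into both halves of sim. Roughly, for the cosine case, it looks like the sketch below (plain Python/NumPy rather than the actual Cython code, with cosine_from_upper_triangle and min_support as illustrative names):

    import numpy as np

    def cosine_from_upper_triangle(prods, freq, sqi, sqj, min_support):
        # prods, freq, sqi and sqj are (n_x, n_x) arrays filled only for xi < xj.
        n_x = prods.shape[0]
        sim = np.zeros((n_x, n_x))
        for xi in range(n_x):
            sim[xi, xi] = 1
            for xj in range(xi + 1, n_x):
                if freq[xi, xj] < min_support:
                    sim[xi, xj] = 0
                else:
                    denom = np.sqrt(sqi[xi, xj] * sqj[xi, xj])
                    sim[xi, xj] = prods[xi, xj] / denom if denom else 0
                # mirror the value into the lower half
                sim[xj, xi] = sim[xi, xj]
        return sim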

The sorting might reduce performance for small datasets, but for larger datasets like movielens-1m the performance gains are very noticeable.

Here are some performance tests I ran on my laptop (Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz). The times in the tables below are the time it took to train a KNNBasic algorithm (just the .fit method call), averaged over 50 samples.

Dataset: movielens-1m

| Similarity | Before (seconds) | After (seconds) | After/Before (%) |
|---|---|---|---|
| cosine | 32.58 | 13.78 | 42.29 % |
| msd | 29.94 | 11.14 | 37.20 % |
| pearson | 38.21 | 16.98 | 44.43 % |
| pearson_baseline | 38.26 | 17.86 | 46.68 % |

Dataset: movielens-100k

| Similarity | Before (seconds) | After (seconds) | After/Before (%) |
|---|---|---|---|
| cosine | 0.46 | 0.20 | 43.47 % |
| msd | 0.28 | 0.15 | 53.57 % |
| pearson | 0.58 | 0.27 | 46.55 % |
| pearson_baseline | 0.66 | 0.42 | 63.63 % |
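
For reference, timings like these can be collected with the public API; the loop below is only a sketch of the kind of measurement used (a single run per similarity measure instead of 50, on the built-in ml-100k dataset):

    import time

    from surprise import Dataset, KNNBasic

    # 'ml-100k' and 'ml-1m' are built-in datasets that Surprise can download.
    data = Dataset.load_builtin('ml-100k')
    trainset = data.build_full_trainset()

    for name in ('cosine', 'msd', 'pearson', 'pearson_baseline'):
        algo = KNNBasic(sim_options={'name': name})
        start = time.perf_counter()
        algo.fit(trainset)  # only the fit call is timed
        print(f'{name}: {time.perf_counter() - start:.2f}s')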

Of course, all tests pass successfully.

mihaiblidaru commented 2 years ago

I just discovered that there are official benchmarks. I ran them using GitHub workflows, which was a terrible idea because the VMs they run on are very slow.

There's not much improvement when benchmarking the ml-100k dataset.

Before:

| Movielens 100k | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.98 | 0.774 | 0:00:17 |
| Centered k-NN | 0.951 | 0.749 | 0:00:18 |
| k-NN Baseline | 0.931 | 0.733 | 0:00:21 |

After:

| Movielens 100k | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.98 | 0.774 | 0:00:16 |
| Centered k-NN | 0.951 | 0.749 | 0:00:17 |
| k-NN Baseline | 0.931 | 0.733 | 0:00:19 |

Using the ml-1m dataset, my implementation is slower for some unknown reason, which made me doubt the consistency of the GitHub VMs' performance.

Before:

| Movielens 1M | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.923 | 0.727 | 0:11:51 |
| Centered k-NN | 0.929 | 0.738 | 0:12:03 |
| k-NN Baseline | 0.895 | 0.706 | 0:11:36 |

After:

| Movielens 1M | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.923 | 0.727 | 0:13:31 |
| Centered k-NN | 0.929 | 0.738 | 0:13:28 |
| k-NN Baseline | 0.895 | 0.706 | 0:13:40 |

I ran the benchmarks again on a c5.2xlarge AWS EC2 instance and the results make more sense.

Before:

| Movielens 1M | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.923 | 0.727 | 0:09:22 |

After:

| Movielens 1M | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.923 | 0.727 | 0:07:45 |

My guess is that the improvement is lower when running the full benchmark because most of the time is spent testing the algorithm and generating predictions rather than training it.
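
One way to check that split is to look at the fit_time and test_time values reported by cross_validate; a minimal sketch (using the built-in ml-100k dataset, not the benchmark script itself):

    from surprise import Dataset, KNNBasic
    from surprise.model_selection import cross_validate

    data = Dataset.load_builtin('ml-100k')
    algo = KNNBasic()

    # cross_validate reports fit_time and test_time per fold, which shows how
    # much of the total is spent training vs. generating predictions.
    results = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5)
    print('fit:', sum(results['fit_time']), 'test:', sum(results['test_time']))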