I just discovered that there are official benchmarks. I ran them using GitHub workflows, which was a terrible idea because the VMs they run on are veeeeeeeery sloooooow.
There's not much improvement when benchmarking the ml-100k dataset.

Before

| Movielens 100k | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.98 | 0.774 | 0:00:17 |
| Centered k-NN | 0.951 | 0.749 | 0:00:18 |
| k-NN Baseline | 0.931 | 0.733 | 0:00:21 |
After

| Movielens 100k | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.98 | 0.774 | 0:00:16 |
| Centered k-NN | 0.951 | 0.749 | 0:00:17 |
| k-NN Baseline | 0.931 | 0.733 | 0:00:19 |
Using the ml-1m dataset, my implementation is slower for some unknown reason, which made me doubt the consistency of the GitHub VMs' performance.
Before

| Movielens 1M | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.923 | 0.727 | 0:11:51 |
| Centered k-NN | 0.929 | 0.738 | 0:12:03 |
| k-NN Baseline | 0.895 | 0.706 | 0:11:36 |
After

| Movielens 1M | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.923 | 0.727 | 0:13:31 |
| Centered k-NN | 0.929 | 0.738 | 0:13:28 |
| k-NN Baseline | 0.895 | 0.706 | 0:13:40 |
I ran the benchmarks again on a c5.2xlarge AWS EC2 instance and the benchmark results make more sense.
Before

| Movielens 1M | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.923 | 0.727 | 0:09:22 |
After

| Movielens 1M | RMSE | MAE | Time |
|---|---|---|---|
| k-NN | 0.923 | 0.727 | 0:07:45 |
My guess is that the improvement is smaller when running the full benchmark because most of the time is spent testing the algorithm and generating predictions rather than training it.
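For what it's worth, `cross_validate` reports `fit_time` and `test_time` separately, so a quick check along these lines (a sketch, not the exact benchmark script) should show the test phase dominating:

```python
from surprise import Dataset, KNNBasic
from surprise.model_selection import cross_validate

# ml-100k is downloaded on first use
data = Dataset.load_builtin("ml-100k")
algo = KNNBasic()

# verbose=True prints per-fold fit_time and test_time, which shows how much
# of the total benchmark is spent generating predictions vs. training.
results = cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)
print("total fit time: ", sum(results["fit_time"]))
print("total test time:", sum(results["test_time"]))
```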
I'm trying to test multiple recommendation algorithms on a very big dataset (6M ratings, 69k users, 17k items) and I noticed that for algorithms that compute similarities the `fit` time is very high. I found that there are some redundant operations that could be eliminated in order to speed up the similarity computation.
In `similarities.pyx`, when computing auxiliary values like the `freq` and `prods` arrays, `y_ratings` is not sorted, so we have to fully iterate over `y_ratings` twice, resulting in N² operations for each `y_ratings` in `yr`. We can take advantage of the fact that all auxiliary matrices are symmetric (`M[x][y] == M[y][x]`), so filling the half above the diagonal is enough to compute the similarities.
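To illustrate, here's a simplified pure-Python rendition of the current pattern (the real code is Cython; names follow `similarities.pyx`, where `yr` maps each y to its list of `(x, rating)` pairs):

```python
import numpy as np

def cosine_full(n_x, yr, min_support=1):
    # Accumulators over all item pairs: dot products of co-ratings,
    # squared norms restricted to common raters, and co-rating counts.
    prods = np.zeros((n_x, n_x))
    freq = np.zeros((n_x, n_x), dtype=np.int64)
    sqi = np.zeros((n_x, n_x))
    sqj = np.zeros((n_x, n_x))

    for y, y_ratings in yr.items():
        # Full N x N double loop: every ordered pair (xi, xj) is visited,
        # so each entry and its mirror image are both filled.
        for xi, ri in y_ratings:
            for xj, rj in y_ratings:
                prods[xi, xj] += ri * rj
                freq[xi, xj] += 1
                sqi[xi, xj] += ri ** 2
                sqj[xi, xj] += rj ** 2

    # The final loop already reads only the upper triangle and mirrors it,
    # so the lower halves of the accumulators are computed but never used.
    sim = np.zeros((n_x, n_x))
    for xi in range(n_x):
        sim[xi, xi] = 1
        for xj in range(xi + 1, n_x):
            if freq[xi, xj] >= min_support:
                denom = np.sqrt(sqi[xi, xj] * sqj[xi, xj])
                sim[xi, xj] = prods[xi, xj] / denom if denom else 0
            sim[xj, xi] = sim[xi, xj]
    return sim
```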
My change is sorting the `y_ratings` lists before filling the auxiliary arrays and changing the second for loop over `y_ratings` so that only the top half of each auxiliary matrix is filled. This reduces the number of operations for each element in `yr` from N² to N(N-1)/2. The sorting might reduce performance for small datasets, but for larger datasets like movielens-1m the performance gains are very noticeable.
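A sketch of the proposed change, again in simplified pure Python (the actual edit is in the `.pyx` file):

```python
def cosine_upper_triangle(n_x, yr, min_support=1):
    prods = np.zeros((n_x, n_x))
    freq = np.zeros((n_x, n_x), dtype=np.int64)
    sqi = np.zeros((n_x, n_x))
    sqj = np.zeros((n_x, n_x))

    for y, y_ratings in yr.items():
        # Sorting by x guarantees xi < xj in the inner loop, so only the
        # upper triangle is filled: N(N-1)/2 iterations instead of N².
        y_ratings.sort()
        for i, (xi, ri) in enumerate(y_ratings):
            for xj, rj in y_ratings[i + 1:]:
                prods[xi, xj] += ri * rj
                freq[xi, xj] += 1
                sqi[xi, xj] += ri ** 2
                sqj[xi, xj] += rj ** 2

    # Unchanged: only entries with xi < xj are read, then mirrored.
    sim = np.zeros((n_x, n_x))
    for xi in range(n_x):
        sim[xi, xi] = 1
        for xj in range(xi + 1, n_x):
            if freq[xi, xj] >= min_support:
                denom = np.sqrt(sqi[xi, xj] * sqj[xi, xj])
                sim[xi, xj] = prods[xi, xj] / denom if denom else 0
            sim[xj, xi] = sim[xi, xj]
    return sim
```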
Of course, all tests pass successfully. I also benchmarked the time needed to train (the `.fit` method call) a KNNBasic algorithm over 50 samples. Dataset: movielens-1m.
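For reference, a minimal harness for that kind of measurement could look like this (a hypothetical script, not the exact one used; the `sim_options` shown are the surprise defaults):

```python
import time
from surprise import Dataset, KNNBasic

# ml-1m is downloaded on first use
data = Dataset.load_builtin("ml-1m")
trainset = data.build_full_trainset()

# Time only the training step (the .fit call) over repeated runs.
timings = []
for _ in range(50):
    algo = KNNBasic(sim_options={"name": "msd", "user_based": True})
    start = time.perf_counter()
    algo.fit(trainset)
    timings.append(time.perf_counter() - start)

print(f"mean fit time: {sum(timings) / len(timings):.2f}s")
```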