benfred / implicit

Fast Python Collaborative Filtering for Implicit Feedback Datasets
https://benfred.github.io/implicit/
MIT License
3.57k stars · 612 forks

many code issues in the tutorial of the documentation #544

Closed hugocool closed 2 years ago

hugocool commented 2 years ago

There are several issues in the tutorial for this package. The tutorial code reads:

from implicit.datasets.lastfm import get_lastfm

artists, users, artist_user_plays = get_lastfm()

from implicit.nearest_neighbours import bm25_weight

# weight the matrix, both to reduce impact of users that have played the same artist thousands of times
# and to reduce the weight given to popular items
artist_user_plays = bm25_weight(artist_user_plays, K1=100, B=0.8)

# get the transpose since the most of the functions in implicit expect (user, item) sparse matrices instead of (item, user)
user_plays = artist_user_plays.T.tocsr()

from implicit.als import AlternatingLeastSquares

model = AlternatingLeastSquares(factors=64, regularization=0.05)
model.fit(2 * user_plays)

The first is

model.fit(2 * user_plays)

Why 2 * user_plays? If there is a reason the confidence weights should be doubled for this implementation of the algorithm, it should be documented.

Secondly

userid = 12345
ids, scores = model.recommend(userid, user_plays[userid])

results in


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_52/3490455469.py in <module>
      1 userid = 12345
----> 2 ids, scores = model.recommend(userid, user_plays[userid])

implicit/recommender_base.pyx in implicit.recommender_base.MatrixFactorizationBase.recommend()

/opt/conda/lib/python3.7/site-packages/scipy/sparse/_index.py in __getitem__(self, key)
     31     """
     32     def __getitem__(self, key):
---> 33         row, col = self._validate_indices(key)
     34         # Dispatch to specialized methods.
     35         if isinstance(row, INT_TYPES):

/opt/conda/lib/python3.7/site-packages/scipy/sparse/_index.py in _validate_indices(self, key)
    132             row = int(row)
    133             if row < -M or row >= M:
--> 134                 raise IndexError('row index (%d) out of range' % row)
    135             if row < 0:
    136                 row += M

IndexError: row index (12345) out of range

I don't know exactly what is going on here. It could be that a newer version of numpy/scipy is yielding a different shape of sparse arrays than you expected?

In addition, the recommend similar items code:

ids, scores = model.similar_items(252512)

results in


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_34/4089806412.py in <module>
----> 1 ids, scores= model.similar_items(252512)

ValueError: too many values to unpack (expected 2)

Because model.similar_items(252512) yields

[(252512, 0.99999994),
 (322476, 0.9348197),
 (346679, 0.93439937),
 (15681, 0.932505),
 (6190, 0.93157953),
 (48315, 0.93144983),
 (299930, 0.92641133),
 (179122, 0.92457074),
 (196575, 0.9240719),
 (303211, 0.92251533)]

so these tuples should be unpacked and re-zipped as follows: ids, scores = zip(*model.similar_items(252512))
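As a quick illustration of that unpacking (using a hypothetical hard-coded result list rather than a real model):

```python
# Hypothetical 0.4.x-style return value: a list of (item id, score) tuples.
results = [(252512, 0.9999), (322476, 0.9348), (346679, 0.9344)]

# zip(*...) transposes the list of pairs into one tuple of ids
# and one tuple of scores.
ids, scores = zip(*results)
```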

Displaying the similar artists should then be:

# display the results using pandas for nicer formatting
pd.DataFrame({"artist": artists[list(ids)], "score": scores})

However this results in out of bound indices:

IndexError: index 322476 is out of bounds for axis 0 with size 292385

because the recommender is recommending items that don't exist. The in-bounds recommendations don't make any sense either, nor do they conform to the expected output.

benfred commented 2 years ago

I'm pretty sure the tutorial works, but you will need the latest version of implicit installed.

There were many breaking API changes in v0.5.0, and the tutorial is built against the newer API. It looks to me like you have an older version of implicit installed based off the error messages you're reporting.
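To illustrate why the unpacking fails on the older return type (these helper functions are hypothetical stand-ins for the two conventions, not the real implicit API):

```python
import numpy as np

def recommend_old_style(userid):
    # 0.4.x-style result: a list of (id, score) tuples
    return [(1, 0.9), (2, 0.8), (3, 0.7)]

def recommend_new_style(userid):
    # 0.5.0-style result: a tuple of two arrays
    return np.array([1, 2, 3]), np.array([0.9, 0.8, 0.7])

# Works: exactly two values to unpack.
ids, scores = recommend_new_style(0)

# Fails: trying to unpack a list of three tuples into two names
# raises "ValueError: too many values to unpack (expected 2)".
try:
    ids, scores = recommend_old_style(0)
except ValueError as e:
    print(e)
```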

Can you verify that you have the latest version of implicit installed? What does this print out ?

import implicit
print(implicit.__version__)
hugocool commented 2 years ago

Ahh, okay, I just checked and the version I was on was 0.4.4. But the install hangs for some reason: when I run !pip install implicit --upgrade in a cloud notebook (whether it is Colab, Kaggle, or SageMaker), the install cell just hangs for half an hour. Any ideas why installing the new version doesn't work? The old version installed just fine.

Oh, that still leaves the 2 * user_plays though: why is that necessary?

benfred commented 2 years ago

!pip install implicit --upgrade in a cloud notebook (whether it is Colab, Kaggle, or SageMaker), the install cell just hangs for half an hour.

Using pip will compile from source right now, which can take a long time. We're tracking uploading prebuilt binaries to PyPI here: https://github.com/benfred/implicit/issues/539.

One thing you can do to speed up compilation is to only build for the current GPU architecture. There are some tips here: https://github.com/benfred/implicit/issues/537

That leaves the 2*user_plays though, why is that necessary?

The '2' corresponds to the alpha parameter in the original paper. This is giving more weight to positive examples.
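A minimal sketch of that weighting, assuming the standard confidence formulation c_ui = 1 + alpha * r_ui from the original Hu, Koren & Volinsky paper (the "1 +" part is handled inside the solver, so the caller only scales the raw counts by alpha):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy (user, artist) play-count matrix holding raw counts r_ui.
user_plays = csr_matrix(np.array([[0., 3., 0.],
                                  [5., 0., 1.]]))

# Scaling the input by alpha before calling model.fit() supplies the
# alpha * r_ui term of the confidence; here alpha = 2, which is where
# the tutorial's `2 * user_plays` comes from.
alpha = 2
weighted = alpha * user_plays  # this is what would be passed to model.fit()
```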

hugocool commented 2 years ago

thanks for the clarification!

Would it be a good idea to add a note to the documentation (for example the top of the tutorial) to point out one should install the latest version with a specified set of flags? This could save you the trouble of replying to these issues (for which I am obviously grateful, thanks for this package!)

benfred commented 2 years ago

I've added binary wheels to PyPI - you should be able to install implicit on Colab/Kaggle etc. in a couple of seconds now, with the GPU extension built.

Would it be a good idea to add a note to the documentation (for example the top of the tutorial) to point out one should install the latest version with a specified set of flags?

The API hopefully won't change again soon - let's wait and see how many times this occurs =).

Glad you're finding this useful!