benfred / implicit

Fast Python Collaborative Filtering for Implicit Feedback Datasets
https://benfred.github.io/implicit/
MIT License

lastfm tutorial performs transpose on data, but generation code no longer requires it #644

Open alastair opened 1 year ago

alastair commented 1 year ago

Hi,

In the lastfm tutorial, there is a specific step:

# get the transpose since the most of the functions in implicit expect (user, item) sparse matrices instead of (item, user)
user_plays = artist_user_plays.T.tocsr()

However, this may no longer be necessary in some cases. The commit https://github.com/benfred/implicit/commit/32c06aa669f7597d69c1a9a1c56cf1a1d0c5f1ce#diff-b8a4c78fbfcc629a3d35255010d1a4ae21d5909664b8d3c1283da18359ae5a0aL77-R77 made some changes that also swapped the order of users/artists when building the sparse matrix. Therefore, if we generate a new copy of the hdf5 file from the source data file, the artist_user_plays matrix is already in the correct (user, artist) orientation, and the transpose in the tutorial would flip it back to the wrong one.
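To illustrate the point, here is a minimal sketch using only scipy and made-up toy data (the arrays and sizes below are assumptions for illustration, not the real lastfm data or the actual generation code). It builds the same play counts in both index orders and shows what the tutorial's transpose does to each:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy play counts (made-up): user index, artist index, number of plays.
users = np.array([0, 0, 1, 2])
artists = np.array([0, 1, 1, 0])
plays = np.array([5.0, 1.0, 3.0, 2.0])
n_users, n_artists = 3, 2

# Older layout: rows are artists, columns are users.
artist_user = coo_matrix(
    (plays, (artists, users)), shape=(n_artists, n_users)
).tocsr()

# Newer layout: rows are users, columns are artists.
user_artist = coo_matrix(
    (plays, (users, artists)), shape=(n_users, n_artists)
).tocsr()

# The tutorial's transpose turns the older layout into (user, artist) ...
fixed = artist_user.T.tocsr()

# ... but applied to the newer layout it flips the data back to (artist, user).
flipped = user_artist.T.tocsr()

print(fixed.shape, flipped.shape)
```

So the same tutorial line is right for the downloaded file and wrong for a freshly generated one, which is where the confusion comes from.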

However, it does seem that the binary hdf5 file which is downloaded by the tool was generated with the older version of this code, which is still using the (artist, user) format.

To reduce confusion, it would be a good idea either to re-generate the binary hdf5 file and remove the transpose from the tutorial, or to revert the dimension change in the matrix generation step.
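Until one of those happens, a possible workaround on the user side is to check the shape before transposing. The sketch below is only an assumption about how that could look: `as_user_item` is a hypothetical helper, not part of implicit, and it assumes you know the user and artist counts (for example from the lengths of the users and artists arrays loaded alongside the matrix). It cannot decide anything if the two counts happen to be equal.

```python
import numpy as np
from scipy.sparse import csr_matrix


def as_user_item(matrix, n_users, n_artists):
    """Hypothetical helper: return `matrix` as a (user, artist) CSR matrix,
    transposing only when it is stored as (artist, user)."""
    if n_users == n_artists:
        # The shape alone cannot tell the two orientations apart.
        raise ValueError("equal user and artist counts: orientation is ambiguous")
    if matrix.shape == (n_users, n_artists):
        return matrix.tocsr()
    if matrix.shape == (n_artists, n_users):
        return matrix.T.tocsr()
    raise ValueError(f"unexpected shape {matrix.shape}")


# Toy example (made-up data): 2 artists, 3 users, stored as (artist, user).
old_style = csr_matrix(np.array([[5.0, 0.0, 2.0],
                                 [1.0, 3.0, 0.0]]))
user_plays = as_user_item(old_style, n_users=3, n_artists=2)

# Calling it again on the already-correct matrix leaves the orientation alone.
same = as_user_item(user_plays, n_users=3, n_artists=2)
print(user_plays.shape, same.shape)
```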

benfred commented 1 year ago

Yeah - that's a great callout. The datasets were generated before the API refactor in #481, and we really should generate new ones with transposed data.