benfred / implicit

Fast Python Collaborative Filtering for Implicit Feedback Datasets
https://benfred.github.io/implicit/
MIT License

Checking on Performance - Very Poor Versus LightFM #647

Closed: BrianMiner closed this issue 1 year ago

BrianMiner commented 1 year ago

I ran a toy example on MovieLens with both Implicit and LightFM (and a couple of other methods). Implicit performs so poorly by comparison that I think I must be doing something wrong. Here is the self-contained code for both libraries:

import pandas as pd
import numpy as np
import implicit
from lightfm import LightFM
from scipy.sparse import csr_matrix, coo_matrix

READ AND PREPARE DATA

from lightfm.datasets import fetch_movielens

data = fetch_movielens(min_rating=5)

# train and test df
pdf_train = pd.DataFrame({'user': data['train'].nonzero()[0], 'item': data['train'].nonzero()[1]})
pdf_test = pd.DataFrame({'user': data['test'].nonzero()[0], 'item': data['test'].nonzero()[1]})

# keep only test rows whose item also appears in train
pdf_test = pdf_test.merge(pdf_train.item.drop_duplicates(), on = 'item')
# keep only test rows whose user also appears in train
pdf_test = pdf_test.merge(pdf_train.user.drop_duplicates(), on = 'user')

# map items to consecutive ints
items_pdf = pd.DataFrame(pdf_train.item.unique()).reset_index()
items_pdf.columns  = ['value', 'item']
dct_item_to_int = dict(zip(items_pdf.item.values, items_pdf.value.values ))

# np.int was removed in NumPy 1.24; the builtin int is the equivalent dtype
pdf_train['item'] = pdf_train.item.map(dct_item_to_int).values.astype(int)
pdf_test['item'] = pdf_test.item.map(dct_item_to_int).values.astype(int)

# shuffle train
pdf_train = pdf_train.sample(frac = 1, replace = False)

# assumes labeled 0 to n-1
n_users = pdf_train['user'].max()+1
n_items = pdf_train['item'].max()+1

test_users = pdf_test.user.sort_values().unique()
test_items = pdf_train.item.sort_values().unique()  # can only predict items in train for test set

pdf_test_array = pdf_test.groupby('user')['item'].apply(np.array)

np.setdiff1d(test_users, pdf_train.user.sort_values().unique())

IMPLICIT

item_user = csr_matrix((np.ones(pdf_train.shape[0]), (pdf_train.item.values, pdf_train.user.values)))
user_item = csr_matrix((np.ones(pdf_train.shape[0]), (pdf_train.user.values, pdf_train.item.values)))

# initialize a model
model = implicit.als.AlternatingLeastSquares(factors=50, iterations = 15)

# train the model on a sparse matrix of item/user/confidence weights
model.fit(item_user)

pred_dct = {}
for user in test_users:
    pred_dct[user] = model.recommend(user, user_item[user], filter_already_liked_items = False, N=5)[0]

print(f'Hit Rate: {np.mean([int(len(set(np.intersect1d(pdf_test_array.loc[u], pred_dct[u]))) > 0) for u in test_users])}')
print(f'Recall Rate: {np.mean([len(set(np.intersect1d(pdf_test_array.loc[u], pred_dct[u]))) / len(pdf_test_array.loc[u]) for u in test_users])}')

Hit Rate: 0.016216216216216217
Recall Rate: 0.005363577863577863

LIGHT FM

user_item = coo_matrix((np.ones(pdf_train.shape[0]), (pdf_train.user.values, pdf_train.item.values)))

model = LightFM(loss='bpr')
model.fit(user_item, epochs=30, num_threads=2)

pred_dct = {}

for user in test_users:
    preds= model.predict(int(user), test_items)
    ind = np.argpartition(preds, -5)[-5:]
    pred_dct[int(user)] = test_items[ind]

print(f'Hit Rate: {np.mean([int(len(set(np.intersect1d(np.array(pdf_test_array.loc[u]), pred_dct[u]))) > 0) for u in test_users])}')
print(f'Recall Rate: {np.mean([len(set(np.intersect1d(np.array(pdf_test_array.loc[u]), pred_dct[u]))) / len(pdf_test_array.loc[u]) for u in test_users])}')

Hit Rate: 0.19864864864864865
Recall Rate: 0.08021610896610895
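
For readers comparing the two numbers, the hit rate and recall one-liners used for both libraries can be unpacked into a small helper. This is only a sketch of the same computation in plain Python; the function name `hit_rate_and_recall` and the dict-based inputs are illustrative assumptions, not part of either library.

```python
def hit_rate_and_recall(held_out, recommended):
    """Average hit rate and recall over users.

    held_out:    dict mapping user -> iterable of held-out (test) item ids
    recommended: dict mapping user -> iterable of recommended item ids
    Both names are hypothetical; they stand in for pdf_test_array and pred_dct.
    """
    hits, recalls = [], []
    for user, truth in held_out.items():
        truth = set(truth)
        if not truth:
            continue  # a user with no held-out items cannot be scored
        overlap = truth & set(recommended.get(user, ()))
        hits.append(1.0 if overlap else 0.0)       # at least one test item recommended
        recalls.append(len(overlap) / len(truth))  # share of test items recommended
    if not hits:
        return 0.0, 0.0
    return sum(hits) / len(hits), sum(recalls) / len(recalls)
```

With `pdf_test_array.to_dict()` as `held_out` and `pred_dct` as `recommended`, this should reproduce the two printed metrics, assuming each user's test items are unique.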

BrianMiner commented 1 year ago

OK I think I know the issue. The documentation of the library seems to be very inconsistent. I googled and found https://implicit.readthedocs.io/en/latest/models.html

where fit calls for an item_user matrix, which is what I created above. Instead, it seems that a user_item matrix is needed. Making this change produced a model on par with the other libs.
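
A minimal sketch of the corrected Implicit call, assuming a version of the library whose `fit` expects users as rows (as found above) and whose `recommend` returns an `(ids, scores)` pair, as in the original snippet. The helpers `check_user_item_shape` and `fit_and_recommend` are hypothetical names added here for illustration; the shape check is just one way to catch a transposed matrix early, since passing item_user does not raise an error on its own.

```python
def check_user_item_shape(shape, n_users, n_items):
    """Raise ValueError unless shape is (n_users, n_items). Hypothetical helper."""
    if tuple(shape) != (n_users, n_items):
        raise ValueError(
            f"expected a user_item matrix of shape {(n_users, n_items)}, "
            f"got {tuple(shape)}"
        )


def fit_and_recommend(user_item, n_users, n_items, users, top_n=5):
    """Fit ALS on a users-by-items CSR matrix and return top-N item ids per user."""
    # Imported lazily so the shape check above can be used without implicit installed.
    from implicit.als import AlternatingLeastSquares

    check_user_item_shape(user_item.shape, n_users, n_items)
    model = AlternatingLeastSquares(factors=50, iterations=15)
    model.fit(user_item)  # rows are users, columns are items
    return {
        # recommend returns (ids, scores); keep only the ids, as in the original loop
        u: model.recommend(u, user_item[u], N=top_n,
                           filter_already_liked_items=False)[0]
        for u in users
    }
```

For the code above this would be called as `fit_and_recommend(user_item, n_users, n_items, test_users)` with the CSR `user_item` matrix. Note that the shape check cannot detect a transposed matrix when the number of users equals the number of items.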

benfred commented 1 year ago

@BrianMiner - the RTD docs site isn't being used anymore and is out of date. Up-to-date docs are at https://benfred.github.io/implicit/

Thanks for bringing this up - I've removed the site at https://implicit.readthedocs.io/ so that other people won't hit the same issue you had