david-cortes / ctpfrec

Python implementation of "Content-based recommendations with poisson factorization", with some extensions
BSD 2-Clause "Simplified" License

ctpfrec is unable to perform out-of-matrix prediction #3

Open JackMack21 opened 3 years ago

JackMack21 commented 3 years ago

It appears that ctpfrec is unable to make out-of-matrix predictions, i.e. it can't recommend items that have no ratings/clicks/plays/etc.

You asked me to upload a toy dataset that shows the problem, which I am having trouble doing. I am also unable to upload the datasets I am using due to GDPR.

It is however very simple: I have three sets (in the required pandas triplet form {"UserId" : , "ItemId" : , "Count" : }) of user click data user_counts_train, user_counts_validation and user_counts_test, and another set word_counts for the items (in the required pandas triplet form {"ItemId" : , "WordId" : , "Count" : }). Importantly, there are no items in the three user sets that aren't in the word_counts set.
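The constraint stated above (every item in the user click sets also appears in word_counts) can be checked before fitting. The following is only a minimal sketch, assuming the ID columns are available as plain Python iterables; the helper name `items_without_words` and the toy IDs are hypothetical and are not part of ctpfrec:

```python
def items_without_words(click_item_ids, word_item_ids):
    """Return the sorted item IDs that have clicks but no word counts."""
    known = set(word_item_ids)
    return sorted(set(click_item_ids) - known)


# Toy IDs for illustration only (not taken from the real dataset).
word_item_ids = [1, 2, 3, 4]
train_item_ids = [1, 1, 2, 2]
test_item_ids = [4, 5]

missing_train = items_without_words(train_item_ids, word_item_ids)
missing_test = items_without_words(test_item_ids, word_item_ids)

print(missing_train)  # []
print(missing_test)   # [5]
```

With the data frames described above, the arguments would be columns such as `user_counts_test["ItemId"]` and `word_counts["ItemId"]`; an empty result for all three user sets confirms the constraint holds.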

I fit my model using the training and validation sets:

recommender.fit(counts_df=user_counts_train, words_df=word_counts, val_set=user_counts_validation)

The issue is when I attempt to make an out-of-matrix prediction using an item that appears only in the user_counts_test and word_counts sets via:

new_user_count = pd.DataFrame({'UserId': 1.,'ItemId': [48081576,48081576,48081576],'Count': [1,1,2]}) # user clicks on item not in the training or validation sets
recommender4.add_users(new_user_count) # add the new user's click history to recommender4
recs = recommender4.topN(user = 1, n=k, exclude_seen = False) # output top k recommendations 

Is the issue with ctpfrec itself, or the way I am attempting to add a new user history and make predictions with topN?

Thank you

david-cortes commented 3 years ago

This is by design and it's controlled through the parameter missing_items.

JackMack21 commented 3 years ago

I forgot to add -- here is the log:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-b9ad7febc0b6> in <module>()
      1 new_user_count = pd.DataFrame({'UserId': 1.,'ItemId': [48081576,48081576,48081576],'Count': [1,1,2]})
----> 2 recommender4.add_users(new_user_count)
      3 recs = recommender4.topN(user = 1, n=k, exclude_seen = False) # think about excluding seen

/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in add_users(self, counts_df, user_df, maxiter, stop_thr, ncores, random_seed)
   1832                 elif (user_df is None) and (counts_df is not None):
   1833 
-> 1834                         counts_df, new_user_mapping = self._process_extra_df(counts_df, ttl='counts_df')
   1835                         counts_df['UserId'] -= self.nusers
   1836                         new_max_id = int(counts_df.UserId.max() + 1)

/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in _process_extra_df(self, df, ttl, df2)
    793                                         warnings.warn(msg)
    794                                 else:
--> 795                                         raise ValueError("'" + ttl + "' must contain " + subj2 + "s from the training set.")
    796 
    797                         new_ids1 = df[col1].unique()

ValueError: 'counts_df' must contain items from the training set.

JackMack21 commented 3 years ago

So am I right to assume that I should have missing_items='include' in the CTPF call?

david-cortes commented 3 years ago

Yes, if you want to make predictions about them, you should pass missing_items='include'.

JackMack21 commented 3 years ago

Thanks very much for your help

JackMack21 commented 3 years ago

I am still having issues with the out-of-matrix prediction. After passing missing_items='include':

recommender4 = CTPF(k = 100, missing_items = 'include', random_seed = 123, ncores = -1)
recommender4.fit(counts_df=user_counts_train, words_df=word_counts, val_set=user_counts_validation) # with validation set

Then attempting to add a new user with:

new_user_count = pd.DataFrame({'UserId': -1,'ItemId': [48081576,48081576,48081576],'Count': [1,1,2]})
recommender4.add_users(new_user_count)

I'm presented with the log:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-6b2201af57f5> in <module>()
      1 new_user_count = pd.DataFrame({'UserId': -1,'ItemId': [48081576,48081576,48081576],'Count': [1,1,2]})
----> 2 recommender4.add_users(new_user_count)
      3 # recs = recommender4.topN(user = 1, n=k, exclude_seen = False) # think about excluding seen

/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in add_users(self, counts_df, user_df, maxiter, stop_thr, ncores, random_seed)
   1832                 elif (user_df is None) and (counts_df is not None):
   1833 
-> 1834                         counts_df, new_user_mapping = self._process_extra_df(counts_df, ttl='counts_df')
   1835                         counts_df['UserId'] -= self.nusers
   1836                         new_max_id = int(counts_df.UserId.max() + 1)

/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in _process_extra_df(self, df, ttl, df2)
    793                                         warnings.warn(msg)
    794                                 else:
--> 795                                         raise ValueError("'" + ttl + "' must contain " + subj2 + "s from the training set.")
    796 
    797                         new_ids1 = df[col1].unique()

ValueError: 'counts_df' must contain items from the training set.

The ItemId [48081576] isn't in the user_counts_train data frame, but is in the word_counts data frame.
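One way to double-check this claim is to classify the ID against both sets of item IDs. This is only a sketch: `classify_item` is a hypothetical helper, not a ctpfrec function, and the toy IDs other than 48081576 are made up. It compares IDs exactly, so an ID stored as a string in one frame and as an integer in the other is reported as not found:

```python
def classify_item(item_id, train_item_ids, word_item_ids):
    """Label an item by where its ID appears (exact, type-sensitive match)."""
    in_train = item_id in set(train_item_ids)
    in_words = item_id in set(word_item_ids)
    if in_train and in_words:
        return "in-matrix"        # has clicks and word counts
    if in_words:
        return "out-of-matrix"    # word counts only, no clicks in training
    if in_train:
        return "clicks only"      # clicks but no word counts
    return "unknown"              # appears in neither set


# Toy IDs for illustration; only 48081576 comes from the thread.
train_item_ids = [10, 11]
word_item_ids = [10, 11, 48081576]

label_int = classify_item(48081576, train_item_ids, word_item_ids)
label_str = classify_item("48081576", train_item_ids, word_item_ids)

print(label_int)  # out-of-matrix
print(label_str)  # unknown
```

The second call shows how a type mismatch between ID columns would make an item look absent even though the same digits appear in word_counts.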

david-cortes commented 3 years ago

I'm unable to reproduce. The following runs correctly on my machine:

import numpy as np, pandas as pd
from ctpfrec import CTPF

words_count = pd.DataFrame({
    "ItemId" : [1,2,3,4],
    "WordId" : [1,1,2,2],
    "Count" :  [3,3,3,3]
})
counts_train = pd.DataFrame({
    "UserId" : [1,2,3,4],
    "ItemId" : [1,1,2,2],
    "Count"  : [3,3,3,3]
})
counts_val = pd.DataFrame({
    "UserId" : [1,2,3,4],
    "ItemId" : [3,3,3,3],
    "Count"  : [3,3,3,3]
})
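# New user whose only interaction is with item 4, which has word counts
# but appears in neither counts_train nor counts_val.
new_user_df = pd.DataFrame({
    "UserId" : [5],
    "ItemId" : [4],
    "Count"  : [3]
})

# missing_items='include' so that items with word counts but no clicks can be predicted
model = CTPF(k = 2, missing_items = 'include', verbose=False)
model.fit(counts_df=counts_train, words_df=words_count, val_set=counts_val)
model.add_users(counts_df=new_user_df)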