Open JackMack21 opened 3 years ago
This is by design, and it's controlled through the parameter missing_items.
I forgot to add -- here is the log:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-b9ad7febc0b6> in <module>()
1 new_user_count = pd.DataFrame({'UserId': 1.,'ItemId': [48081576,48081576,48081576],'Count': [1,1,2]})
----> 2 recommender4.add_users(new_user_count)
3 recs = recommender4.topN(user = 1, n=k, exclude_seen = False) # think about excluding seen
/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in add_users(self, counts_df, user_df, maxiter, stop_thr, ncores, random_seed)
1832 elif (user_df is None) and (counts_df is not None):
1833
-> 1834 counts_df, new_user_mapping = self._process_extra_df(counts_df, ttl='counts_df')
1835 counts_df['UserId'] -= self.nusers
1836 new_max_id = int(counts_df.UserId.max() + 1)
/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in _process_extra_df(self, df, ttl, df2)
793 warnings.warn(msg)
794 else:
--> 795 raise ValueError("'" + ttl + "' must contain " + subj2 + "s from the training set.")
796
797 new_ids1 = df[col1].unique()
ValueError: 'counts_df' must contain items from the training set.
So am I right to assume that I should have missing_items='include' in the CTPF call?
Yes, if you want to make predictions about them, you should pass missing_items='include'.
Thanks very much for your help
I am still having issues with the out-of-matrix prediction. After passing missing_items='include':
recommender4 = CTPF(k = 100, missing_items = 'include', random_seed = 123, ncores = -1)
recommender4.fit(counts_df=user_counts_train, words_df=word_counts, val_set=user_counts_validation) # with validation set
Then attempting to add a new user with:
new_user_count = pd.DataFrame({'UserId': -1,'ItemId': [48081576,48081576,48081576],'Count': [1,1,2]})
recommender4.add_users(new_user_count)
I'm presented with the log:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-6b2201af57f5> in <module>()
1 new_user_count = pd.DataFrame({'UserId': -1,'ItemId': [48081576,48081576,48081576],'Count': [1,1,2]})
----> 2 recommender4.add_users(new_user_count)
3 # recs = recommender4.topN(user = 1, n=k, exclude_seen = False) # think about excluding seen
/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in add_users(self, counts_df, user_df, maxiter, stop_thr, ncores, random_seed)
1832 elif (user_df is None) and (counts_df is not None):
1833
-> 1834 counts_df, new_user_mapping = self._process_extra_df(counts_df, ttl='counts_df')
1835 counts_df['UserId'] -= self.nusers
1836 new_max_id = int(counts_df.UserId.max() + 1)
/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in _process_extra_df(self, df, ttl, df2)
793 warnings.warn(msg)
794 else:
--> 795 raise ValueError("'" + ttl + "' must contain " + subj2 + "s from the training set.")
796
797 new_ids1 = df[col1].unique()
ValueError: 'counts_df' must contain items from the training set.
The ItemId [48081576] isn't in the user_counts_train data frame, but is in the word_counts data frame.
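A quick way to confirm that situation before calling add_users is to check each new ItemId against both frames. The following is a minimal sketch in plain pandas; the helper name item_coverage and the small stand-in frames are hypothetical and are not part of ctpfrec or of the real data. Printing the dtypes is only a guess at something worth ruling out (IDs stored as strings in one frame and integers in another would never match); the traceback does not confirm that this is the cause.

```python
import pandas as pd


def item_coverage(new_counts, train_counts, word_counts, item_col="ItemId"):
    """Hypothetical helper: for each distinct item ID in new_counts, report
    whether it also occurs in the training counts and in the word counts."""
    ids = pd.Series(new_counts[item_col].unique(), name=item_col)
    return pd.DataFrame({
        item_col: ids,
        "in_train": ids.isin(train_counts[item_col]),
        "in_words": ids.isin(word_counts[item_col]),
    })


# Stand-in frames (assumed shapes only; the real data cannot be shared).
train = pd.DataFrame({"UserId": [1, 2], "ItemId": [10, 11], "Count": [1, 2]})
words = pd.DataFrame({"ItemId": [10, 11, 48081576], "WordId": [1, 2, 3], "Count": [1, 1, 1]})
new_user = pd.DataFrame({"UserId": -1, "ItemId": [48081576, 48081576, 48081576], "Count": [1, 1, 2]})

report = item_coverage(new_user, train, words)
print(report)

# IDs only match when the columns hold comparable types, so compare dtypes too.
print(new_user["ItemId"].dtype, train["ItemId"].dtype, words["ItemId"].dtype)
```

An item that shows in_train as False and in_words as True is exactly the out-of-matrix case discussed here.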
I'm unable to reproduce. The following runs correctly on my machine:
import numpy as np, pandas as pd
from ctpfrec import CTPF
words_count = pd.DataFrame({
    "ItemId" : [1,2,3,4],
    "WordId" : [1,1,2,2],
    "Count" : [3,3,3,3]
})
counts_train = pd.DataFrame({
    "UserId" : [1,2,3,4],
    "ItemId" : [1,1,2,2],
    "Count" : [3,3,3,3]
})
counts_val = pd.DataFrame({
    "UserId" : [1,2,3,4],
    "ItemId" : [3,3,3,3],
    "Count" : [3,3,3,3]
})
new_user_df = pd.DataFrame({
    "UserId" : [5],
    "ItemId" : [4],
    "Count" : [3]
})
model = CTPF(k = 2, missing_items = 'include', verbose=False)
model.fit(counts_df=counts_train, words_df=words_count, val_set=counts_val)
model.add_users(counts_df=new_user_df)
It appears that ctpfrec is unable to make out-of-matrix predictions, i.e. it can't recommend items without any ratings/clicks/plays/etc.
You did ask me to upload a toy dataset to show you, which I am having trouble doing. I am also unable to upload the datasets I am using due to GDPR.
It is however very simple: I have three sets of user click data (in the required pandas triplet form {"UserId" : , "ItemId" : , "Count" : }), user_counts_train, user_counts_validation and user_counts_test, and another set, word_counts, for the items (in the required pandas triplet form {"ItemId" : , "WordId" : , "Count" : }). Importantly, there are no items in the three user sets that aren't in the word_counts set.
I fit my model using the training and validation sets:
The issue is when I attempt to make an out-of-matrix prediction using an item that appears only in the user_counts_test and word_counts sets via:
Is the issue with ctpfrec itself, or the way I am attempting to add a new user history and make predictions with topN?
Thank you
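For reference, a toy dataset with the structure described in this post could look roughly like the sketch below. All values are invented for illustration and only the frame names and column layouts come from the description; it uses plain pandas and does not call ctpfrec. It also checks the stated property that every item in the three user sets appears in word_counts, and picks out the items that occur only in the test and word sets.

```python
import pandas as pd

# Item-word counts: every item that any user set mentions must appear here.
word_counts = pd.DataFrame({
    "ItemId": [1, 1, 2, 3, 4],
    "WordId": [1, 2, 2, 3, 1],
    "Count":  [2, 1, 3, 1, 2],
})

# User-item click counts, split three ways (invented values).
user_counts_train = pd.DataFrame({"UserId": [1, 2, 3], "ItemId": [1, 1, 2], "Count": [1, 2, 1]})
user_counts_validation = pd.DataFrame({"UserId": [1, 2], "ItemId": [2, 1], "Count": [1, 1]})
user_counts_test = pd.DataFrame({"UserId": [3], "ItemId": [4], "Count": [2]})

# Property from the description: no user set contains an item missing from word_counts.
known_items = set(word_counts["ItemId"])
for frame in (user_counts_train, user_counts_validation, user_counts_test):
    assert set(frame["ItemId"]) <= known_items

# Items with word counts and test clicks but no training or validation clicks:
# these are the out-of-matrix items.
seen_items = set(user_counts_train["ItemId"]) | set(user_counts_validation["ItemId"])
cold_items = sorted(int(i) for i in set(user_counts_test["ItemId"]) - seen_items)
print(cold_items)
```

A dataset of this shape contains no real user data, so it could be shared without the GDPR concern mentioned above.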