Closed AsyaEvloeva closed 2 years ago
@AsyaEvloeva Hello, thanks for your attention to RecBole! RecBole provides a case study utility for examining in depth how a specific recommendation algorithm performs: case_study.py analyzes the recommendation results for selected users. You can refer to our documentation about case study for more details, and to case_study_example.py in run_example for usage.
As for test_data.dataset.user_num, take ml-100k as an example. The first five lines of ml-100k.inter are as follows:
user_id:token | item_id:token | rating:float | timestamp:float |
---|---|---|---|
196 | 242 | 3 | 881250949 |
186 | 302 | 3 | 891717742 |
22 | 377 | 1 | 878887116 |
244 | 51 | 2 | 880606923 |
166 | 346 | 1 | 886397596 |
196 and 186 are external user ids; they identify users in the dataset. In RecBole, the user ids are remapped to a continuous sequence in recbole.data.dataset. Therefore, 196 is remapped to 1 and 186 is remapped to 2. We add [PAD] for all token-like fields, so after remapping, 0 is reserved for [PAD]. This makes Dataset.user_num and Dataset.item_num one larger than the actual number of users and items. In this example, 1 and 2 are internal user ids.
Therefore, if you want to get the predicted items of all users with the pre-trained RecVAE, you only need to change model_file and uid_series in case_study_example.py as follows and run the file. You can also refer to case_study.py for more details about how the predictions are implemented.
import numpy as np
from recbole.quick_start import load_data_and_model

config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
    model_file='./saved/RecVAE-Nov-02-2021_22-51-47.pth',  # Replace this with your own model path.
)
uid_series = np.arange(1, test_data.dataset.user_num)  # start at 1 because 0 is reserved for [PAD]
In addition, if you want to access the initial items for each user from the test data, you can write code like this:
history_item = test_data.uid2history_item[list(uid_series)]
I hope my answer is helpful to you.
Thank you for the reply! I tried what you suggested and it works perfectly well on the ml-100k dataset.
However, when I try to do the same with my own dataset, it throws this error:
raise ValueError(f'token [{tokens}] is not existed in {field}')
ValueError: token [1] is not existed in user_id
with this code:
from recbole.quick_start import run_recbole, load_data_and_model
import os
import glob
folder = './saved_dir'
mydataset = 'mydataset'
files = glob.glob(folder+'/*')
for f in files:
    os.remove(f)
def save_example():
    # configurations initialization
    config_dict = {
        'data_path': './',
        'dataset': mydataset,
        'checkpoint_dir': folder,
        'save_dataset': True,
        'save_dataloaders': True,
        'user_inter_num_interval': "[30,inf)",
        'item_inter_num_interval': "[30,inf)",
        'USER_ID_FIELD':'user_id',
        'ITEM_ID_FIELD':'item_id',
        'TIME_FIELD':'timestamp',
        'RATING_FIELD':'rate',
        'load_col': {'inter': ['user_id', 'item_id', 'rate', 'timestamp']},
    }
    run_recbole(model='RecVAE', dataset=mydataset, config_dict=config_dict)
save_example()
model_path = glob.glob(folder + '/RecVAE*')[0]
print('model_path:', model_path)
# model_path: ./saved_dir/RecVAE-Nov-03-2021_05-01-22.pth
# Filtered dataset and split dataloaders are created according to 'config'.
config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
model_file=model_path,
)
print(config)
print(model)
print('user_ids:', test_data.dataset['user_id'])
# user_ids: tensor([ 1, 1, 1, ..., 241, 241, 241])
outer_ids = list(test_data.dataset['user_id'].cpu().numpy())
print('first id:', [str(outer_ids[0])])
# first id: ['1']
# or you can use dataset.token2id to transfer external user token to internal user id
uid_series = test_data.dataset.token2id(test_data.dataset.uid_field, [str(outer_ids[0])])
print('uid_series:', uid_series)
# raise ValueError(f'token [{tokens}] is not existed in {field}')
# ValueError: token [1] is not existed in user_id
This is what my dataset looks like:
mydataset.inter
user_id:token | item_id:token | price:token | rate:token | timestamp:float |
---|---|---|---|---|
69701 | 1130826 | 9.219860106306552 | 436 | 1634850000.0 |
21543 | 1130809 | 9.219860106306552 | 436 | 1634850000.0 |
mydataset.item
item_id:token | collection:token | source:token_seq |
---|---|---|
1130826 | 3748 | WEB |
1130809 | 3748 | WEB |
mydataset.user
user_id:token |
---|
21543 |
64158 |
So I don't really understand why token [1] is not existed in user_id, because 1 is definitely in the test_data.dataset['user_id'] list.
@AsyaEvloeva Hello, you may be confusing external user tokens with internal user ids. In your dataset, 69701 and 21543 are external user ids, also called external user tokens. In RecBole, the external user ids are remapped to a continuous sequence in recbole.data.dataset, and the remapped ids are the internal user ids. To convert between the two, use dataset.token2id to map an external user token to an internal user id, or dataset.id2token to map an internal user id to an external user token. Since the values in test_data.dataset['user_id'] are already remapped ids, you don't need to convert them again. In other words, uid_series=test_data.dataset['user_id'], and token2id should not be applied to it.
user_id:token | item_id:token | price:token | rate:token | timestamp:float |
---|---|---|---|---|
69701 | 1130826 | 9.219860106306552 | 436 | 1634850000.0 |
21543 | 1130809 | 9.219860106306552 | 436 | 1634850000.0 |
For simplicity, you can also build the internal user id series uid_series directly, like this: uid_series = np.array([1, 2]) # internal user id series. In your dataset, 69701 is remapped to 1 and 21543 is remapped to 2. We add [PAD] for all token-like fields, so after remapping, 0 is reserved for [PAD]. 1 and 2 are internal user ids, while 69701 and 21543 are tokens. Because 1 is an internal user id rather than a token, passing it to token2id raises the error you saw.
All in all, there are two ways to get uid_series.
uid_series = np.array([1, 2])  # get internal user id series directly
uid_series = dataset.token2id(dataset.uid_field, ['69701', '21543'])  # or you can use dataset.token2id to transfer external user token to internal user id
So the right way to access all external ids of just test_data is this, right?
internal_ids = list(np.unique(test_data.dataset['user_id'].cpu().numpy()))
external_ids = dataset.id2token(dataset.uid_field, internal_ids)
print('external_ids:', external_ids)
when I do:
uid_series = dataset.token2id(dataset.uid_field, ['69701', '21543'])
I receive:
raise ValueError(f'token [{tokens}] is not existed in {field}')
ValueError: token [69701] is not existed in user_id
although 69701 is in my initial atomic files, it is not in my dataset['user_id'].
Sorry, I also can't find a reference for what [PAD] stands for.
@AsyaEvloeva Atomic files are the unprocessed raw input dataset, whereas RecBole applies a series of preprocessing steps common in recommender systems, such as k-core data filtering and missing value imputation. Some user_id values may have been filtered out. To check, you can print the token2id and id2token mappings as follows:
print(dataset.field2token_id[dataset.uid_field])
print(dataset.field2id_token[dataset.uid_field])
Maybe 69701 is not in the keys of dataset.field2token_id[dataset.uid_field].
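The check suggested above can be wrapped in a small helper that separates tokens that survived filtering from those that did not, so that token2id is only called with tokens it knows. This is a minimal sketch: split_known_tokens is a hypothetical helper, not part of RecBole, and the mapping below is made-up data standing in for dataset.field2token_id[dataset.uid_field].

```python
def split_known_tokens(token2id, tokens):
    """Split tokens into those present in the mapping and those missing from it.

    token2id is any dict from external token to internal id, for example
    dataset.field2token_id[dataset.uid_field] (assumed to behave like a dict).
    """
    known = {t: token2id[t] for t in tokens if t in token2id}
    missing = [t for t in tokens if t not in token2id]
    return known, missing


# Hypothetical mapping after filtering: user '69701' was dropped.
token2id = {'[PAD]': 0, '21543': 1, '64158': 2}

known, missing = split_known_tokens(token2id, ['69701', '21543'])
print('known:', known)
print('missing:', missing)
```

Only the tokens in known can be passed safely to dataset.token2id; the ones in missing were removed during preprocessing and have no internal id.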
As for [PAD], we add it to all token-like fields because 0 is always the padding value for token-like features. For example, if test is a token-like feature, token_a is remapped to 1 and token_b is remapped to 2. Then field2id_token['test'] = ['[PAD]', 'token_a', 'token_b'], and field2token_id['test'] = {'[PAD]': 0, 'token_a': 1, 'token_b': 2}.
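The remapping described in this thread can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not RecBole's implementation: remap_tokens is a hypothetical function, and it assumes ids are assigned in order of first appearance, as the examples in this thread suggest.

```python
PAD = '[PAD]'


def remap_tokens(tokens):
    """Assign consecutive internal ids to tokens, reserving id 0 for [PAD]."""
    id2token = [PAD]
    token2id = {PAD: 0}
    for token in tokens:
        if token not in token2id:
            token2id[token] = len(id2token)
            id2token.append(token)
    return id2token, token2id


# user_id column of the two mydataset.inter rows quoted in this thread
id2token, token2id = remap_tokens(['69701', '21543'])

print(id2token)
print(token2id)

# The size of the field counts [PAD], so it is one more than the real users.
user_num = len(id2token)
print(user_num)
```

This is why a range such as np.arange(1, user_num) covers every real user: it skips id 0 and stops at the last assigned id.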
Thank you for the reply! Yes, I think some user_id values have been filtered out; I just don't know how to handle that. As I understand it, when I add configurations like 'val_interval': {'rating': "[4,inf)"}, 'train_interval': {'rating': "[4,inf)"}, or apply k-core filtering, I get an error saying that a token is not existed in user_id.
Here is what I'm trying to do (on the ml-100k dataset):
# Filtered dataset and split dataloaders are created according to 'config'.
config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
model_file=model_path,
)
# internal users all
internal_ids = list(dataset['user_id'].cpu().numpy())
print('internal_ids len:', len(internal_ids))
# internal users test
internal_ids_test = list(test_data.dataset['user_id'].cpu().numpy())
print('internal_ids_test len:', len(internal_ids_test))
# id2token: internal to external users all
external_ids = dataset.id2token(dataset.uid_field, internal_ids)
print('external_ids len:', len(external_ids))
# id2token: internal to external users test
external_ids_test = dataset.id2token(dataset.uid_field, internal_ids_test)
print('external_ids_test len:', len(external_ids_test))
# token2id: external to internal users all
uid_series = dataset.token2id(dataset.uid_field, external_ids)
print('uid_series len', len(uid_series))
# token2id: external to internal users test
uid_series_test = dataset.token2id(dataset.uid_field, external_ids_test)
print('uid_series_test len', len(uid_series_test))
# internal items all
internal_items = list(dataset['item_id'].cpu().numpy())
print('internal_items len:', len(internal_items))
# internal items test
internal_items_test = list(test_data.dataset['item_id'].cpu().numpy())
print('internal_items_test len:', len(internal_items_test))
# id2token: internal to external users all
external_items = dataset.id2token(dataset.uid_field, internal_ids)
print('external_items len:', len(external_items), external_items.shape)
# id2token: internal to external users test
external_items_test = dataset.id2token(dataset.uid_field, internal_ids_test)
print('external_items_test len:', len(external_items_test), external_items_test.shape)
# token2id: external to internal users all
uid_series_items = dataset.token2id(dataset.uid_field, external_ids)
print('uid_series_items len:', len(uid_series_items))
# token2id: external to internal users test
uid_series_items_test = dataset.token2id(dataset.uid_field, external_ids_test)
print('uid_series_items_test len:', len(uid_series_items_test))
# predicted internal id of top 10 items
topk_score, predicted_topk_iid_list = full_sort_topk(uid_series_test, model, test_data, k=10, device=config['device'])
print('topk_iid_list:', len(predicted_topk_iid_list))
predicted_items = [list(map(int, i)) for i in predicted_topk_iid_list]
predicted_items = [list(map(str, i)) for i in predicted_items]
print('predicted_items:', len(predicted_items))
# external id of top 10 predicted internal items
predicted_items_reverted = [dataset.token2id(test_data.dataset.iid_field, i) for i in predicted_items]
print('predicted_items_reverted:', len(predicted_items_reverted))
So when I use 'val_interval': {'rating': "[4,inf)"}, 'train_interval': {'rating': "[4,inf)"}, in the config, I get an error on the line predicted_items_reverted = [dataset.token2id(test_data.dataset.iid_field, i) for i in predicted_items]: ValueError: token [599] is not existed in item_id. When I don't use any filtering, I don't get an error there.
I'm not sure how I can fix that. Have I made a mistake while converting?
@AsyaEvloeva As for full_sort_topk, the returned predicted_topk_iid_list contains the indices of the top-k items, which are also the internal ids of those items. So you should use predicted_items_reverted = [dataset.id2token(test_data.dataset.iid_field, i) for i in predicted_items] to get the external ids of the top 10 predicted items. You used token2id by mistake, which caused the error.
Sorry, could you please give an example of how I would convert the internal predicted items to external ones for each user? I think I understand how to convert user ids, but I'm not sure I'm converting items the right way.
1) I'm having trouble at this point, using the movies dataset as an example:
The number of users: 944
Average actions of users: 106.04453870625663
The number of items: 1683
Average actions of items: 59.45303210463734
The number of inters: 100000
There are 1683 items, where the first item is just padding (?). So when I receive the scores of all items from the full_sort_scores function, can I drop one element, since it is not a real item and there are actually 1682 items, not 1683? And what would then be the correct conversion to external items?
2) I'm also struggling to understand how I would predict items for new users. For example, I have already trained my model on users 1, 2, 3, 4 and now want to predict items for user 1 (who was in the training set) and user 5 (who is a new user).
input_inter = Interaction({
'user_id': torch.tensor([1, 100]),
'item_id_list': torch.tensor([[1, 2, 3, 0, 0],
[4, 5, 0, 0, 0]])
})
and when I'm trying to do
with torch.no_grad():
scores = model.full_sort_predict(input_inter)
I'm getting IndexError: index 100 is out of bounds for dimension 0 with size 5
how would I fix that?
@AsyaEvloeva In your example, you just need to use id2token correctly, and you can get the predicted external items as follows:
topk_score, topk_iid_list = full_sort_topk(uid_series, model, test_data, k=10, device=config['device'])
print(topk_score) # scores of top 10 items
print(topk_iid_list) # internal id of top 10 items
external_item_list = dataset.id2token(dataset.iid_field, topk_iid_list.cpu())
print(external_item_list) # external tokens of top 10 items
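To see which user each row of predictions belongs to, the external user tokens and the external item tokens can be zipped together. The sketch below assumes that row i of the top-k result corresponds to the i-th entry of uid_series; pair_users_with_items is a hypothetical helper, and the tokens are made-up stand-ins for the outputs of dataset.id2token(dataset.uid_field, uid_series) and external_item_list.

```python
def pair_users_with_items(user_tokens, topk_item_tokens):
    """Map each external user token to its list of recommended item tokens.

    Assumes topk_item_tokens[i] holds the recommendations for user_tokens[i].
    """
    if len(user_tokens) != len(topk_item_tokens):
        raise ValueError('expected one row of items per user')
    return {user: list(items) for user, items in zip(user_tokens, topk_item_tokens)}


# Made-up tokens for illustration only (k=2 for brevity).
user_tokens = ['69701', '21543']
topk_item_tokens = [['1130826', '1130809'], ['1130809', '1130826']]

recommendations = pair_users_with_items(user_tokens, topk_item_tokens)
for user, items in recommendations.items():
    print(user, '->', items)
```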
As for your second question, we can't provide reliable recommendations for new users who have no interaction records. Cold start is an open problem in recommender systems, and the reason is intuitive: we can't personalize recommendations for a new user without any information about them. For models that use user embeddings, you can only use users who appeared in the training set.
Towards that, later studies like SVD++, FISM, NAIS, and ACF view personal history of a user as her features, and integrate embeddings of historical items via the average or attention network as user embeddings. Another similar line applies autoencoders on interaction histories to estimate the generative process of user behaviors, such as Mult-VAE, AutoRec, CDAE and RecVAE.
In this case, we don't need user ID, and we only need to recommend items based on historical interaction. However, RecBole doesn't have an existing interface to provide item list recommendation for new users, so you need to write your own code according to your needs.
For example, in the AE-based models, you need to modify rating_matrix during prediction: use the item list directly instead of self.history_item_id[user], so that a batch of user features is built from item_id_list.
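As a rough illustration of that idea, the sketch below turns a padded item_id_list into a multi-hot matrix, which is the kind of input an autoencoder-based model consumes. It is plain Python rather than RecBole code: build_rating_matrix is a hypothetical helper, n_items is an assumed value, and a real model would do the same thing on tensors.

```python
def build_rating_matrix(item_id_list, n_items):
    """Build one multi-hot row per user from padded lists of internal item ids.

    Id 0 is treated as [PAD] and ignored; n_items includes the [PAD] slot.
    """
    matrix = []
    for items in item_id_list:
        row = [0.0] * n_items
        for item_id in items:
            if item_id != 0:  # skip padding
                row[item_id] = 1.0
        matrix.append(row)
    return matrix


# The padded histories from the question above, with an assumed n_items of 6.
item_id_list = [[1, 2, 3, 0, 0],
                [4, 5, 0, 0, 0]]
rating_matrix = build_rating_matrix(item_id_list, n_items=6)
for row in rating_matrix:
    print(row)
```

Because the rows are built from the item histories alone, no user id is looked up, which is what makes this route usable for users the model has not seen.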
thank you for this example!! now i get it😁
I'm not looking for a solution to the cold start problem, but for recommendations for new users who do come with interaction records. So if I'm using the RecVAE model, is there any built-in way to update the dataset with such new records, continue training the model, and make predictions on the updated dataset?
@AsyaEvloeva Sorry, we don't have such an interface at present.
I'm trying to understand how to get all users from my test dataset together with the predicted items for each user. I use the RecVAE model, and when I get top items like that:
it returns just items, and their number doesn't correspond to the number of users I have in test_data.dataset.user_num.
So I'm a little lost: how would I see the user for each of these topk_items predictions? And how do I access the initial items for each user from the test data?