RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

[Question] How I can access user-items predictions and actuals? #1032

Closed AsyaEvloeva closed 2 years ago

AsyaEvloeva commented 2 years ago

I'm trying to understand how to get all users in my test dataset together with the predicted items for each user. I use the RecVAE model, and I get the top items like this:

import numpy as np
import torch

def run_predict(model, test_data, device, k):
    item_tensor = test_data.dataset.get_item_feature().to(device).repeat(test_data.step)
    tot_item_num = test_data.dataset.item_num
    topk_list = []
    for batched_data in test_data:
        interaction, history_index, *_ = batched_data
        try:
            scores = model.full_sort_predict(interaction.to(device))
        except NotImplementedError:
            new_inter = interaction.to(
                device).repeat_interleave(tot_item_num)
            batch_size = len(new_inter)
            new_inter.update(item_tensor[:batch_size])
            scores = model.predict(new_inter)
        scores = scores.view(-1, tot_item_num)
        scores[:, 0] = -np.inf
        if history_index is not None:
            scores[history_index] = -np.inf
        topk_list.append(torch.topk(scores, k=k)[1])
    topk_items = torch.cat(topk_list, dim=0).unique()
    topk_items = topk_items.cpu().numpy()

    return topk_items

It returns just items, and their number doesn't correspond to the number of users in test_data.dataset.user_num. So I'm a little lost: how would I see which user each of these topk_items predictions belongs to? And how do I access the initial items for each user from the test data?

Sherry-XLL commented 2 years ago

@AsyaEvloeva Hello, thanks for your attention to RecBole! RecBole includes a case study utility, case_study.py, for in-depth study of a specific recommendation algorithm: it analyzes the recommendation results for selected users. You can refer to our documentation about case study for more details, and to case_study_example.py in run_example for usage.

https://github.com/RUCAIBox/RecBole/blob/99bc8e60b61f7e0e51df3fea599c7a72fe7d4750/run_example/case_study_example.py#L13-L36

As for test_data.dataset.user_num, take ml-100k for example. The first five lines of ml-100k.inter are as follows:

user_id:token item_id:token rating:float timestamp:float
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596

196 and 186 are external user ids: they identify users in the raw dataset. In RecBole, user ids are remapped to a contiguous sequence in recbole.data.dataset, so 196 is remapped to 1 and 186 is remapped to 2; 1 and 2 are internal user ids. We add [PAD] to all token-like fields, so after remapping, 0 is reserved for [PAD]. This makes Dataset.user_num and Dataset.item_num one more than the actual number of users and items.
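The remapping described above can be sketched in plain Python. This is only an illustration of the idea, not RecBole's implementation; `build_remap` and `PAD` are hypothetical names, and first-seen ordering is assumed because it matches the 196 → 1, 186 → 2 example.

```python
# Illustrative sketch only: mimics the remapping described above,
# it is not RecBole's actual code.
PAD = "[PAD]"

def build_remap(tokens):
    """Map external tokens to contiguous internal ids, reserving 0 for [PAD]."""
    id2token = [PAD]
    token2id = {PAD: 0}
    for tok in tokens:
        if tok not in token2id:  # keep first-seen order, skip duplicates
            token2id[tok] = len(id2token)
            id2token.append(tok)
    return token2id, id2token

# user_id column from the first five lines of ml-100k.inter
raw_users = ["196", "186", "22", "244", "166"]
token2id, id2token = build_remap(raw_users)

print(token2id["196"], token2id["186"])  # 1 2
user_num = len(id2token)  # counts [PAD] as well as the 5 real users
print(user_num)           # 6
```

This is why a range over real internal user ids starts at 1 rather than 0.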

Therefore, if you want to get the predicted items of all users with the pre-trained RecVAE, you only need to change model_file and uid_series in case_study_example.py as follows and run the file. You can also refer to case_study.py for more details about how predictions are implemented.

    config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
        model_file='./saved/RecVAE-Nov-02-2021_22-51-47.pth',
    )  # Here you can replace it by your model path.
    uid_series = np.arange(1, test_data.dataset.user_num)

In addition, if you want to access the initial items for each user from the test data, you can write code like this. I hope my answer is helpful to you.

history_item = test_data.uid2history_item[list(uid_series)]

AsyaEvloeva commented 2 years ago

Thank you for the reply! I tried what you suggested and it works perfectly well on the ml-100k dataset. But when I try the same with my own dataset, it throws this error:

    raise ValueError(f'token [{tokens}] is not existed in {field}')
ValueError: token [1] is not existed in user_id

with this code:

from recbole.quick_start import run_recbole, load_data_and_model
import os
import glob

folder = './saved_dir'
mydataset = 'mydataset'

files = glob.glob(folder+'/*')
for f in files:
    os.remove(f)

def save_example():
    # configurations initialization
    config_dict = {
        'data_path': './',
        'dataset': mydataset,
        'checkpoint_dir': folder,
        'save_dataset': True,
        'save_dataloaders': True,
        'user_inter_num_interval': "[30,inf)",
        'item_inter_num_interval': "[30,inf)",
        'USER_ID_FIELD':'user_id',
        'ITEM_ID_FIELD':'item_id',
        'TIME_FIELD':'timestamp',
        'RATING_FIELD':'rate',
        'load_col': {'inter': ['user_id', 'item_id', 'rate', 'timestamp']},
    }
    run_recbole(model='RecVAE', dataset=mydataset, config_dict=config_dict)

save_example()

model_path = glob.glob(folder + '/RecVAE*')[0]
print('model_path:', model_path)
# model_path: ./saved_dir/RecVAE-Nov-03-2021_05-01-22.pth

# Filtered dataset and split dataloaders are created according to 'config'.
config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
    model_file=model_path,
)

print(config)
print(model)

print('user_ids:', test_data.dataset['user_id'])
# user_ids: tensor([  1,   1,   1,  ..., 241, 241, 241])
outer_ids = list(test_data.dataset['user_id'].cpu().numpy())
print('first id:', [str(outer_ids[0])])
# first id: ['1']

# or you can use dataset.token2id to transfer external user token to internal user id
uid_series = test_data.dataset.token2id(test_data.dataset.uid_field, [str(outer_ids[0])])
print('uid_series:', uid_series)

# raise ValueError(f'token [{tokens}] is not existed in {field}')
# ValueError: token [1] is not existed in user_id

This is what my dataset looks like:

mydataset.inter

user_id:token item_id:token price:token rate:token timestamp:float
69701 1130826 9.219860106306552 436 1634850000.0
21543 1130809 9.219860106306552 436 1634850000.0

mydataset.item

item_id:token collection:token source:token_seq
1130826 3748 WEB
1130809 3748 WEB

mydataset.user

user_id:token
21543
64158

So I don't really understand why I get "token [1] is not existed in user_id", because 1 is definitely in the test_data.dataset['user_id'] list.

Sherry-XLL commented 2 years ago

@AsyaEvloeva Hello, you may be confusing external user tokens with internal user ids. In your dataset, 69701 and 21543 are external user ids, also called external user tokens. In RecBole, the external user ids are remapped to a contiguous sequence in recbole.data.dataset, and the remapped ids are internal user ids. To convert between the two, use dataset.token2id to go from external user token to internal user id, or dataset.id2token to go from internal user id to external user token. Since the users in test_data.dataset['user_id'] are already remapped ids, you don't need to convert them again. In other words, uid_series = test_data.dataset['user_id'], and token2id should not be applied to it.

user_id:token item_id:token price:token rate:token timestamp:float
69701 1130826 9.219860106306552 436 1634850000.0
21543 1130809 9.219860106306552 436 1634850000.0

For simplicity, you can also write the internal user id series directly, e.g. uid_series = np.array([1, 2]). In your dataset, 69701 is remapped to 1 and 21543 is remapped to 2. We add [PAD] to all token-like fields, so after remapping, 0 is reserved for [PAD]. 1 and 2 are internal user ids, while 69701 and 21543 are tokens. Because 1 is an internal user id rather than a token, passing it to token2id raises the error.

All in all, there are two ways to get uid_series.
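The difference between the two directions can be reproduced with a toy mapping. The sketch below only imitates the behaviour described in this thread; `token_to_id`, `id_to_token` and the two helpers are hypothetical stand-ins, not RecBole's code.

```python
# Toy stand-in for the mapping of one token field; all names are hypothetical.
token_to_id = {"[PAD]": 0, "69701": 1, "21543": 2}
id_to_token = ["[PAD]", "69701", "21543"]

def token2id(tokens):
    """External tokens -> internal ids; unknown tokens raise ValueError."""
    missing = [t for t in tokens if t not in token_to_id]
    if missing:
        raise ValueError(f"token {missing} is not existed in user_id")
    return [token_to_id[t] for t in tokens]

def id2token(ids):
    """Internal ids -> external tokens."""
    return [id_to_token[i] for i in ids]

print(token2id(["69701", "21543"]))  # [1, 2]
print(id2token([1, 2]))              # ['69701', '21543']

try:
    token2id(["1"])  # "1" is an internal id, not an external token
except ValueError as err:
    print("error:", err)
```

Values read from a processed dataset are already internal ids, so they go through `id2token`, never `token2id`.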

AsyaEvloeva commented 2 years ago

So the right way to access all external ids of just test_data is this, right?

internal_ids = list(np.unique(test_data.dataset['user_id'].cpu().numpy()))
external_ids = dataset.id2token(dataset.uid_field, internal_ids)
print('external_ids:', external_ids)

When I do uid_series = dataset.token2id(dataset.uid_field, ['69701', '21543']), I receive:

    raise ValueError(f'token [{tokens}] is not existed in {field}')
ValueError: token [69701] is not existed in user_id

I do have 69701 in my initial atomic files, though, just not in my dataset['user_id'].

Sorry, I can't find a reference for what [PAD] stands for.

Sherry-XLL commented 2 years ago

@AsyaEvloeva Atomic files are the unprocessed raw input, while RecBole applies a series of preprocessing steps to them, such as k-core data filtering and missing value imputation. I'm not sure whether any user_id has been filtered out; you can print the token2id and id2token mappings as follows:

print(dataset.field2token_id[dataset.uid_field])
print(dataset.field2id_token[dataset.uid_field])

Maybe 69701 is not in the keys of dataset.field2token_id[dataset.uid_field].
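To see how a token from the atomic files can disappear from the mapping, here is a minimal sketch. It assumes a single-pass filter on users only, which is simpler than real k-core filtering, and the interaction data and `filter_users` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical interaction log of (user_token, item_token) pairs.
interactions = [
    ("69701", "a"),
    ("21543", "a"), ("21543", "b"), ("21543", "c"),
    ("64158", "a"), ("64158", "b"),
]

def filter_users(inters, min_count):
    """Keep only interactions of users with at least min_count records."""
    counts = Counter(user for user, _ in inters)
    return [(u, i) for u, i in inters if counts[u] >= min_count]

kept = filter_users(interactions, min_count=2)

# Remap only the surviving users; 0 stays reserved for [PAD].
token2id = {"[PAD]": 0}
for user, _ in kept:
    token2id.setdefault(user, len(token2id))

print(token2id)
print("69701" in token2id)  # False: the user was filtered out before remapping
```

A token that is filtered out never receives an internal id, so looking it up afterwards fails even though it is present in the raw file.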

As for [PAD], we add [PAD] to all token-like fields because 0 is always reserved for padding in token-like features. For example, if test is a token-like field where token_a is remapped to 1 and token_b is remapped to 2, then field2id_token['test'] = ['[PAD]', 'token_a', 'token_b'], and field2token_id['test'] = {'[PAD]': 0, 'token_a': 1, 'token_b': 2}.

AsyaEvloeva commented 2 years ago

Thank you for the reply! Yes, I think some user_id has been filtered out; I just don't know how to deal with that. As far as I understand, when I add configuration like 'val_interval': {'rating': "[4,inf)"}, 'train_interval': {'rating': "[4,inf)"}, or apply k-core filtering, I get the error that the token is not existed in user_id.

here is what I'm trying to do (on ml-100k dataset):

# Filtered dataset and split dataloaders are created according to 'config'.
config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
    model_file=model_path,
)

# internal users all
internal_ids = list(dataset['user_id'].cpu().numpy())
print('internal_ids len:', len(internal_ids))
# internal users test
internal_ids_test = list(test_data.dataset['user_id'].cpu().numpy())
print('internal_ids_test len:', len(internal_ids_test))

# id2token: internal to external users all
external_ids = dataset.id2token(dataset.uid_field, internal_ids)
print('external_ids len:', len(external_ids))
# id2token: internal to external users test
external_ids_test = dataset.id2token(dataset.uid_field, internal_ids_test)
print('external_ids_test len:', len(external_ids_test))

# token2id: external to internal users all
uid_series = dataset.token2id(dataset.uid_field, external_ids)
print('uid_series len', len(uid_series))
# token2id: external to internal users test
uid_series_test = dataset.token2id(dataset.uid_field, external_ids_test)
print('uid_series_test len', len(uid_series_test))

# internal items all
internal_items = list(dataset['item_id'].cpu().numpy())
print('internal_items len:', len(internal_items))
# internal items test
internal_items_test = list(test_data.dataset['item_id'].cpu().numpy())
print('internal_items_test len:', len(internal_items_test))

# id2token: internal to external users all
external_items = dataset.id2token(dataset.uid_field, internal_ids)
print('external_items len:', len(external_items), external_items.shape)
# id2token: internal to external users test
external_items_test = dataset.id2token(dataset.uid_field, internal_ids_test)
print('external_items_test len:', len(external_items_test), external_items_test.shape)

# token2id: external to internal users all
uid_series_items = dataset.token2id(dataset.uid_field, external_ids)
print('uid_series_items len:', len(uid_series_items))
# token2id: external to internal users test
uid_series_items_test = dataset.token2id(dataset.uid_field, external_ids_test)
print('uid_series_items_test len:', len(uid_series_items_test))

# predicted internal id of top 10 items
topk_score, predicted_topk_iid_list  = full_sort_topk(uid_series_test, model, test_data, k=10, device=config['device'])
print('topk_iid_list:', len(predicted_topk_iid_list))
predicted_items = [list(map(int, i))  for i in predicted_topk_iid_list]
predicted_items = [list(map(str, i))  for i in predicted_items]
print('predicted_items:', len(predicted_items))

# external id of top 10 predicted internal items
predicted_items_reverted = [dataset.token2id(test_data.dataset.iid_field, i)  for i in predicted_items]
print('predicted_items_reverted:', len(predicted_items_reverted))

So when I use 'val_interval': {'rating': "[4,inf)"}, 'train_interval': {'rating': "[4,inf)"} in the config, I get an error on the line predicted_items_reverted = [dataset.token2id(test_data.dataset.iid_field, i) for i in predicted_items]: ValueError: token [599] is not existed in item_id. When I don't use any filtering, I don't get any error there. I'm not sure how to fix that. Did I make a mistake in the conversion?

Sherry-XLL commented 2 years ago

@AsyaEvloeva As for full_sort_topk, the returned predicted_topk_iid_list holds the indices of the top-k items, which are also their internal item ids. So you should use predicted_items_reverted = [dataset.id2token(test_data.dataset.iid_field, i) for i in predicted_items] to get the external ids of the top 10 predicted items. You used token2id by mistake, so an error occurred.

https://github.com/RUCAIBox/RecBole/blob/7bd3df81edf5554ef65172601f37ef629291d7d5/recbole/utils/case_study.py#L72-L92
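Converting a whole batch of top-k results can be pictured as indexing an id-to-token table with a 2-D array of internal ids. The sketch below uses NumPy and a made-up `id_token` table; treat it as an illustration of the direction of the conversion rather than as RecBole's implementation.

```python
import numpy as np

# Hypothetical id -> token table for the item field; index 0 is [PAD].
id_token = np.array(["[PAD]", "242", "302", "377", "51", "346"])

# Internal item ids of the top-2 items for three users (one row per user).
topk_iid = np.array([[2, 5],
                     [1, 3],
                     [4, 2]])

external = id_token[topk_iid]  # fancy indexing keeps the (users, k) shape
print(external.shape)          # (3, 2)
print(external[0].tolist())    # ['302', '346']
```

Row r of the result still belongs to the r-th user in the queried series, so users and their predicted items stay aligned.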

AsyaEvloeva commented 2 years ago

Sorry, could you please give an example of how to convert the internal predicted items to external ones for each user? I think I understand how to convert user ids, but I'm not sure I'm converting items the right way.

1) I have trouble at this point, using the movies dataset as an example:

The number of users: 944
Average actions of users: 106.04453870625663
The number of items: 1683
Average actions of items: 59.45303210463734
The number of inters: 100000

There are 1683 items, where the first item is just padding (?). So when I receive the scores of all items from the full_sort_scores function, can I drop one element, since it's not a real item (there are actually 1682 items, not 1683)? And what would then be the correct conversion to external items?

2) I'm also struggling to understand how to predict items for new users. For example, I have already trained my model on users 1, 2, 3, 4 and now want to predict items for user 1 (who was in the train set) and user 5 (who is a new user):

input_inter = Interaction({
    'user_id': torch.tensor([1, 100]),
    'item_id_list': torch.tensor([[1, 2, 3, 0, 0],
                                  [4, 5, 0, 0, 0]])
})

and when I'm trying to do

with torch.no_grad():
    scores = model.full_sort_predict(input_inter)

I'm getting IndexError: index 100 is out of bounds for dimension 0 with size 5

how would I fix that?

Sherry-XLL commented 2 years ago

@AsyaEvloeva In your example, you just need to use id2token correctly and you can get the predicted external items as follows:

topk_score, topk_iid_list = full_sort_topk(uid_series, model, test_data, k=10, device=config['device'])
print(topk_score)  # scores of top 10 items
print(topk_iid_list)  # internal id of top 10 items
external_item_list = dataset.id2token(dataset.iid_field, topk_iid_list.cpu())
print(external_item_list)  # external tokens of top 10 items
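On the padding question, one option is to keep the padding column and mask it instead of dropping it, as the run_predict function at the top of this thread does with scores[:, 0] = -np.inf. That way column indices remain valid internal item ids. The sketch below uses NumPy with made-up scores and a hypothetical `history` dict; it is not the library's code.

```python
import numpy as np

# Hypothetical scores for 2 users over 5 item ids (column 0 is [PAD]).
scores = np.array([[9.0, 0.1, 0.7, 0.3, 0.5],
                   [9.0, 0.8, 0.2, 0.6, 0.4]])
history = {0: [2], 1: [1, 3]}  # row index -> item ids already interacted with

masked = scores.copy()
masked[:, 0] = -np.inf  # never recommend the padding slot
for row, items in history.items():
    masked[row, items] = -np.inf  # hide items the user has already seen

k = 2
# Sort descending; row r holds the top-k internal item ids for user r.
topk = np.argsort(-masked, axis=1)[:, :k]
print(topk.tolist())  # [[4, 3], [4, 2]]
```

Because no column is removed, the resulting indices can be passed to id2token unchanged.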

As for your second question, we can't provide reliable recommendations for new users without any interaction records. Cold start is an ongoing problem in recommender systems, and intuitively so: without any information about a new user, there is nothing to personalize on. For models that use user embeddings, you can only use users who appeared in the training set.

To address this, later studies such as SVD++, FISM, NAIS, and ACF treat a user's interaction history as her features, and combine the embeddings of historical items, via averaging or an attention network, into a user embedding. A similar line of work applies autoencoders to interaction histories to estimate the generative process of user behaviors, such as Mult-VAE, AutoRec, CDAE and RecVAE.

In this case, we don't need a user ID; we only need to recommend items based on historical interactions. However, RecBole doesn't have an existing interface for recommending an item list to new users, so you need to write your own code according to your needs.

For example, in the AE-based models, you need to modify rating_matrix during prediction: instead of using self.history_item_id[user], build the batch of user features directly from item_id_list.
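A rough sketch of that idea, assuming the model consumes a binary user-by-item matrix: the padded item_id_list from the question is turned into multi-hot rows. `history_to_rating_matrix` is a hypothetical helper written with NumPy, and wiring its output into a specific model's prediction code is not shown here.

```python
import numpy as np

def history_to_rating_matrix(item_id_list, item_num):
    """Turn padded item-id histories into a (batch, item_num) 0/1 matrix."""
    item_id_list = np.asarray(item_id_list)
    matrix = np.zeros((item_id_list.shape[0], item_num), dtype=np.float32)
    rows = np.arange(item_id_list.shape[0])[:, None]  # broadcast over columns
    matrix[rows, item_id_list] = 1.0
    matrix[:, 0] = 0.0  # id 0 is padding, not a real interaction
    return matrix

# Internal item ids, padded with 0, as in the Interaction from the question.
histories = [[1, 2, 3, 0, 0],
             [4, 5, 0, 0, 0]]
rating_matrix = history_to_rating_matrix(histories, item_num=6)
print(rating_matrix)
```

The item ids must be internal ids from the trained dataset, and item_num must match the size the model was trained with (padding slot included).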

AsyaEvloeva commented 2 years ago

thank you for this example!! now i get it😁

I'm not looking for a solution to the cold start problem, but rather for recommendations for new users who do come with interaction records. So if I'm using the RecVAE model, is there any built-in way to update the dataset with such new records, continue training the model, and make predictions on the updated dataset?

Sherry-XLL commented 2 years ago

@AsyaEvloeva Sorry, we don't have such an interface at present.