RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

Unable to use TO_RS eval setting with sequential model #718

Closed stsl256 closed 2 years ago

stsl256 commented 3 years ago

I want my sequential model to train on the first 75% of the ML dataset and test on the rest. The split should be temporally ordered and ratio-based. Instead, it throws an error.

Steps to reproduce:

from recbole.quick_start import run_recbole
run_recbole(model='STAMP', dataset='ml-100k', config_dict={"eval_setting": "TO_RS,full", "split_ratio": [0.75, 0, 0.25]})

returns

17 Feb 16:38    INFO General Hyper Parameters: 
gpu_id=0
use_gpu=True
seed=2020
state=INFO
reproducibility=True
data_path=/home/user/env/lib/python3.8/site-packages/recbole/config/../dataset_example/ml-100k
show_progress=True

Training Hyper Parameters: 
checkpoint_dir=saved
epochs=300
train_batch_size=2048
learner=adam
learning_rate=0.001
training_neg_sample_num=1
training_neg_sample_distribution=uniform
eval_step=1
stopping_step=10

Evaluation Hyper Parameters: 
eval_setting=TO_RS,full
group_by_user=True
split_ratio=[0.75, 0, 0.25]
leave_one_num=2
real_time_process=False
metrics=['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk=[10]
valid_metric=MRR@10
eval_batch_size=4096

Dataset Hyper Parameters: 
field_separator=    
seq_separator= 
USER_ID_FIELD=user_id
ITEM_ID_FIELD=item_id
RATING_FIELD=rating
TIME_FIELD=timestamp
seq_len=None
LABEL_FIELD=label
threshold=None
NEG_PREFIX=neg_
load_col={'inter': ['user_id', 'item_id', 'rating', 'timestamp']}
unload_col=None
unused_col=None
additional_feat_suffix=None
rm_dup_inter=None
lowest_val=None
highest_val=None
equal_val=None
not_equal_val=None
filter_inter_by_user_or_item=True
max_user_inter_num=None
min_user_inter_num=None
max_item_inter_num=None
min_item_inter_num=None
fields_in_same_space=None
preload_weight=None
normalize_field=None
normalize_all=True
ITEM_LIST_LENGTH_FIELD=item_length
LIST_SUFFIX=_list
MAX_ITEM_LIST_LENGTH=50
POSITION_FIELD=position_id
HEAD_ENTITY_ID_FIELD=head_id
TAIL_ENTITY_ID_FIELD=tail_id
RELATION_ID_FIELD=relation_id
ENTITY_ID_FIELD=entity_id

17 Feb 16:38    INFO ml-100k
The number of users: 944
Average actions of users: 106.04453870625663
The number of items: 1683
Average actions of items: 59.45303210463734
The number of inters: 100000
The sparsity of the dataset: 93.70575143257098%
Remain Fields: ['user_id', 'item_id', 'rating', 'timestamp']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-4085842e7947> in <module>
      1 from recbole.quick_start import run_recbole
----> 2 run_recbole(model='SHAN', dataset='ml-100k', config_dict={"eval_setting": "TO_RS,full", "split_ratio": [0.75, 0, 0.25]})

~/env/lib/python3.8/site-packages/recbole/quick_start/quick_start.py in run_recbole(model, dataset, config_file_list, config_dict, saved)
     41 
     42     # dataset splitting
---> 43     train_data, valid_data, test_data = data_preparation(config, dataset)
     44 
     45     # model loading and initialization

~/env/lib/python3.8/site-packages/recbole/data/utils.py in data_preparation(config, dataset, save)
     77 
     78     built_datasets = dataset.build(es)
---> 79     train_dataset, valid_dataset, test_dataset = built_datasets
     80     phases = ['train', 'valid', 'test']
     81     sampler = None

TypeError: cannot unpack non-iterable NoneType object

I thought this should work. I tried the same script with SASRec and SHAN (both sequential) with the same result. It does work with the general model NeuMF, though. I've got a feeling that I'm missing something, but nothing related is mentioned in the documentation...

chenyushuo commented 3 years ago

Sorry, we currently only support the TO_LS eval setting with sequential models; TO_RS is not supported. We will consider it in a later version.

rowedenny commented 3 years ago

@chenyushuo I am wondering if this enhancement has been assigned to any collaborator; if not, I would be happy to take it.

In general, there are multiple time-aware data partition schemes: 1) for session-based recommendation, a ratio-based partition could be: take the earliest 80% of sessions as training, the subsequent 10% as validation, and the most recent 10% as testing; I assume this fits the TO_RS eval setting.

2) for sequential recommendation, a cut-off-time-based partition could be: given a cut-off time, all interactions before it are treated as training, each user's first interaction after the cut-off as validation, and the second as testing. I am not sure which eval setting fits this case; a toy sketch of both options is at the end of this comment.

Would you mind clarifying which one(s) we would like to implement?
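
To make the two options concrete, here is a small standalone sketch in plain Python (not RecBole code; the toy interactions and the cut-off value are made up):

# Standalone toy sketch (plain Python, not RecBole code) of the two temporal
# partitions described above; (user_id, item_id, timestamp) triples, made up.
inters = [(1, 10, 1), (1, 11, 2), (2, 10, 3), (2, 12, 4), (3, 11, 5),
          (3, 13, 6), (1, 12, 7), (2, 13, 8), (3, 10, 9), (1, 13, 10)]

# 1) Ratio-based temporal split (the TO_RS flavour): order everything by time
#    and cut at 80% / 10% / 10% of the records (sessions would be cut the same way).
by_time = sorted(inters, key=lambda x: x[2])
n = len(by_time)
c1, c2 = int(n * 0.8), int(n * 0.9)
train_rs, valid_rs, test_rs = by_time[:c1], by_time[c1:c2], by_time[c2:]

# 2) Cut-off-time split: everything strictly before the cut-off is training;
#    per user, the first interaction at or after the cut-off is validation,
#    the second is test.
cutoff = 7
train_cut = [x for x in by_time if x[2] < cutoff]
later = {}
for x in by_time:
    if x[2] >= cutoff:
        later.setdefault(x[0], []).append(x)
valid_cut = [seq[0] for seq in later.values()]
test_cut = [seq[1] for seq in later.values() if len(seq) >= 2]

print(len(train_rs), len(valid_rs), len(test_rs))      # 8 1 1
print(len(train_cut), len(valid_cut), len(test_cut))   # 6 3 1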

2017pxy commented 3 years ago

@rowedenny Firstly, thanks so much for your enthusiasm for contributing to RecBole. However, because the implementation of this feature is not easy (you may need to rebuild the dataset, dataloader, and so on) and the implementation details still need to be discussed by our team, we prefer to develop this feature ourselves. Thanks again for your close attention to RecBole. Any suggestions by issue or improvements by PR are welcome!

rowedenny commented 3 years ago

I assume we can directly reimplement split_by_ratio to support TO_RS, and then make a tiny modification to the build function accordingly. See the following code:

def split_by_ratio(self, ratios, group_by=None):
    """Split (augmented) interaction records by ratios, grouped by user_id.

    Args:
        ratios (list): List of split ratios. No need to be normalized.
        group_by (str, optional): Field name that interaction records should be grouped by before splitting.
            Defaults to ``None``.

    Returns:
        list: List of :class:`~Dataset`, whose interaction features have been split.

    Note:
        Other than the first one, each part is rounded down.
    """
    # assumes `copy` and `numpy as np` are already imported at module level,
    # as they are in recbole.data.dataset
    self.logger.debug(f'split by ratios [{ratios}], group_by=[{group_by}]')
    if group_by is None:
        raise ValueError('Ratio-based split for sequential models requires a group field.')
    if group_by != self.uid_field:
        raise ValueError('Ratio-based split for sequential models requires grouping by user.')

    tot_ratio = sum(ratios)
    ratios = [_ / tot_ratio for _ in ratios]

    self.prepare_data_augmentation()
    grouped_index = list(self._grouped_index(self.uid_list))
    tot_cnt = len(grouped_index)
    split_ids = self._calcu_split_ids(tot=tot_cnt, ratios=ratios)

    next_index = [[] for _ in range(len(ratios))]
    next_ds = []
    for index, start, end in zip(next_index, [0] + split_ids, split_ids + [tot_cnt]):
        for id in range(start, end):
            index.extend(grouped_index[id])
        print(index[0], index[-1])  # debug output: first/last augmented index in this part

        ds = copy.copy(self)
        for field in ['uid_list', 'item_list_index', 'target_index', 'item_list_length']:
            setattr(ds, field, np.array(getattr(ds, field)[index]))
        setattr(ds, 'mask', np.ones(len(self.inter_feat), dtype=bool))
        next_ds.append(ds)
    # targets belonging to valid/test users must be masked out of earlier phases
    next_ds[0].mask[self.target_index[next_index[1] + next_index[2]]] = False
    next_ds[1].mask[self.target_index[next_index[2]]] = False

    return next_ds
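
For reference, once build() learns to dispatch the TO_RS setting to a method like this, the quick-start call from the original report would be a natural smoke test (purely illustrative, not something that works today):

# Hypothetical smoke test, assuming build() dispatches TO_RS to the method
# sketched above; it is the same call as in the original report.
from recbole.quick_start import run_recbole

run_recbole(
    model='STAMP',
    dataset='ml-100k',
    config_dict={'eval_setting': 'TO_RS,full', 'split_ratio': [0.75, 0, 0.25]},
)
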
2017pxy commented 3 years ago

@rowedenny Hi! Firstly, thanks for your warm-hearted advice and we have read your code carefully.

If my understanding is correct, in your implementation the data splitting is based on the number of users, which is a little different from the split strategy in RecBole. To be more specific, let's take an example:

If the split ratio is [0.8, 0.1, 0.1], and the original data is like this (NOTE: it is not the atomic file, just an example):

UserID  ItemID_list
1   [1,2,3,4,5,6,7,8,9,10]
2   [1,2,3,4,5,6,7,8,9,10]
3   [1,2,3,4,5,6,7,8,9,10]
4   [1,2,3,4,5,6,7,8,9,10]
5   [1,2,3,4,5,6,7,8,9,10]

In your strategy, the data will be split as:

Train:
1   [1,2,3,4,5,6,7,8,9,10]
2   [1,2,3,4,5,6,7,8,9,10]
3   [1,2,3,4,5,6,7,8,9,10]
Valid:
4   [1,2,3,4,5,6,7,8,9,10]
Test:
5   [1,2,3,4,5,6,7,8,9,10]

In RecBole, the data will be split as:

Train:
1   [1,2,3,4,5,6,7,8]
2   [1,2,3,4,5,6,7,8]
3   [1,2,3,4,5,6,7,8]
4   [1,2,3,4,5,6,7,8]
5   [1,2,3,4,5,6,7,8]
Valid:
1   [9]
2   [9]
3   [9]
4   [9]
5   [9]
Test:
1   [10]
2   [10]
3   [10]
4   [10]
5   [10]

(I must admit that it's quite hard to describe the difference between them in words; hopefully the example above, together with the toy sketch below, makes it clear.)
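
To spell that difference out, here is a toy sketch (plain Python, not RecBole code) that reproduces the example above; RecBole's exact rounding rules are ignored:

# Toy sketch (plain Python, not RecBole code) of the two strategies above,
# using the same 5 users with 10 items each as in the example.
data = {u: list(range(1, 11)) for u in range(1, 6)}   # users 1..5, items 1..10

# Strategy in the proposed code: split the *users*, so each user's whole
# sequence lands in exactly one phase.
train_u = {u: data[u] for u in (1, 2, 3)}
valid_u = {u: data[u] for u in (4,)}
test_u  = {u: data[u] for u in (5,)}

# RecBole's ratio split: split *each user's* sequence, so every user appears
# in all three phases.
train_s = {u: items[:8]  for u, items in data.items()}
valid_s = {u: items[8:9] for u, items in data.items()}
test_s  = {u: items[9:]  for u, items in data.items()}

print(sorted(train_u), sorted(valid_u), sorted(test_u))   # [1, 2, 3] [4] [5]
print(train_s[1], valid_s[1], test_s[1])                  # [1, 2, 3, 4, 5, 6, 7, 8] [9] [10]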

By the way, the problem with splitting by ratio for sequential models is that, compared with other models, we need to perform data augmentation for sequential models. And there are some details about data augmentation and the ratio-based split that still need to be discussed, for example: should the data augmentation be done before or after the ratio-based split?

That's why we don't support splitting by ratio for sequential models so far. If you find some references (papers or other projects' implementations) about this feature, or you have some advice about it, please let us know, thanks!

rowedenny commented 3 years ago

Hi @2017pxy, firstly thank you for your time in checking the code. Given the discussion above, here are my replies.

  1. What is the motivation for this function? In real practice, we would like to collect the historical behaviors of all the users we have, train a model, and then evaluate it on an independent set of users.

  2. Details for the implementation, especially for data augmentation.

    we need to perform data augmentation for sequential models. Should the data augmentation be done before or after the ratio-based split?

Yes, I assume we need data augmentation before the split. This means the total number of sequences (train + valid + test) will be the same as with leave-one-out. You may notice that I first call self.prepare_data_augmentation() to perform the augmentation, and then make the split on the augmented self.uid_list instead of on self.inter_feat[self.uid_field]. Going back to your example, the training examples for user 1 will contain not only the full sequence but also the rest of the augmented sequences.
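
A conceptual toy sketch of this order of operations (plain Python, not RecBole internals; the sequences are made up):

# Augment first: one (prefix, target) pair per position of every user's
# sequence, i.e. exactly as many augmented entries as leave-one-out produces.
sequences = {1: [1, 2, 3, 4], 2: [5, 6, 7], 3: [8, 9, 10, 11]}

uid_list, item_list_index, target_index = [], [], []
for uid, items in sequences.items():
    for t in range(1, len(items)):
        uid_list.append(uid)
        item_list_index.append(items[:t])   # the prefix
        target_index.append(items[t])       # the item to predict

# Then split by *users* (a toy 1/1/1 user assignment here), so a training user
# keeps all of its augmented sequences.
phase_users = {'train': {1}, 'valid': {2}, 'test': {3}}
phase_index = {
    phase: [i for i, u in enumerate(uid_list) if u in users]
    for phase, users in phase_users.items()
}

print(len(uid_list))                                    # 8 augmented sequences in total
print({p: len(idx) for p, idx in phase_index.items()})  # {'train': 3, 'valid': 2, 'test': 3}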

By the way, I noticed that the data augmentation for the sequential dataset adds a new attribute, mask. I guess it is related to BERT4Rec and S^3Rec; is there any up-to-date doc or related model implementation?

2017pxy commented 3 years ago

@rowedenny Hi, I think your split strategy is reasonable. We will carefully consider your design and discuss the implementation details, which may take some time.

About the mask, it has nothing to do with pre-training models. In fact, it's a fix for your PR (#698). I guess you misunderstood the meaning of self.uid_list, and we also missed this line when we reviewed your code.

local_inter_feat = self.inter_feat[self.uid_list]  # here the self.uid_list is not the index of inter_feat

Following your design, we created a variable called mask and fixed this bug in #760.

rowedenny commented 3 years ago

I see. So please allow me to confirm two things:

  1. self.uid_list is actually a list of the user ids corresponding to the augmented sequences. So the bug in my PR (#698) is: suppose a user has the sequence [1,2,3,4]; after data augmentation there will be three training sequences [1], [1,2], [1,2,3], and all three carry the same user id in self.uid_list. However, if we call local_inter_feat = self.inter_feat[self.uid_list], it will mistakenly fetch THREE independent users instead of one, and self.inter_feat[self.uid_list] will also pull in sequences belonging to the valid and test sets. Is that correct? And the fix for this is in #760.
  2. For the TO_RS split, the local_inter_feat or mask may differ from the leave_one_out split accordingly. In my design I have not specifically considered this, so the mask value may have issues. I will take a closer look at it and reply with an updated version shortly.

Thanks for pointing out my mistake, and sorry for the inconvenience.

rowedenny commented 3 years ago

I have looked through my implementation of TO_RS with the mask fix; to the best of my knowledge, it should be correct. Please feel free to refer to the design, and please do let me know if there is any issue that I may not be aware of.

Thank you!

2017pxy commented 3 years ago

@rowedenny

self.uid_list is actually a list of the user ids corresponding to the augmented sequences. So the bug in my PR (#698) is: suppose a user has the sequence [1,2,3,4]; after data augmentation there will be three training sequences [1], [1,2], [1,2,3], and all three carry the same user id in self.uid_list. However, if we call local_inter_feat = self.inter_feat[self.uid_list], it will mistakenly fetch THREE independent users instead of one, and self.inter_feat[self.uid_list] will also pull in sequences belonging to the valid and test sets. Is that correct? And the fix for this is in #760.

I am sorry to tell you that your understanding is not right. The index of self.inter_feat is not organized by uid. Here self.inter_feat is an Interaction; for details, you can see the docs of Interaction and the code of Interaction.

If you read the code, you will find that we rewrote __getitem__(). That's why we can pick the local_inter_feat by mask. About the mask: it's a bool list whose length is equal to the length of the whole inter_feat, and it decides which indices you can get in each phase (train, valid, and test). Let's take leave-one-out splitting as an example: if the whole sequence is 1,2,3,4,5 and its indices are [0,1,2,3,4], we will split it as:

Train:
1,2,3
Valid:
1,2,3,4
Test:
1,2,3,4,5

Then, the mask of each phase will be:

Train:
[1,1,1,0,0]
Valid:
[1,1,1,1,0]
Test:
[1,1,1,1,1]

And in inter_matrix(), we pick the local_inter_feat from the whole inter_feat by mask in each phase, so what you get depends on your mask.

Hope you can understand.
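
A toy NumPy illustration of those masks (stand-in arrays, not the real Interaction object):

import numpy as np

# Stand-in for the whole inter_feat: the sequence 1,2,3,4,5 at indices [0..4].
inter_feat = np.array([1, 2, 3, 4, 5])

# One boolean mask per phase, each with the same length as inter_feat.
train_mask = np.array([True, True, True, False, False])
valid_mask = np.array([True, True, True, True, False])
test_mask  = np.array([True, True, True, True, True])

# In each phase, the "local" interactions are whatever that phase's mask keeps.
print(inter_feat[train_mask])   # [1 2 3]
print(inter_feat[valid_mask])   # [1 2 3 4]
print(inter_feat[test_mask])    # [1 2 3 4 5]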

rowedenny commented 3 years ago

Yes. Thank you for the clarification.

For self.uid_list in prepare_data_augmentation, take the case above as an example. Suppose the sequence comes from user u_1: after augmentation, uid_list would be [1, 1, 1, 1] and the corresponding item_list_index would be [[1], [1, 2], [1, 2, 3], [1, 2, 3, 4]], so the mistake in local_inter_feat = self.inter_feat[self.uid_list] is now self-evident. However, as you explained, picking local_inter_feat with the phase's mask (e.g. the train mask in the train phase) is the correct approach.
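
To make that concrete, a toy NumPy sketch of the difference (stand-in arrays, not the real inter_feat):

import numpy as np

inter_rows = np.array([10, 11, 12, 13, 14])             # stand-in interaction rows at indices 0..4
uid_list   = np.array([1, 1, 1, 1])                     # user id of each augmented sequence
train_mask = np.array([True, True, True, False, False])

# Indexing by uid_list fetches row 1 four times: wrong rows, and duplicated.
print(inter_rows[uid_list])     # [11 11 11 11]
# Indexing by the phase mask returns exactly what the train phase should see.
print(inter_rows[train_mask])   # [10 11 12]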

2017pxy commented 3 years ago

@rowedenny Yes, you are right.