Closed stsl256 closed 2 years ago
Sorry, we can only support TO_LS eval setting with sequential model currently, TO_RS is not supported. We will consider it in the later version.
@chenyushuo I am wondering if this enhancement has been assigned to any collaborator, if not I would be happy to take it.
In general, there are multiple temporal associated data partitions: 1) for session-based recommendation, a ratio based data partition could be: take the earliest 80% sessions as training, subsequent 10% as validation and most recent 10% as testing; I assume this fits the TO_RS eval setting.
2) for sequential-based recommendation, a cut-off time-based data partition could be: given a cut-off time, all the interactions before are treated as training, the first interaction after the timestamp for validation, and the second for testing. I have no idea which eval fit this case.
Would you mind clarifying which one(s) we would like to implement?
@rowedenny Firstly, thanks so much for your enthusiasm for making contributions to RecBole. However, because the implemention of this feature is not easy (you may need to rebuild dataset, dataloader and so on) and the details about the implemention still need to be discussed by our team, we prefer to develop this feature by ourselves. Thanks again for your close attention to RecBole. Any suggestion by issue or improvement by PR are welcome!
I assume we can directly reimplement split_by_ratio
to make TO_RS, and then make a tiny modification on function build
accordingly.
See the following code:
def split_by_ratio(self, ratios, group_by=None):
"""Split interaction records by ratios according to user_id
Args:
ratios (list): List of split ratios. No need to be normalized.
group_by (str, optional): Field name that interaction records should grouped by before splitting.
Defaults to ``None``
Returns:
list: List of :class:`~Dataset`, whose interaction features has been split.
Note:
Other than the first one, each part is rounded down.
"""
self.logger.debug(f'split by ratios [{ratios}], group_by=[{group_by}]')
if group_by is None:
raise ValueError('Leave one out strategy require a group field.')
if group_by != self.uid_field:
raise ValueError('Sequential models random split require group by user.')
tot_ratio = sum(ratios)
ratios = [_ / tot_ratio for _ in ratios]
self.prepare_data_augmentation()
grouped_index = list(self._grouped_index(self.uid_list))
tot_cnt = len(grouped_index)
split_ids = self._calcu_split_ids(tot=tot_cnt, ratios=ratios)
next_index = [[] for _ in range(len(ratios))]
next_ds = []
for index, start, end in zip(next_index, [0] + split_ids, split_ids + [tot_cnt]):
for id in range(start, end):
index.extend(grouped_index[id])
print(index[0], index[-1])
ds = copy.copy(self)
for field in ['uid_list', 'item_list_index', 'target_index', 'item_list_length']:
setattr(ds, field, np.array(getattr(ds, field)[index]))
setattr(ds, 'mask', np.ones(len(self.inter_feat), dtype=np.bool))
next_ds.append(ds)
next_ds[0].mask[self.target_index[next_index[1] + next_index[2]]] = False
next_ds[1].mask[self.target_index[next_index[2]]] = False
return next_ds
@rowedenny Hi! Firstly, thanks for your warm-hearted advice and we have read your code carefully.
If my understand is correct, in your implemention, the data splitting is base on the user number, which is a little different from the split strategy in RecBole. To be more specific, let's take an example:
If the split ratio is [0.8, 0.1, 0.1], and the original data is like this (NOTE: it is not the atomic file, just an example):
UserID ItemID_list
1 [1,2,3,4,5,6,7,8,9,10]
2 [1,2,3,4,5,6,7,8,9,10]
3 [1,2,3,4,5,6,7,8,9,10]
4 [1,2,3,4,5,6,7,8,9,10]
5 [1,2,3,4,5,6,7,8,9,10]
In your strategy, data will be splitted as:
Train:
1 [1,2,3,4,5,6,7,8,9,10]
2 [1,2,3,4,5,6,7,8,9,10]
3 [1,2,3,4,5,6,7,8,9,10]
Valid:
4 [1,2,3,4,5,6,7,8,9,10]
Test:
5 [1,2,3,4,5,6,7,8,9,10]
In RecBole, data will be splitted as:
Train:
1 [1,2,3,4,5,6,7,8]
2 [1,2,3,4,5,6,7,8]
3 [1,2,3,4,5,6,7,8]
4 [1,2,3,4,5,6,7,8]
5 [1,2,3,4,5,6,7,8]
Valid:
1 [9]
2 [9]
3 [9]
4 [9]
5 [9]
Test:
1 [10]
2 [10]
3 [10]
4 [10]
5 [10]
(I must admit that it's quite hard to describe the difference between them with words, hope you can understand. )
By the way, the problem of data split (by ratio) for sequential models is that: compared with other models, we need to make data augmentation for sequential models. And there are some details need to be discussed about data augmentation and data split (by ratio) like: should the data augmentation done before or after the data split (by ratio)?
That's why we don't support data split (by ratio) for sequential models until now. If you find some referrence (paper or other projects' implemention) about this feature or you have some advice about it, please let us know, thx!
Hi @2017pxy, firstly thank you for your time in checking the code. Given the discussion above, here are my replies.
What is the motivation for this function? In real practice, we would like to collect historical behaviors of all users we could have, train a model and then evaluate it on an independent set of users.
Details for the implementation, especially for data augmentation.
we need to make data augmentation for sequential models. should the data augmentation done before or after the data split (by ratio)?
Yes, I assume we need data augmentation before the split. So it means that the total number of sequence (train + valid + test) will be the same with leave-one-out.
You may notice that I firstly call self.prepare_data_augmentation()
, to make the data augmentation. Then make the split on augmented self.uid_list
instead of self.inter_feat[self.uid_field]
. Going back to your example, the training examples for user 1 will not only contain the full sequence but also contain the rest augmented sequences.
By the way, I notice that the data augmentation for the sequential dataset adds a new attribute mask
, I guess it is related to the BERT4Rec and S^3Rec, is there any latest doc or related implementation for models?
@rowedenny Hi, I think your split strategy is reasonable, we will carefully consider your design and discuss the implementation details, and it may take some time.
About the mask
, it has nothing to do with pre-train model. In fact, it's a fix for your PR (#698).
I guess you misunderstand the meaning of self.uid_list
and we also missed this line when we reviewed your code.
126 local_inter_feat = self.inter_feat[self.uid_list] # here the self.uid_list is not the index of inter_feat
Following your design, we create a variable called mask
and fix this bug in #760.
I see. So please allow me to confirm two things:
self.uid_list
is actually a list of all users corresponding to augmented sequences. So the bug from my PR (#698) is, suppose a user with a sequence [1,2,3,4], after the data augmentation, there will be three sequences as training [1], [1,2], [1,2,3], and all the three have the same user id in the self.uid_list
. However, if we call local_inter_feat = self.inter_feat[self.uid_list]
, it will mistakenly fetch THREE independent users, instead of one. Additionally, self.inter_feat[self.uid_list]
will get the sequences in valid and test set. Is that correct? and the fixation for this is in #760.local_inter_feat
or mask
may be different from leave_one_out split accordingly. Yet in my design, I have not specifically considered it, thus the value mask
may have issue. I will take a closer look at it, and replies with an updated version in short time.Thanks for pointing out my mistake and sorry for the inconvience.
I have looked through my implementation of TO_RS, with the fixation of mask
, to my best knowledge, it should be correct.
Please be free to refer to the design, and please do let me know if there is any issue that I may not be aware.
Thank you!
@rowedenny
self.uid_list is actually a list of all users corresponding to augmented sequences. So the bug from my PR (#698) is, suppose a user with a sequence [1,2,3,4], after the data augmentation, there will be three sequences as training [1], [1,2], [1,2,3], and all the three have the same user id in the self.uid_list. However, if we call local_inter_feat = self.inter_feat[self.uid_list], it will mistakenly fetch THREE independent users, instead of one. Additionally, self.inter_feat[self.uid_list] will get the sequences in valid and test set. Is that correct? and the fixation for this is in #760.
I am sorry to tell you that your understanding is not right. The index of self.inter_feat
is not organized by uid. Here the self.inter_feat
is an interaction
. About the interaction
, you can see docs of interaction
and the code of interaction
.
If you read the code, you will find that we rewrite the __getitem__()
. That's why we can pick the local_inter_feat
by mask
. About the mask
, it's a bool
list and the length is equal to the length of whole inter_feat
. It decides which index you can get in each phase(train, valid and test).
Let's take the leave-one-out splitting as an example:
If the whole sequence is 1,2,3,4,5
and their index is [0,1,2,3,4]
, we will split it as:
Train:
1,2,3
Valid:
1,2,3,4
Test:
1,2,3,4,5
Then, the mask
of each phase will be:
Train:
[1,1,1,0,0]
Valid:
[1,1,1,1,0]
Test:
[1,1,1,1,1]
And in inter_matrix()
, we pick the local_inter_feat
from the whole inter_feat
by mask
in each phase, so what you can get depend on your mask
.
Hope you can understand.
Yes. Thank you for the clarification.
For self.uid_list
in prepare_data_augmentation
, take the case above as an example. Suppose the sequence is from user u_1
, after augmentation, the uid_list would be [1, 1, 1]
, corresponding item_list_index
would be [[1], [1, 2], [1, 2, 3], [1, 2, 3, 4]]
, so the mistaken in local_inter_feat = self.inter_feat[self.uid_list]
is self-clear.
However, as you explain, using self.inter_feat[mask]
would call the mask from train phase, so this is the correct one.
@rowedenny Yes, you are right.
I want my sequential model to train on the first 75% of ML dataset and test on the rest. The split should be temporal ordered and ratio-based. In the end, it throws an error.
Steps to reproduce:
returns
I thought this should work. I tried same script with SASRec and SHAN (both sequential), same result. It works though with general NeuMF. I've got a feeling that I don't understand sth, however nothing related is mentioned in the documentation...