RUCAIBox / RecBole-GNN

Efficient and extensible GNNs enhanced recommender library based on RecBole.
MIT License
167 stars · 37 forks

Question: batch of training #65

Closed downeykking closed 1 year ago

downeykking commented 1 year ago

Hello, I'd like to ask a question about how training samples are generated.

If we have users = [0, 1] and pos_items = {0: [1, 2], 1: [3, 4]}, what data will RecBole generate if num_neg is 1 and the sampling method is uniform?

To my understanding, we will get four (user, pos_item, neg_item) training samples, e.g., (0, 1, 3), (0, 2, 4), (1, 3, 1), (1, 4, 2), i.e., all rating data is used. Do I understand correctly?

But in LightGCN, NGCF, and SGL, I find that they all randomly select users first and then generate samples for them. For example, if the number of ratings is 100,000, LightGCN will randomly sample 100,000 users and then sample a positive and a negative item for each. In RecBole, by contrast, the 100,000 ratings are used directly as the (user, pos_item) pairs.

I am a little confused by the difference between them. Which one is usually used, and which one is more reasonable?

Thanks in advance!

hyp1231 commented 1 year ago

Hi, thanks for your attention! It's really a good question.

The concern is mainly about how to sample training instances: interaction-oriented (implemented in RecBole) or user-oriented (as you described). One can refer to Steffen Rendle, Item Recommendation from Implicit Feedback [paper], where interaction-oriented sampling is regarded as the basic choice, named "Uniform Sampling without Weight". From this perspective, user-oriented sampling is actually a weighted strategy for sampling interactions, i.e., interactions from cold-start users receive higher weights.

To the best of my knowledge, there has not been a paper discussing which of these two strategies is better. (Please feel free to let me know if I missed something, thx!!) Empirically, I believe comparisons between different methods are fair as long as they use the same sampling strategy. The effects of the two strategies should be explored further.
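To make the distinction concrete, here is a minimal toy sketch of the two strategies (illustrative only, not RecBole's actual code; the data structures and helper names are made up for this example):

```python
import random

# Toy data: user -> list of positively interacted items.
pos_items = {0: [1, 2], 1: [3, 4]}
n_items = 6  # item ids are 0..5

def sample_negative(user):
    """Uniformly sample one item the user has not interacted with."""
    while True:
        item = random.randrange(n_items)
        if item not in pos_items[user]:
            return item

def interaction_oriented():
    """Interaction-oriented (RecBole-style): one triple per interaction,
    so every (user, pos_item) pair appears exactly once per epoch."""
    return [(u, i, sample_negative(u))
            for u, items in pos_items.items() for i in items]

def user_oriented(n_samples):
    """User-oriented (LightGCN-style): draw n_samples users uniformly,
    then one positive and one negative for each. Interactions of users
    with few ratings are effectively over-weighted."""
    users = list(pos_items)
    triples = []
    for _ in range(n_samples):
        u = random.choice(users)
        triples.append((u, random.choice(pos_items[u]), sample_negative(u)))
    return triples
```

With the toy data above, `interaction_oriented()` always yields exactly four triples (one per rating), while `user_oriented(4)` may repeat some interactions and drop others within an epoch.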

downeykking commented 1 year ago

Thanks for your kind reply. It really helps a lot!

And another question: in RecBole, if we want to split the data by ratio into train, valid, and test sets, from https://github.com/RUCAIBox/RecBole/blob/master/recbole/data/dataset/dataset.py#L1623 and https://github.com/RUCAIBox/RecBole/blob/23fdeb00f9334b66d23e12b73eb4fd01d413dccd/recbole/properties/overall.yaml#L43

If we have ratings {user: item} like {1:2}, {1:3}, {1:4}, {2:5}, {2:6}, RecBole will produce train {1:2}, {1:3}, {2:5} and test {1:4}, {2:6}, instead of randomly splitting the samples into, e.g., train {1:2}, {1:3}, {1:4} and test {2:5}, {2:6}. Do I understand correctly?

hyp1231 commented 1 year ago

Yes! In our implementation, the dataset is split within each user, as you described.
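The per-user ratio split described above can be sketched as follows (a simplified illustration, not RecBole's actual implementation; the function name and 0.7 ratio are chosen to reproduce the toy example from the question):

```python
def split_by_ratio_per_user(interactions, train_ratio=0.7):
    """Split each user's interactions separately, so every user with
    enough interactions appears in both train and test."""
    # Group interactions by user, preserving input order.
    by_user = {}
    for user, item in interactions:
        by_user.setdefault(user, []).append(item)
    train, test = [], []
    for user, items in by_user.items():
        cut = int(len(items) * train_ratio)  # first part -> train
        train += [(user, i) for i in items[:cut]]
        test += [(user, i) for i in items[cut:]]
    return train, test

# e.g. split_by_ratio_per_user([(1, 2), (1, 3), (1, 4), (2, 5), (2, 6)])
# -> train [(1, 2), (1, 3), (2, 5)], test [(1, 4), (2, 6)]
```

This contrasts with a global random split, which could leave some users entirely absent from the training set.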

downeykking commented 1 year ago

Thank you for your reply! :)