RUCAIBox / RecSysDatasets

This is a repository of public data sources for Recommender Systems (RS).
https://recbole.io/

Yelp Statistics Question #95

Closed RuihongQiu closed 3 years ago

RuihongQiu commented 3 years ago

Hi,

Thank you for providing so many processed datasets.

I have a question when using the yelp dataset with RecBole.

I mainly use it for the sequential recommendation.

The general statistics of yelp reported in other papers look like this:

From the S3Rec paper: [image]

From the BERT4Rec paper: [image]

Both papers say they filter out the items and users appearing fewer than 5 times. I first downloaded the processed dataset from Google Drive. Then I set the dataset config in RecBole as:

max_user_inter_num = None
min_user_inter_num = 5
max_item_inter_num = None
min_item_inter_num = 5

The logged statistics for yelp are:

29 Jul 20:04    INFO  yelp
The number of users: 329840
Average actions of users: 15.778049290714561
The number of items: 124462
Average actions of items: 41.81403009778163
The number of inters: 5204216
The sparsity of the dataset: 99.98732303718786%

Why is the dataset so large, and so different from what the other papers report? Is my config wrong somewhere?

RuihongQiu commented 3 years ago

I found out that the S3Rec preprocessing includes a time filter: only interactions from 2019 are kept. Are there any similar args that can be used in RecBole?

hyp1231 commented 3 years ago

Hi, in RecBole 0.2.1 you can use lowest_val and highest_val to filter interactions by field value. Details can be found in our API Doc.
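For readers who want to see what this kind of filtering amounts to, here is a minimal plain-Python sketch of a value-range filter followed by a minimum-interaction filter. It is not RecBole's implementation: the function name filter_interactions is hypothetical, and treating the bounds as inclusive and repeating the count filter until nothing changes are both assumptions.

```python
from collections import Counter


def filter_interactions(inters, lowest_ts, highest_ts, min_count=5):
    """Filter (user, item, timestamp) tuples by time window and activity.

    Hypothetical helper, not part of RecBole. Assumes inclusive bounds.
    """
    # Step 1: keep only interactions inside the timestamp window.
    kept = [r for r in inters if lowest_ts <= r[2] <= highest_ts]

    # Step 2: drop users and items with fewer than min_count interactions.
    # Removing a user can push an item below the threshold (and vice versa),
    # so repeat until a pass removes nothing.
    while True:
        user_counts = Counter(user for user, _, _ in kept)
        item_counts = Counter(item for _, item, _ in kept)
        survivors = [
            r for r in kept
            if user_counts[r[0]] >= min_count and item_counts[r[1]] >= min_count
        ]
        if len(survivors) == len(kept):
            return survivors
        kept = survivors
```

Applying the time window first matters: users and items are then counted only over the interactions that remain, which is why the filtered dataset can be much smaller than the full one.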

RuihongQiu commented 3 years ago

Thank you for the great suggestion.

Using the following config:

min_user_inter_num: 5
min_item_inter_num: 5
lowest_val:
    timestamp: 1546264800
highest_val:
    timestamp: 1577714400
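A note on the two timestamp bounds: they are Unix epoch seconds. The sketch below shows one way to derive such values from calendar dates. The UTC+10 offset is an assumption inferred from the numbers themselves (with it, the bounds come out as 2019-01-01 00:00 and 2019-12-31 00:00); the thread does not say which time zone was used, and to_epoch is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+10 offset, inferred from the values in the config.
UTC_PLUS_10 = timezone(timedelta(hours=10))


def to_epoch(year, month, day, tz=UTC_PLUS_10):
    """Return the Unix timestamp (seconds) for midnight on the given date."""
    return int(datetime(year, month, day, tzinfo=tz).timestamp())


lowest = to_epoch(2019, 1, 1)     # 1546264800
highest = to_epoch(2019, 12, 31)  # 1577714400
print(lowest, highest)
```

With tz=timezone.utc the same dates give 1546300800 and 1577750400, so it is worth checking which convention the paper you are comparing against used.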

I already get a similar result:

31 Jul 10:08    INFO  yelp
The number of users: 30500
Average actions of users: 10.399750811502017
The number of items: 20069
Average actions of items: 15.805361769982062
The number of inters: 317182
The sparsity of the dataset: 99.94818172387231%
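As a sanity check when comparing against numbers from papers, the derived values in these logs can be recomputed from the three counts. The sketch below is an inference from the logged numbers, not taken from the RecBole source: the averages match if one ID is subtracted from the user and item counts (presumably a reserved padding ID), while the sparsity matches when the full counts are used. The function name derived_stats is hypothetical.

```python
def derived_stats(n_users, n_items, n_inters, reserved_ids=1):
    """Recompute average actions and sparsity (%) from logged counts.

    Assumption: the logged user/item counts each include `reserved_ids`
    extra IDs that have no interactions.
    """
    avg_user = n_inters / (n_users - reserved_ids)
    avg_item = n_inters / (n_items - reserved_ids)
    sparsity_pct = (1 - n_inters / (n_users * n_items)) * 100
    return avg_user, avg_item, sparsity_pct


# Counts from the filtered log above.
print(derived_stats(30500, 20069, 317182))
```

If this reading is right, the filtered dataset has 30499 real users and 20068 real items, which are the numbers to compare with the tables in other papers.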