RUCAIBox / RecBole-CDR

This is a library built upon RecBole for cross-domain recommendation algorithms
MIT License
82 stars 12 forks

Reported data statistics do not match #57

Open hv-abacus opened 1 year ago

hv-abacus commented 1 year ago

Hi, I downloaded the Amazon dataset from here: https://recbole.s3-accelerate.amazonaws.com/CrossDomain/Amazon.zip
The dataset statistics that you report here do not match what I compute from the original data.
I removed all rows with NaNs and computed the number of unique values in the user_id column of the original .inter files. This gives the following statistics:

Number of users in AmazonBooks: 687827
Number of users in AmazonMov: 66317
Number of overlapping users: 27516
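
For reference, the unfiltered count described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the exact script used for the numbers in this comment. It assumes the .inter files are tab-separated with RecBole-style `field:type` headers (for example `user_id:token`); the helper names `load_inter` and `raw_user_stats` are hypothetical.

```python
import io

import pandas as pd


def load_inter(source, sep="\t"):
    """Read a RecBole-style .inter file (assumed tab-separated).

    Headers are assumed to look like 'user_id:token'; the ':type'
    suffix is stripped so columns can be addressed by field name.
    """
    df = pd.read_csv(source, sep=sep)
    df.columns = [c.split(":")[0] for c in df.columns]
    return df


def raw_user_stats(source_df, target_df, user_col="user_id"):
    """Count unique users per domain and their overlap, with no k-core filtering."""
    src_users = set(source_df.dropna()[user_col].unique())
    tgt_users = set(target_df.dropna()[user_col].unique())
    return len(src_users), len(tgt_users), len(src_users & tgt_users)


# Toy stand-ins for the two .inter files (hypothetical data).
books = load_inter(io.StringIO(
    "user_id:token\titem_id:token\trating:float\n"
    "u1\tb1\t5\n"
    "u2\tb1\t4\n"
    "u2\tb2\t\n"   # missing rating, so this row is dropped by dropna()
    "u3\tb3\t3\n"
))
movies = load_inter(io.StringIO(
    "user_id:token\titem_id:token\trating:float\n"
    "u2\tm1\t5\n"
    "u4\tm1\t2\n"
))

stats = raw_user_stats(books, movies)
print("users (books, movies, overlap):", stats)
```

A count like this only drops rows with missing values, so it reflects the raw data rather than any filtered version of it.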

Am I doing something wrong?

Wicknight commented 9 months ago

Hello @hv-abacus, it seems that you are not filtering the data. The dataset statistics that we report were obtained after 10-core filtering, which is specified by the parameters 'user_inter_num_interval' and 'item_inter_num_interval' in the yaml file. If you use our yaml file to run the code directly on the Amazon datasets, you should obtain the same statistics.
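
To illustrate why filtering changes the counts, here is a minimal pandas sketch of k-core filtering. It is not the library's own implementation, and details such as how the two domains are handled may differ. It assumes both interval parameters set a lower bound of k interactions and that the filter is repeated until no more rows are removed; the function name `k_core_filter` and the toy data are hypothetical.

```python
import pandas as pd


def k_core_filter(df, k=10, user_col="user_id", item_col="item_id"):
    """Repeatedly drop rows whose user or item has fewer than k interactions.

    Removing a row can push another user or item below k, so the
    filter loops until every remaining row satisfies both bounds.
    """
    while True:
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return df
        df = df[keep]


# Toy interactions (hypothetical); k=2 keeps the example small.
inter = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3", "u3", "u4"],
    "item_id": ["i1", "i2", "i1", "i2", "i1", "i3", "i3"],
})

filtered = k_core_filter(inter, k=2)
print("users before:", inter["user_id"].nunique())
print("users after: ", filtered["user_id"].nunique())
```

In the toy data, removing u4 leaves item i3 with a single interaction, which in turn removes u3. The same cascade on a sparse dataset can remove a large share of users, which would explain unfiltered counts being much higher than the reported ones.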