RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License
3.37k stars, 606 forks

[🐛BUG] Problem with Amazon Digital Music #1520

Open mmosc opened 1 year ago

mmosc commented 1 year ago

Hi and thank you for this awesome library! (I will never get tired of saying that :sweat_smile: )

I wanted to use Amazon Digital Music as a dataset for recommendation. I tried both downloading the original file from here and converting it with the conversion tools, and using the already processed atomic file from your Google Drive. I then ran run_hyper.py. In both cases, however, I get the following error:

ValueError: [timestamp] is not exist in interaction [The batch_size of interaction: 836006
    user_id, torch.Size([836006]), cpu, torch.int64
    item_id, torch.Size([836006]), cpu, torch.int64

].

Any hints on how to solve this?

Cheers, Marta

AoiDragon commented 1 year ago

Hello @mmosc ,

It seems that Amazon Digital Music has no timestamp attribute, but the model tried to sort the interactions by time. You can try to avoid sorting the dataset by time and use another ordering strategy, such as random ordering.

mmosc commented 1 year ago

I am afraid this is not the problem. The dataset does indeed contain a timestamp attribute, as you can see from the RecBole atomic files, here:

user_id:token   item_id:token   rating:float    timestamp:float
A1ZCPG3D3HGRSS  0001388703  5.0 1387670400
AC2PL52NKPL29   0001388703  5.0 1378857600
A1SUZXBDZSDQ3A  0001388703  5.0 1362182400
A3A0W7FZXM0IZW  0001388703  5.0 1354406400
A12R54MKO17TW0  0001388703  5.0 1325894400

bardia-mhd commented 1 year ago

Hello

I am facing the same problem too, using BERT4Rec with the Amazon_Electronics dataset.

Sherry-XLL commented 1 year ago

Hello @mmosc @bardia-mhd!

Maybe the dataset is not loaded correctly because of the configuration. A reference configuration file for the Amazon datasets is as follows:

# dataset config
field_separator: "\t"
seq_separator: " "
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
RATING_FIELD: rating
TIME_FIELD: timestamp
NEG_PREFIX: neg_
ITEM_LIST_LENGTH_FIELD: item_length
LIST_SUFFIX: _list
MAX_ITEM_LIST_LENGTH: 50
POSITION_FIELD: position_id
load_col:
    inter: [user_id, item_id, rating, timestamp]

# data filtering for interactions
val_interval:
    rating: "[3,inf)"    
unused_col: 
    inter: [rating]

user_inter_num_interval: "[10,inf)"
item_inter_num_interval: "[10,inf)"

# training and evaluation
epochs: 500
train_batch_size: 4096
eval_batch_size: 40960000
valid_metric: NDCG@10
eval_args:
    split: {'LS': 'valid_and_test'}
    mode: full
    order: TO

# disable negative sampling
train_neg_sample_args: ~
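
The error above lists only user_id and item_id in the interaction, which would be consistent with timestamp being present in the file but not listed in load_col. As a rough sanity check before rerunning, something like the following sketch may help. It uses only the standard library and does not call RecBole; `header_fields` and `check_time_field` are hypothetical helper names, and the config is assumed to be available as a plain Python dict with the same keys as the YAML above.

```python
def header_fields(header_line, field_separator="\t"):
    """Return field names from an atomic-file header such as 'user_id:token<TAB>...'."""
    columns = header_line.rstrip("\n").split(field_separator)
    # Each column looks like 'name:type'; keep only the name.
    return [column.split(":", 1)[0].strip() for column in columns]


def check_time_field(config, header_line):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    time_field = config.get("TIME_FIELD")
    fields = header_fields(header_line, config.get("field_separator", "\t"))
    loaded = (config.get("load_col") or {}).get("inter") or []
    if time_field not in fields:
        problems.append(f"{time_field!r} is not a column of the .inter file")
    if time_field not in loaded:
        problems.append(f"{time_field!r} is not listed in load_col['inter']")
    return problems


# Header line of the .inter file, assuming tab-separated columns.
header = "user_id:token\titem_id:token\trating:float\ttimestamp:float"

good_config = {
    "field_separator": "\t",
    "TIME_FIELD": "timestamp",
    "load_col": {"inter": ["user_id", "item_id", "rating", "timestamp"]},
}
# Same config, but with timestamp left out of load_col (illustrative only).
bad_config = dict(good_config, load_col={"inter": ["user_id", "item_id"]})

print(check_time_field(good_config, header))
print(check_time_field(bad_config, header))
```

In practice the header line would be read from the first line of the dataset's .inter file, and the dict would be loaded from your own YAML config.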

Thanks for your attention to RecBole, and feel free to comment if you have further questions.