RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

[🐛BUG] Attribute Error when using benchmark filename #1593

Open cramraj8 opened 1 year ago

cramraj8 commented 1 year ago

I am trying to load separate files for train, validation, and test using the benchmark_filename parameter in the config file. However, I end up getting the AttributeError below:

File "train_run_benchmarkfile.py", line 2, in <module>
    run_recbole(model='GRU4Rec', dataset='DIGINETICA_ALL', config_file_list=['train_config_benchmarkfile.yaml'])
  File "/home/XXX/XXX/RecBole/RecBole/recbole/quick_start/quick_start.py", line 69, in run_recbole
    dataset = create_dataset(config)
  File "/home/XXX/XXX/RecBole/RecBole/recbole/data/utils.py", line 70, in create_dataset
    dataset = dataset_class(config)
  File "/home/XXX/XXX/RecBole/RecBole/recbole/data/dataset/sequential_dataset.py", line 38, in __init__
    self._benchmark_presets()
  File "/home/XXX/XXX/RecBole/RecBole/recbole/data/dataset/sequential_dataset.py", line 164, in _benchmark_presets
    self.item_id_list_field
AttributeError: 'SequentialDataset' object has no attribute 'item_id_list_field' 

When I looked for the variable item_id_list_field, I found that it was never initialized. What is its default value?

leoleojie commented 1 year ago

@cramraj8 Hello! Thanks for your attention to RecBole! In sequential recommendation, we need to load user interaction sequences. In RecBole's general mode, we process the raw data and build the user interaction sequences automatically. However, if the dataset is organized as benchmark files, RecBole does not process the data, which means you should preprocess the dataset yourself to get the interaction sequences. That is, the benchmark files you provide should contain users' interaction sequences, like:

user_id:token item_id_list:token_seq item_id:token
0 0 1 2 3 4 5

BTW, in benchmark mode we also do not apply data augmentation to the sequences. If you need this functionality, you can preprocess the dataset like:

user_id:token item_id_list:token_seq item_id:token
0 0 1 2 3 4 5
0 0 1 2 3 4
0 0 1 2 3
...
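The layout described above can be sketched in plain Python. This is only an illustration, not RecBole code: the helper names (`prefix_rows`, `format_row`, `build_inter_lines`) are hypothetical, and it assumes RecBole's default separators (a tab between fields, a space between items inside a sequence). It also assumes each example row reads as user, history, target, so that `0 0 1 2 3 4 5` means user 0, history `0 1 2 3 4`, target item 5.

```python
from typing import Dict, Iterator, List, Tuple

# Header copied from the example in this comment; fields are tab-separated (assumed default).
HEADER = "user_id:token\titem_id_list:token_seq\titem_id:token"


def prefix_rows(user_id: str, items: List[str]) -> Iterator[Tuple[str, List[str], str]]:
    """Yield (user, history, target) for every prefix of a session, longest first."""
    for end in range(len(items) - 1, 0, -1):
        yield user_id, items[:end], items[end]


def format_row(user_id: str, history: List[str], target: str,
               field_sep: str = "\t", seq_sep: str = " ") -> str:
    """Join one row using the assumed field and sequence separators."""
    return field_sep.join([user_id, seq_sep.join(history), target])


def build_inter_lines(sessions: Dict[str, List[str]], augment: bool = True) -> List[str]:
    """Build the lines of a benchmark .inter file from raw sessions.

    With augment=False only the longest prefix of each session is kept.
    """
    lines = [HEADER]
    for user_id, items in sessions.items():
        rows = list(prefix_rows(user_id, items))
        if not augment:
            rows = rows[:1]
        lines.extend(format_row(*row) for row in rows)
    return lines


# One session for user "0", matching the example rows in this comment.
lines = build_inter_lines({"0": ["0", "1", "2", "3", "4", "5"]})
single = build_inter_lines({"0": ["0", "1", "2", "3", "4", "5"]}, augment=False)
```

Writing `"\n".join(lines)` to each of the train, validation, and test files would give one file per split; how the sessions are divided across splits is left to the preprocessing step.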

cramraj8 commented 1 year ago

@leoleojie Thanks for the explanation. It really helped. However, after doing so I am still getting an error, this time inside the model. It is an index out of bounds error.

Traceback (most recent call last):
  File "run.py", line 3, in <module>
    run_recbole(model='GRU4Rec', dataset='mind_small_benchmarkfile_small', config_file_list=['config.yaml'])
  File "/home/xxx/xxx/RecBole/RecBole/recbole/quick_start/quick_start.py", line 81, in run_recbole
    flops = get_flops(model, dataset, config["device"], logger, transform)
  File "/home/xxx/xxx/RecBole/RecBole/recbole/utils/utils.py", line 345, in get_flops
    wrapper(*inputs)
  File "/opt/conda/miniconda/envs/paper/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/xxx/RecBole/RecBole/recbole/utils/utils.py", line 286, in forward
    return self.model.predict(interaction)
  File "/home/xxx/xxx/RecBole/RecBole/recbole/model/sequential_recommender/gru4rec.py", line 110, in predict
    seq_output = self.forward(item_seq, item_seq_len)
  File "/home/xxx/xxx/RecBole/RecBole/recbole/model/sequential_recommender/gru4rec.py", line 84, in forward
    seq_output = self.gather_indexes(gru_output, item_seq_len - 1)
  File "/home/xxx/xxx/RecBole/RecBole/recbole/model/abstract_recommender.py", line 174, in gather_indexes
    output_tensor = output.gather(dim=1, index=gather_index)
RuntimeError: index 2999 is out of bounds for dimension 1 with size 84

Do you have any idea why this error is happening? I also set alias_of_item_id: ['item_id_list'] in the config file. Thanks for your reply!

leoleojie commented 1 year ago

The error is raised on line 84 of gru4rec.py:

seq_output = self.gather_indexes(gru_output, item_seq_len - 1)

It seems that the item_seq_len you passed in does not match the actual length of the interaction sequences. Under normal conditions, when RecBole reads the benchmark files, item_seq_len is calculated automatically as:

self.inter_feat[self.item_list_length_field] = self.inter_feat[self.item_id_list_field].agg(len)

So something may still be wrong in how the benchmark files are processed. Could you please print the interaction sequence, item_seq = interaction[self.ITEM_SEQ], and its length, item_seq_len = interaction[self.ITEM_SEQ_LEN], and check whether they are consistent with each other?

Another simple check is to calculate the length yourself as item_seq_len = (item_seq != 0).sum(1) and see whether the error still occurs.
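The consistency check suggested here can be sketched without PyTorch, using nested lists in place of tensors. This is a rough illustration under two assumptions: that 0 is the padding id, as the expression `(item_seq != 0).sum(1)` implies, and that a length is valid only if `length - 1` is a legal position in the row. The function names and sample values are hypothetical.

```python
from typing import List

PAD = 0  # assumption: 0 marks padding positions in item_seq


def seq_lengths(item_seq: List[List[int]]) -> List[int]:
    """Count non-padding items per row (list version of (item_seq != 0).sum(1))."""
    return [sum(1 for item in row if item != PAD) for row in item_seq]


def invalid_lengths(item_seq: List[List[int]], item_seq_len: List[int]) -> List[int]:
    """Return the indices of rows whose reported length cannot index that row.

    Gathering at position length - 1 needs 1 <= length <= row width.
    """
    bad = []
    for i, (row, length) in enumerate(zip(item_seq, item_seq_len)):
        if not 1 <= length <= len(row):
            bad.append(i)
    return bad


# Two padded sequences of width 5; the second reported length is far too large.
item_seq = [[5, 8, 2, 0, 0], [7, 0, 0, 0, 0]]
reported = [3, 3000]

recomputed = seq_lengths(item_seq)
bad_rows = invalid_lengths(item_seq, reported)
```

If `bad_rows` is non-empty, the reported lengths probably come from the wrong column or were not recomputed from the sequences, which would be consistent with an out of bounds index in `gather_indexes`.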

cramraj8 commented 1 year ago

I see. I guess the input passed to GRU4Rec in each mini-batch should be {item_seq, item_seq_len}. But I am getting shapes of [1, 84] and [1], respectively, and the item_seq_len value is [3000]. Something is definitely going wrong when the benchmark files are parsed for training and evaluation.

When running with benchmark files, what changes should we make to the config file and the Python script? Which columns should the .inter file provide? Right now I only give user_id, item_id, item_id_list, and item_seq_len.

Attached is my config file.

# dataset config : Sequential Recommendation
gpu_id: -1
data_path: /home/xxx/xxx/
benchmark_filename: ['train', 'dev', 'dev']
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
load_col:
    inter: [user_id, item_id, item_id_list, item_seq_len]
alias_of_item_id: ['item_id_list']
ITEM_LIST_LENGTH_FIELD: item_length
LIST_SUFFIX: _list
MAX_ITEM_LIST_LENGTH: 10

# model config
embedding_size: 64
hidden_size: 128
num_layers: 1
dropout_prob: 0.3
loss_type: 'CE'

# Training and evaluation config
epochs: 500
train_batch_size: 4096
eval_batch_size: 4096
train_neg_sample_args: ~
eval_args:
  split: {'LS': 'valid_and_test'}
  order: TO
  mode: full
metrics: ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk: 10
valid_metric: MRR@10
metric_decimal_place: 4

Sherry-XLL commented 1 year ago

Hello, @cramraj8!

We have provided the sample code for running session-based recommendation benchmarks in session_based_rec_example, and args.dataset can be one of diginetica-session, tmall-session and nowplaying-session.

# session-based recommendation benchmarks
diginetica-session: https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/DIGINETICA/session/diginetica_session.zip
tmall-session: https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Tmall/session/tmall_session.zip
nowplaying-session: https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Nowplaying/session/nowplaying_session.zip

You can refer to the sample code and our processed datasets for more details about the benchmark_file.
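When comparing a processed benchmark file with your own, one quick check is to read the header line and see which columns it declares. The sketch below is not part of RecBole: `parse_header` and `missing_columns` are hypothetical helpers, the tab separator is an assumption, and the header string is the one quoted earlier in this thread rather than the contents of any downloaded dataset.

```python
from typing import Dict, List


def parse_header(header_line: str, field_sep: str = "\t") -> Dict[str, str]:
    """Map each column name to its declared type, e.g. 'item_id_list' -> 'token_seq'."""
    fields = {}
    for column in header_line.rstrip("\n").split(field_sep):
        name, _, field_type = column.partition(":")
        fields[name] = field_type
    return fields


def missing_columns(header_line: str, load_col: List[str]) -> List[str]:
    """List the columns requested in load_col that the header does not declare."""
    present = parse_header(header_line)
    return [name for name in load_col if name not in present]


# Header from the example earlier in the thread; load_col from the attached config.
header = "user_id:token\titem_id_list:token_seq\titem_id:token"
fields = parse_header(header)
missing = missing_columns(header, ["user_id", "item_id", "item_id_list", "item_seq_len"])
```

In practice the header would be read from the first line of the file; a non-empty `missing` list points to columns that the config asks for but the file does not provide.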

AbdElrahmanMostafaRifaat1432 commented 1 year ago

Can anyone here help me with my problem in #1670? Any help would be appreciated.