Open cramraj8 opened 1 year ago
@cramraj8 Hello! Thanks for your attention to RecBole!
As we all know, sequential recommendation requires loading user interaction sequences. In its general mode, RecBole processes the raw data and builds the user interaction sequences automatically. However, if the dataset is organized as benchmark files, RecBole will not process the data, which means you need to preprocess the dataset yourself to produce the interaction sequences. That is, the benchmark files you provide should contain each user's interaction sequence, like:
user_id:token item_id_list:token_seq item_id:token
0 0 1 2 3 4 5
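To make the expected layout concrete, here is a minimal sketch that parses one header line and one data row of such a file. It assumes that fields are tab-separated and that items inside a `token_seq` field are space-separated, and it reads the example row as user `0`, history `0 1 2 3 4`, target item `5`. The helper names `parse_header` and `parse_row` are hypothetical and are not part of RecBole.

```python
# Assumption: fields are separated by tabs, sequence items by single spaces.
FIELD_SEP = "\t"
SEQ_SEP = " "


def parse_header(header_line):
    """Split a header such as 'user_id:token' into (name, type) pairs."""
    fields = []
    for column in header_line.rstrip("\n").split(FIELD_SEP):
        name, field_type = column.split(":")
        fields.append((name, field_type))
    return fields


def parse_row(row_line, fields):
    """Map one data row onto the header fields; *_seq fields become lists."""
    values = row_line.rstrip("\n").split(FIELD_SEP)
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    record = {}
    for (name, field_type), value in zip(fields, values):
        if field_type.endswith("_seq"):
            record[name] = value.split(SEQ_SEP) if value else []
        else:
            record[name] = value
    return record


header = "user_id:token\titem_id_list:token_seq\titem_id:token"
row = "0\t0 1 2 3 4\t5"

fields = parse_header(header)
record = parse_row(row, fields)
print(record)
```

Running a check like this over your own files before training is a quick way to confirm that every row has the same number of fields as the header.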
BTW, in benchmark mode, we also won't do data augmentation on the sequences. If you need this functionality, you can preprocess the dataset yourself, like:
user_id:token item_id_list:token_seq item_id:token
0 0 1 2 3 4 5
0 0 1 2 3 4
0 0 1 2 3
...
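The prefix-style augmentation shown above could be sketched as follows. This is only an illustration under stated assumptions: each row is (user, history, target), fields are tab-separated, sequence items are space-separated, and the function names `augment_prefixes` and `to_inter_lines` are hypothetical helpers rather than RecBole APIs.

```python
def augment_prefixes(user_id, items):
    """Build one (user_id, history, target) row per prefix, longest first.

    Every row keeps at least one item in the history.
    """
    rows = []
    for end in range(len(items) - 1, 0, -1):
        rows.append((user_id, items[:end], items[end]))
    return rows


def to_inter_lines(rows, field_sep="\t", seq_sep=" "):
    """Render rows as text lines, starting with the header line."""
    header = ["user_id:token", "item_id_list:token_seq", "item_id:token"]
    lines = [field_sep.join(header)]
    for user_id, history, target in rows:
        history_text = seq_sep.join(str(item) for item in history)
        lines.append(field_sep.join([str(user_id), history_text, str(target)]))
    return lines


rows = augment_prefixes(0, [0, 1, 2, 3, 4, 5])
lines = to_inter_lines(rows)
for line in lines:
    print(line)
```

Whether to augment the validation and test files as well is a separate choice; the sketch only shows how the rows are generated.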
@leoleojie Thanks for the explanation. It really helped. However, after doing so I am still getting an error inside the model. It is an index-out-of-bounds error:
Traceback (most recent call last):
  File "run.py", line 3, in <module>
    run_recbole(model='GRU4Rec', dataset='mind_small_benchmarkfile_small', config_file_list=['config.yaml'])
  File "/home/xxx/xxx/RecBole/RecBole/recbole/quick_start/quick_start.py", line 81, in run_recbole
    flops = get_flops(model, dataset, config["device"], logger, transform)
  File "/home/xxx/xxx/RecBole/RecBole/recbole/utils/utils.py", line 345, in get_flops
    wrapper(*inputs)
  File "/opt/conda/miniconda/envs/paper/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/xxx/RecBole/RecBole/recbole/utils/utils.py", line 286, in forward
    return self.model.predict(interaction)
  File "/home/xxx/xxx/RecBole/RecBole/recbole/model/sequential_recommender/gru4rec.py", line 110, in predict
    seq_output = self.forward(item_seq, item_seq_len)
  File "/home/xxx/xxx/RecBole/RecBole/recbole/model/sequential_recommender/gru4rec.py", line 84, in forward
    seq_output = self.gather_indexes(gru_output, item_seq_len - 1)
  File "/home/xxx/xxx/RecBole/RecBole/recbole/model/abstract_recommender.py", line 174, in gather_indexes
    output_tensor = output.gather(dim=1, index=gather_index)
RuntimeError: index 2999 is out of bounds for dimension 1 with size 84
Do you have any idea why this error is happening? I also set _alias_of_item_id: ['item_id_list']_ in the config file. Thanks for your reply!
The bug happened at line 84 of gru4rec.py:
seq_output = self.gather_indexes(gru_output, item_seq_len - 1)
It seems that the item_seq_len you passed in is not the actual length of the interaction sequences. Under normal conditions, when RecBole reads the benchmark files, item_seq_len is calculated automatically as
self.inter_feat[self.item_list_length_field] = self.inter_feat[self.item_id_list_field].agg(len)
So something may still be wrong in how the benchmark files are processed. Could you please print the interaction sequence item_seq = interaction[self.ITEM_SEQ] and item_seq_len = interaction[self.ITEM_SEQ_LEN], and check whether they are consistent with each other?
Another simple way is to calculate item_seq_len yourself as item_seq_len = (item_seq != 0).sum(1), and check whether the bug still exists.
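The consistency check suggested above can be tried outside the model with plain Python lists. The sketch below is a pure-Python stand-in for `(item_seq != 0).sum(1)`; it assumes that `0` is the padding ID and that every row is padded to the same width. The function names and the sample data are made up for illustration.

```python
def nonzero_lengths(item_seq):
    """Count non-padding (non-zero) entries in each padded row."""
    return [sum(1 for item in row if item != 0) for row in item_seq]


def find_bad_lengths(item_seq, item_seq_len):
    """Return (row_index, reported_len, recomputed_len) for suspicious rows.

    A row is suspicious if its reported length disagrees with the recomputed
    one, or if it falls outside 1..padded_width, in which case indexing
    position reported_len - 1 would be out of bounds.
    """
    problems = []
    recomputed = nonzero_lengths(item_seq)
    for index, (row, reported, actual) in enumerate(
        zip(item_seq, item_seq_len, recomputed)
    ):
        out_of_bounds = not (1 <= reported <= len(row))
        if out_of_bounds or reported != actual:
            problems.append((index, reported, actual))
    return problems


# Hypothetical batch: two sequences padded with zeros to width 5.
item_seq = [[7, 3, 9, 0, 0], [4, 0, 0, 0, 0]]
# The second reported length is far larger than the padded width.
item_seq_len = [3, 3000]

problems = find_bad_lengths(item_seq, item_seq_len)
print(problems)  # [(1, 3000, 1)]
```

A reported length larger than the padded width is exactly the situation in which gathering at position `item_seq_len - 1` fails with an out-of-bounds index.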
I see. I guess the input passed to GRU4Rec should be {item_seq, item_seq_len} for each mini-batch. But I am getting shapes of [1, 84] and [1], respectively. The item_seq_len value is [3000]. Something is definitely going wrong when the benchmark files are parsed for training and evaluation.
For the benchmark-file scenario, what changes should we make to the config file and the Python script? Which columns should the .inter file provide? (Right now I only give user_id, item_id, item_id_list, and item_seq_len.)
Attached is my config file.
# dataset config : Sequential Recommendation
gpu_id: -1
data_path: /home/xxx/xxx/
benchmark_filename: ['train', 'dev', 'dev']
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
load_col:
  inter: [user_id, item_id, item_id_list, item_seq_len]
alias_of_item_id: ['item_id_list']
ITEM_LIST_LENGTH_FIELD: item_length
LIST_SUFFIX: _list
MAX_ITEM_LIST_LENGTH: 10
# model config
embedding_size: 64
hidden_size: 128
num_layers: 1
dropout_prob: 0.3
loss_type: 'CE'
# Training and evaluation config
epochs: 500
train_batch_size: 4096
eval_batch_size: 4096
train_neg_sample_args: ~
eval_args:
  split: {'LS': 'valid_and_test'}
  order: TO
  mode: full
metrics: ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk: 10
valid_metric: MRR@10
metric_decimal_place: 4
Hello, @cramraj8!
We have provided sample code for running session-based recommendation benchmarks in session_based_rec_example, and args.dataset can be one of diginetica-session, tmall-session and nowplaying-session.
# session-based recommendation benchmarks
diginetica-session: https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/DIGINETICA/session/diginetica_session.zip
tmall-session: https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Tmall/session/tmall_session.zip
nowplaying-session: https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Nowplaying/session/nowplaying_session.zip
You can refer to the sample code and our processed datasets for more details about the benchmark_file.
Can anyone here help me with my problem in #1670? Any help would be appreciated.
I am trying to load my separate train, validation, and test files using the benchmark_filename parameter in the config file. However, I end up with an AttributeError.
When I looked for the variable item_id_list_field, I found that it was not initialized. I wonder what its default value is.