RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

[🐛BUG] timestamp field not in interaction for `ml-1m` #1134

Closed tzuhsial closed 2 years ago

tzuhsial commented 2 years ago

Describe the bug

Timestamp field not loaded in ml-1m dataset.

The same command (same model and loss_type) works for ml-100k, so I expected it to work after only changing the dataset to ml-1m.

To Reproduce recbole.__version__ = '1.0.0'

python run_recbole.py --dataset=ml-1m --model=SASRec --loss_type=BPR

Expected behavior no error, just like python run_recbole.py --dataset=ml-100k --model=SASRec --loss_type=BPR

Screenshots

Remain Fields: ['user_id', 'item_id']
Traceback (most recent call last):
  File "run_recbole.py", line 25, in <module>
    run_recbole(model=args.model, dataset=args.dataset, config_file_list=config_file_list)
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/quick_start/quick_start.py", line 47, in run_recbole
    train_data, valid_data, test_data = data_preparation(config, dataset)
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/data/utils.py", line 99, in data_preparation
    built_datasets = dataset.build()
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/data/dataset/sequential_dataset.py", line 194, in build
    return super().build()
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/data/dataset/dataset.py", line 1466, in build
    self._change_feat_format()
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/data/dataset/sequential_dataset.py", line 49, in _change_feat_format
    self.data_augmentation()
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/data/dataset/sequential_dataset.py", line 96, in data_augmentation
    self.sort(by=[self.uid_field, self.time_field], ascending=True)
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/data/dataset/dataset.py", line 1457, in sort
    self.inter_feat.sort(by=by, ascending=ascending)
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/data/interaction.py", line 304, in sort
    raise ValueError(f'[{b}] is not exist in interaction [{self}].')
ValueError: [timestamp] is not exist in interaction [The batch_size of interaction: 1000209
    user_id, torch.Size([1000209]), cpu, torch.int64
    item_id, torch.Size([1000209]), cpu, torch.int64

].

Wicknight commented 2 years ago

@tzuhsial Hello! Thanks for your interest in RecBole! From your description, I guess this happens because you did not provide a configuration file (YAML). Without one, RecBole falls back to the default configuration, which does not necessarily fit your task; here, it does not load the timestamp field. You should create a configuration file for the current task and set the 'load_col' parameter so that the timestamp field is loaded, as shown below:

load_col: 
    inter: ['user_id', 'item_id', 'rating', 'timestamp']

The official documentation describes this parameter. The ml-100k dataset works because we ship a special default configuration file for it, in which 'load_col' is already set. You can find these default configuration files in recbole/properties/quick_start_config.

tzuhsial commented 2 years ago

Hi @Wicknight, thanks for the fast response!

After setting the load_col parameters, I received another error.

Here's my run.py

from recbole.quick_start import run_recbole

parameter_dict = {
   'load_col': ['user_id', 'item_id', 'rating', 'timestamp'],
   'eval_args': {'mode': 'uni100', 'distribution': 'uniform', 'order': 'TO'},
   'loss_type': 'BPR'
}
run_recbole(model='GRU4Rec', dataset='ml-1m', config_dict=parameter_dict)

And the error

Traceback (most recent call last):
  File "run.py", line 8, in <module>
    run_recbole(model='GRU4Rec', dataset='ml-1m', config_dict=parameter_dict)
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/quick_start/quick_start.py", line 41, in run_recbole
    dataset = create_dataset(config)
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/data/utils.py", line 41, in create_dataset
    return SequentialDataset(config)
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/data/dataset/sequential_dataset.py", line 36, in __init__
    super().__init__(config)
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/data/dataset/dataset.py", line 96, in __init__
    self._from_scratch()
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/data/dataset/dataset.py", line 108, in _from_scratch
    self._data_processing()
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/data/dataset/dataset.py", line 151, in _data_processing
    self._data_filtering()
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/data/dataset/dataset.py", line 173, in _data_filtering
    self._filter_nan_user_or_item()
  File "/local/home/tzuhsial/hoverboard-workspaces/src/RecBole/recbole/data/dataset/dataset.py", line 634, in _filter_nan_user_or_item
    dropped_inter = self.inter_feat.index[self.inter_feat[field].isnull()]
AttributeError: 'NoneType' object has no attribute 'index'

tzuhsial commented 2 years ago

Encountered the same issue after preparing my own .inter file.

Headers: user_id:token item_id:token timestamp:float
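The names that `load_col` can refer to are the parts before the colon in the `.inter` header. A minimal sketch for listing them, assuming the header line is tab-separated (believed to be RecBole's default `field_separator`, so adjust `sep` if your file differs); `inter_field_names` is a hypothetical helper, not part of RecBole:

```python
def inter_field_names(header_line, sep="\t"):
    """Return the field names from a `name:type` header line of a .inter file."""
    names = []
    for column in header_line.rstrip("\r\n").split(sep):
        # Each column looks like "user_id:token"; keep only the name part.
        name, _, _type = column.partition(":")
        names.append(name)
    return names


header = "user_id:token\titem_id:token\ttimestamp:float"
fields = inter_field_names(header)
print(fields)  # ['user_id', 'item_id', 'timestamp']
# A sequential model sorts interactions by the time field, so it must be listed.
print("timestamp" in fields)  # True
```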

tzuhsial commented 2 years ago

Found the config error: load_col should be specified with an inter key.

Specifying

'load_col': {'inter':  ['user_id', 'item_id', 'rating', 'timestamp']},

worked for me.
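The shape of `load_col` can be checked before RecBole is called, which turns the `'NoneType' object has no attribute 'index'` failure into a readable message. A minimal sketch, assuming `load_col` must be a dict keyed by file suffix and that the time field is named `timestamp`; `check_load_col` is a hypothetical helper, not a RecBole function:

```python
def check_load_col(load_col, time_field="timestamp"):
    """Fail early with a readable message if load_col has the wrong shape."""
    if not isinstance(load_col, dict):
        raise TypeError(
            "load_col must be a dict keyed by file suffix, "
            "e.g. {'inter': ['user_id', 'item_id', 'timestamp']}"
        )
    if time_field not in load_col.get("inter", []):
        raise ValueError(f"'{time_field}' is missing from load_col['inter']")
    return load_col


# Same dictionary as in run.py above, with only load_col corrected.
parameter_dict = {
    "load_col": {"inter": ["user_id", "item_id", "rating", "timestamp"]},
    "eval_args": {"mode": "uni100", "distribution": "uniform", "order": "TO"},
    "loss_type": "BPR",
}
check_load_col(parameter_dict["load_col"])  # passes silently

try:
    # The flat list from the first attempt is rejected.
    check_load_col(["user_id", "item_id", "rating", "timestamp"])
except TypeError as err:
    print(err)
```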

tzuhsial commented 2 years ago

I may have missed it in the code or documentation, but I would appreciate it if the authors could provide a full config (either a Python script or a YAML file) that includes all existing parameters.

My main friction so far has been with config errors. FYI @Wicknight

Wicknight commented 2 years ago

Hello! As in your code, parameters can be configured by passing a dictionary in the code. They can also be set by reading an external configuration file. The setting I mentioned earlier was written for an external configuration file, for example:

# general
gpu_id: 0
use_gpu: True
seed: 2020
state: INFO
reproducibility: True
data_path: 'dataset/'
checkpoint_dir: 'saved'
show_progress: True
save_dataset: False
save_dataloaders: False

# training settings
epochs: 300
train_batch_size: 2048
learner: adam
learning_rate: 0.001
neg_sampling:
  uniform: 1
eval_step: 1
stopping_step: 10
clip_grad_norm: ~
# clip_grad_norm:  {'max_norm': 5, 'norm_type': 2}
weight_decay: 0.0
require_pow: False
load_col: 
    inter: ['user_id', 'item_id', 'rating', 'timestamp']

# evaluation settings
eval_args: 
  split: {'RS':[0.8,0.1,0.1]}
  group_by: user
  order: RO
  mode: full
repeatable: False
metrics: ["Recall","MRR","NDCG","Hit","Precision"]
topk: [10]
valid_metric: MRR@10
valid_metric_bigger: True
eval_batch_size: 4096
loss_decimal_place: 4
metric_decimal_place: 4

Suppose it is saved as "exam.yaml". We can then use it with the following code:

from recbole.quick_start import run_recbole

config_file_list = ['exam.yaml']
run_recbole(model='GRU4Rec', dataset='ml-1m', config_file_list=config_file_list)
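The dictionary and the YAML file can also be used together. A minimal sketch, assuming RecBole fills any parameter missing from the YAML file with its defaults and merges `config_dict` entries with those from the file; the file name `seq_min.yaml` and the `train` wrapper are made up for illustration:

```python
from pathlib import Path

# Only the setting that differs from the defaults: load the timestamp column.
MINIMAL_YAML = """\
load_col:
    inter: ['user_id', 'item_id', 'rating', 'timestamp']
"""

config_path = Path("seq_min.yaml")  # hypothetical file name
config_path.write_text(MINIMAL_YAML, encoding="utf-8")


def train():
    # Imported here so the file-writing part runs without RecBole installed.
    from recbole.quick_start import run_recbole

    return run_recbole(
        model="GRU4Rec",
        dataset="ml-1m",
        config_file_list=[str(config_path)],
        config_dict={"loss_type": "BPR"},  # same loss_type as in run.py above
    )
```

Calling `train()` starts the run; nothing is trained when the block is merely loaded.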

tzuhsial commented 2 years ago

Thanks, YAML configs are great! Let me try them out :)

banmeng123 commented 1 year ago

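After setting things up with the configuration above and running config_file_list = ['exam.yaml'] run_recbole(model='GRU4Rec', dataset='ml-1m', config_file_list=config_file_list), I keep getting the following error: [image: 1681641073(1)]

Even with the configuration from the sequential model quick-start guide on CSDN, the same error occurs. What is causing this? Any help would be appreciated.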

banmeng123 commented 1 year ago

In addition, I also get the following error message: [image: 1681641732(1)] I am using the ml-100k and ml-1m.inter datasets, and both produce this error.