RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

LightGCN for sequential recommendation #1228

Closed: pintonos closed this issue 2 years ago

pintonos commented 2 years ago

Hi, I am trying to apply LightGCN to a sequential recommendation task with pre-split benchmark dataset files, so I set config['MODEL_TYPE'] = ModelType.SEQUENTIAL, as I thought this would do the trick. Since this model should use BPR and negative sampling (as recommended), the dataloader tries to sample negative items, but the sampler is None:

File "C:\Users\...\.virtualenvs\humrecsys-Aky0O99T\lib\site-packages\recbole\data\dataloader\abstract_dataloader.py", line 151, in _neg_sampling
    neg_item_ids = self.sampler.sample_by_user_ids(user_ids, item_ids, self.neg_sample_num)
AttributeError: 'NoneType' object has no attribute 'sample_by_user_ids'

Do you know how to use LightGCN correctly as a sequential model? Thanks!
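
For reference, the override I mean is roughly the following (the dataset name is just a placeholder for my pre-split benchmark files):

from recbole.quick_start import run_recbole
from recbole.utils import ModelType

# Rough sketch: forcing the sequential model type on a general model via config_dict.
run_recbole(
    model='LightGCN',
    dataset='my_benchmark_dataset',  # placeholder dataset with train/valid/test benchmark files
    config_dict={'MODEL_TYPE': ModelType.SEQUENTIAL},
)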

Ethan-TZ commented 2 years ago

@pintonos Thanks for your attention to RecBole! Could you provide your configuration file and dataset? The problem may be caused by a mismatched config file. Below is an example config file for the MovieLens dataset. Btw, LightGCN is a general recommendation model, so we do not support using it for sequential recommendation tasks; if you want to do this, you would have to implement a sequential variant of LightGCN yourself.

# dataset config
field_separator: "\t"
seq_separator: " "
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
RATING_FIELD: rating
TIME_FIELD: timestamp
NEG_PREFIX: neg_
use_gpu: False
ITEM_LIST_LENGTH_FIELD: item_length
LIST_SUFFIX: _list
MAX_ITEM_LIST_LENGTH: 50
POSITION_FIELD: position_id
load_col:
  inter: [user_id, item_id, rating, timestamp]
val_interval:
  rating: "[3,inf)"

# training and evaluation
epochs: 500
train_batch_size: 4096
eval_batch_size: 2000
valid_metric: recall@10
eval_args:
  split: {'LS': 'valid_and_test'}
  mode: full
  order: TO

# model
embedding_size: 64 
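
A minimal sketch of how a config file like the one above can be plugged in programmatically (the yaml file name is just an example, and GRU4Rec stands in for any sequential model):

from recbole.quick_start import run_recbole

# 'seq_movielens.yaml' is a placeholder name for the config above.
run_recbole(
    model='GRU4Rec',
    dataset='ml-100k',
    config_file_list=['seq_movielens.yaml'],
)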

pintonos commented 2 years ago

It seems it is not possible to use BPR and negative sampling with sequential models on custom benchmark datasets? I tried GRU4Rec with BPR and negative sampling, but it only worked when no benchmark dataset files were given:

model: GRU4Rec
dataset: diginetica

data_path: "../data/"

#benchmark_filename: ['train', 'valid', 'test']

alias_of_item_id: [item_id_list]

USER_ID_FIELD: session_id
ITEM_ID_FIELD: item_id
TIME_FIELD: timestamp

load_col:
    inter: [session_id, item_id, timestamp, item_id_list]
    item: [item_id, category_id, name_tokens, pricelog2]

neg_sampling: {'uniform': 1}
loss_type: BPR

Dataset looks like this:

session_id:token    item_id:token    timestamp:float    item_id_list:token_seq    session_id_original:token
0                   911              1462098192         910                       584455
1                   6469             1462941978         6470                      584464
...

Ethan-TZ commented 2 years ago

@pintonos Hi, I tried to run with the same configuration as yours, and neg_sampling worked. Please check in a debugger whether the program reaches the following function: https://github.com/RUCAIBox/RecBole/blob/4f761699454fd5d8873607173d2514d38ac24031/recbole/data/dataloader/abstract_dataloader.py#L182
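
If it is easier, something along these lines can be used to check whether the train dataloader actually receives a sampler (replace the yaml file name with your own config file):

from recbole.config import Config
from recbole.data import create_dataset, data_preparation

# Build the dataset and dataloaders the same way run_recbole does.
config = Config(model='GRU4Rec', dataset='diginetica', config_file_list=['test.yaml'])
dataset = create_dataset(config)
train_data, valid_data, test_data = data_preparation(config, dataset)

# This is the sampler used by _neg_sampling; it should not be None.
print(type(train_data).__name__, train_data.sampler)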

pintonos commented 2 years ago

@chenyuwuxin Hi, the error is raised inside that function, def _neg_sampling(self, inter_feat):

File "C:\Users\...\.virtualenvs\humrecsys-Aky0O99T\lib\site-packages\recbole\data\dataloader\abstract_dataloader.py", line 151, in _neg_sampling
    neg_item_ids = self.sampler.sample_by_user_ids(user_ids, item_ids, self.neg_sample_num)
AttributeError: 'NoneType' object has no attribute 'sample_by_user_ids'

According to the error message, the sampler is None, but only when I am using benchmark files, as in this config:

model: GRU4Rec
dataset: diginetica

data_path: "../data/"

benchmark_filename: ['train', 'valid', 'test']

alias_of_item_id: [item_id_list]

USER_ID_FIELD: session_id
ITEM_ID_FIELD: item_id
TIME_FIELD: timestamp

load_col:
    inter: [session_id, item_id, timestamp, item_id_list]
    item: [item_id, category_id, name_tokens, pricelog2]
neg_sampling: {'uniform': 1}
loss_type: BPR

Ethan-TZ commented 2 years ago

@pintonos Can you provide the runtime log? An example log for ml-100k when using benchmark files is as follows:

07 Apr 16:41    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 2020
state = INFO
reproducibility = True
data_path = dataset/ml-100ks
show_progress = True
save_dataset = False
save_dataloaders = False
benchmark_filename = ['train', 'val', 'test']

Training Hyper Parameters:
checkpoint_dir = saved
epochs = 300
train_batch_size = 2048
learner = adam
learning_rate = 0.001
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'LS': 'valid_and_test'}, 'order': 'TO', 'mode': 'full', 'group_by': 'user'}
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [10]
valid_metric = MRR@10
valid_metric_bigger = True
eval_batch_size = 4096
metric_decimal_place = 4

Dataset Hyper Parameters:
field_separator =       
seq_separator =  
USER_ID_FIELD = user_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = timestamp
seq_len = None
LABEL_FIELD = label
threshold = None
NEG_PREFIX = neg_
load_col = {'inter': ['user_id', 'item_id', 'timestamp', 'item_id_list']}
unload_col = None
unused_col = None
additional_feat_suffix = None
rm_dup_inter = None
val_interval = None
filter_inter_by_user_or_item = True
user_inter_num_interval = [0,inf)
item_inter_num_interval = [0,inf)
alias_of_user_id = None
alias_of_item_id = ['item_id_list']
alias_of_entity_id = None
alias_of_relation_id = None
preload_weight = None
normalize_field = None
normalize_all = None
ITEM_LIST_LENGTH_FIELD = item_length
LIST_SUFFIX = _list
MAX_ITEM_LIST_LENGTH = 50
POSITION_FIELD = position_id
HEAD_ENTITY_ID_FIELD = head_id
TAIL_ENTITY_ID_FIELD = tail_id
RELATION_ID_FIELD = relation_id
ENTITY_ID_FIELD = entity_id

Other Hyper Parameters:
neg_sampling = {'uniform': 2}
repeatable = True
embedding_size = 64
hidden_size = 128
num_layers = 1
dropout_prob = 0.3
loss_type = BPR
MODEL_TYPE = ModelType.SEQUENTIAL
MODEL_INPUT_TYPE = InputType.PAIRWISE
eval_type = EvaluatorType.RANKING
device = cuda
train_neg_sample_args = {'strategy': 'by', 'by': 2, 'distribution': 'uniform'}
eval_neg_sample_args = {'strategy': 'full', 'distribution': 'uniform'} 

pintonos commented 2 years ago

Here is my log with GRU4Rec. Did you use a sequential model?

General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 1883
state = INFO
reproducibility = True
data_path = ../data/diginetica
show_progress = True
save_dataset = False
save_dataloaders = False
benchmark_filename = ['train', 'valid', 'test']

Training Hyper Parameters:
checkpoint_dir = saved
epochs = 300
train_batch_size = 512
learner = adam
learning_rate = 0.001
eval_step = 1
stopping_step = 5
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'LS': 'valid_and_test'}, 'order': 'TO', 'mode': 'full', 'group_by': 'user'}
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [10, 20]
valid_metric = MRR@20
valid_metric_bigger = True
eval_batch_size = 4096
metric_decimal_place = 4

Dataset Hyper Parameters:
field_separator =   
seq_separator =  
USER_ID_FIELD = session_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = timestamp
seq_len = None
LABEL_FIELD = label
threshold = None
NEG_PREFIX = neg_
load_col = {'inter': ['session_id', 'item_id', 'timestamp', 'item_id_list'], 'item': ['item_id', 'category_id', 'name_tokens', 'pricelog2']}
unload_col = None
unused_col = None
additional_feat_suffix = None
rm_dup_inter = None
val_interval = None
filter_inter_by_user_or_item = True
user_inter_num_interval = [0,inf)
item_inter_num_interval = [0,inf)
alias_of_user_id = None
alias_of_item_id = ['item_id_list']
alias_of_entity_id = None
alias_of_relation_id = None
preload_weight = None
normalize_field = None
normalize_all = None
ITEM_LIST_LENGTH_FIELD = item_length
LIST_SUFFIX = _list
MAX_ITEM_LIST_LENGTH = 50
POSITION_FIELD = position_id
HEAD_ENTITY_ID_FIELD = head_id
TAIL_ENTITY_ID_FIELD = tail_id
RELATION_ID_FIELD = relation_id
ENTITY_ID_FIELD = entity_id

Other Hyper Parameters: 
neg_sampling = {'uniform': 2}
repeatable = True
embedding_size = 128
hidden_size = 128
num_layers = 1
dropout_prob = 0.3
loss_type = BPR
MODEL_TYPE = ModelType.SEQUENTIAL
n_layers = 2
reg_weight = 1e-05
MODEL_INPUT_TYPE = InputType.PAIRWISE
eval_type = EvaluatorType.RANKING
device = cpu
train_neg_sample_args = {'strategy': 'by', 'by': 2, 'distribution': 'uniform'}
eval_neg_sample_args = {'strategy': 'full', 'distribution': 'uniform'}

07 Apr 13:43    INFO  diginetica
The number of users: 6389
Average actions of users: 1.0
The number of items: 6471
Average actions of items: 1.3505285412262156
The number of inters: 6388
The sparsity of the dataset: 99.9845488567303%
Remain Fields: ['session_id', 'item_id', 'timestamp', 'item_id_list', 'category_id', 'name_tokens', 'pricelog2', 'item_length']
07 Apr 13:43    INFO  GRU4Rec(
  (item_embedding): Embedding(6471, 128, padding_idx=0)
  (emb_dropout): Dropout(p=0.3, inplace=False)
  (gru_layers): GRU(128, 128, bias=False, batch_first=True)
  (dense): Linear(in_features=128, out_features=128, bias=True)
  (loss_fct): BPRLoss()
)

Ethan-TZ commented 2 years ago

@pintonos Yes. I tried it, but there was no problem. Maybe something in your Python environment went wrong when you modified something, so I suggest installing the latest version of RecBole from source. Here is my reproduction environment. Dataset: diginetica (you can split it into train, valid and test after adding the field 'item_id_list:token_seq'). Yaml:

model: GRU4Rec

benchmark_filename: ['train', 'valid', 'test']

alias_of_item_id: [item_id_list]

USER_ID_FIELD: session_id
ITEM_ID_FIELD: item_id
TIME_FIELD: timestamp

load_col:
    inter: [session_id, item_id, timestamp, item_id_list]
    item: [item_id, item_category, item_name, item_priceLog2]

neg_sampling: {'uniform': 2}
loss_type: BPR

And command: python run_recbole.py --model=GRU4Rec --dataset=diginetica --config_files=test.yaml
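
If it helps, with benchmark_filename: ['train', 'valid', 'test'] the pre-split files are assumed to follow the usual <dataset>.<suffix>.inter naming. A quick sanity check of that layout (the data_path is only an example):

import os

# Assumed layout for the pre-split benchmark files; adjust data_path to your setup.
data_path = "dataset/diginetica"
for suffix in ["train", "valid", "test"]:
    path = os.path.join(data_path, f"diginetica.{suffix}.inter")
    print(path, "exists:", os.path.isfile(path))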