Closed pintonos closed 2 years ago
@pintonos Thanks for your interest in RecBole! Can you provide your configuration file and dataset? The problem may be caused by a mismatched config file. Here is an example config file for the MovieLens dataset. By the way, we do not support using LightGCN, which is a general model, for sequential recommendation tasks. If you want to do this, you will need to implement a sequential variant of LightGCN yourself.
# dataset config
field_separator: "\t"
seq_separator: " "
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
RATING_FIELD: rating
TIME_FIELD: timestamp
NEG_PREFIX: neg_
use_gpu: False
ITEM_LIST_LENGTH_FIELD: item_length
LIST_SUFFIX: _list
MAX_ITEM_LIST_LENGTH: 50
POSITION_FIELD: position_id
load_col:
  inter: [user_id, item_id, rating, timestamp]
val_interval:
  rating: "[3,inf)"
# training and evaluation
epochs: 500
train_batch_size: 4096
eval_batch_size: 2000
valid_metric: recall@10
eval_args:
  split: {'LS': 'valid_and_test'}
  mode: full
  order: TO
# model
embedding_size: 64
It seems that BPR with negative sampling cannot be used with sequential models on custom benchmark datasets? I tried GRU4Rec with BPR and negative sampling, but it only worked when no benchmark dataset was given:
model: GRU4Rec
dataset: diginetica
data_path: "../data/"
#benchmark_filename: ['train', 'valid', 'test']
alias_of_item_id: [item_id_list]
USER_ID_FIELD: session_id
ITEM_ID_FIELD: item_id
TIME_FIELD: timestamp
load_col:
  inter: [session_id, item_id, timestamp, item_id_list]
  item: [item_id, category_id, name_tokens, pricelog2]
neg_sampling: {'uniform': 1}
loss_type: BPR
Dataset looks like this:
session_id:token item_id:token timestamp:float item_id_list:token_seq session_id_original:token
0 911 1462098192 910 584455
1 6469 1462941978 6470 584464
...
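For readers unfamiliar with this layout, the header row pairs each column name with a type after a colon. A minimal sketch of reading such a file in plain Python follows. It assumes tab-separated columns and space-separated sequence values (as in the field_separator and seq_separator settings of the example config earlier in the thread); the helper names parse_header and parse_row are hypothetical and are not part of RecBole.

```python
from typing import Dict, List, Tuple


def parse_header(header_line: str, sep: str = "\t") -> List[Tuple[str, str]]:
    """Split 'name:type' column specs into (name, type) pairs."""
    fields = []
    for spec in header_line.rstrip("\n").split(sep):
        name, _, field_type = spec.partition(":")
        fields.append((name, field_type))
    return fields


def parse_row(row_line: str, fields: List[Tuple[str, str]],
              sep: str = "\t", seq_sep: str = " ") -> Dict[str, object]:
    """Convert one data row into a dict keyed by column name."""
    values = row_line.rstrip("\n").split(sep)
    record: Dict[str, object] = {}
    for (name, field_type), raw in zip(fields, values):
        if field_type.endswith("_seq"):
            # Sequence columns hold several values joined by seq_sep.
            record[name] = raw.split(seq_sep) if raw else []
        elif field_type == "float":
            record[name] = float(raw)
        else:
            # Token columns are kept as strings.
            record[name] = raw
    return record


# Header and first row from the post, with tabs assumed as the separator.
header = ("session_id:token\titem_id:token\ttimestamp:float\t"
          "item_id_list:token_seq\tsession_id_original:token")
row = "0\t911\t1462098192\t910\t584455"

fields = parse_header(header)
record = parse_row(row, fields)
```

Here item_id_list is the only sequence column, so it is the one that ends up as a list.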
@pintonos Hi, I ran the same configuration as yours and found that neg_sampling worked. Please debug whether your program reaches the following function: https://github.com/RUCAIBox/RecBole/blob/4f761699454fd5d8873607173d2514d38ac24031/recbole/data/dataloader/abstract_dataloader.py#L182
@chenyuwuxin Hi, the error is already in this function:
def _neg_sampling(self, inter_feat):
  File "C:\Users\...\.virtualenvs\humrecsys-Aky0O99T\lib\site-packages\recbole\data\dataloader\abstract_dataloader.py", line 151, in _neg_sampling
    neg_item_ids = self.sampler.sample_by_user_ids(user_ids, item_ids, self.neg_sample_num)
AttributeError: 'NoneType' object has no attribute 'sample_by_user_ids'
According to the error message, the sampler is None. This happens only when I use benchmark files, as in this config:
model: GRU4Rec
dataset: diginetica
data_path: "../data/"
benchmark_filename: ['train', 'valid', 'test']
alias_of_item_id: [item_id_list]
USER_ID_FIELD: session_id
ITEM_ID_FIELD: item_id
TIME_FIELD: timestamp
load_col:
  inter: [session_id, item_id, timestamp, item_id_list]
  item: [item_id, category_id, name_tokens, pricelog2]
neg_sampling: {'uniform': 1}
loss_type: BPR
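The AttributeError in the traceback above comes from calling sample_by_user_ids on a sampler that is None. A minimal sketch of that failure mode, with an explicit guard that produces a clearer message, might look like the following. ToyLoader and ToySampler are hypothetical stand-ins and not RecBole classes; only the method name sample_by_user_ids and the attribute neg_sample_num are taken from the traceback.

```python
from typing import List, Optional


class ToySampler:
    """Hypothetical sampler using a deterministic toy rule (not real sampling)."""

    def __init__(self, n_items: int) -> None:
        self.n_items = n_items

    def sample_by_user_ids(self, user_ids: List[int], item_ids: List[int],
                           num: int) -> List[int]:
        # For each positive item i, return (i + k) % n_items for k = 1..num.
        # These differ from i as long as num < n_items.
        negatives = []
        for item in item_ids:
            for k in range(1, num + 1):
                negatives.append((item + k) % self.n_items)
        return negatives


class ToyLoader:
    """Hypothetical dataloader that may have been built without a sampler."""

    def __init__(self, sampler: Optional[ToySampler] = None,
                 neg_sample_num: int = 1) -> None:
        self.sampler = sampler
        self.neg_sample_num = neg_sample_num

    def neg_sampling(self, user_ids: List[int], item_ids: List[int]) -> List[int]:
        # Fail early with a descriptive error instead of an AttributeError.
        if self.sampler is None:
            raise ValueError(
                "negative sampling was requested, but no sampler is attached "
                "to this dataloader"
            )
        return self.sampler.sample_by_user_ids(
            user_ids, item_ids, self.neg_sample_num
        )


# Without a sampler, the guard reports the actual problem.
try:
    ToyLoader(sampler=None).neg_sampling([1], [2])
    message = ""
except ValueError as err:
    message = str(err)

# With a sampler attached, negatives are returned as usual.
negatives = ToyLoader(sampler=ToySampler(n_items=5)).neg_sampling([1], [2])
```

When debugging the real library, the equivalent check would be to inspect whether the training dataloader's sampler attribute is set before the first batch is drawn.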
@pintonos Can you provide the runtime log? An example log for ml-100k when using benchmark files is as follows:
07 Apr 16:41 INFO
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 2020
state = INFO
reproducibility = True
data_path = dataset/ml-100ks
show_progress = True
save_dataset = False
save_dataloaders = False
benchmark_filename = ['train', 'val', 'test']
Training Hyper Parameters:
checkpoint_dir = saved
epochs = 300
train_batch_size = 2048
learner = adam
learning_rate = 0.001
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4
Evaluation Hyper Parameters:
eval_args = {'split': {'LS': 'valid_and_test'}, 'order': 'TO', 'mode': 'full', 'group_by': 'user'}
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [10]
valid_metric = MRR@10
valid_metric_bigger = True
eval_batch_size = 4096
metric_decimal_place = 4
Dataset Hyper Parameters:
field_separator =
seq_separator =
USER_ID_FIELD = user_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = timestamp
seq_len = None
LABEL_FIELD = label
threshold = None
NEG_PREFIX = neg_
load_col = {'inter': ['user_id', 'item_id', 'timestamp', 'item_id_list']}
unload_col = None
unused_col = None
additional_feat_suffix = None
rm_dup_inter = None
val_interval = None
filter_inter_by_user_or_item = True
user_inter_num_interval = [0,inf)
item_inter_num_interval = [0,inf)
alias_of_user_id = None
alias_of_item_id = ['item_id_list']
alias_of_entity_id = None
alias_of_relation_id = None
preload_weight = None
normalize_field = None
normalize_all = None
ITEM_LIST_LENGTH_FIELD = item_length
LIST_SUFFIX = _list
MAX_ITEM_LIST_LENGTH = 50
POSITION_FIELD = position_id
HEAD_ENTITY_ID_FIELD = head_id
TAIL_ENTITY_ID_FIELD = tail_id
RELATION_ID_FIELD = relation_id
ENTITY_ID_FIELD = entity_id
Other Hyper Parameters:
neg_sampling = {'uniform': 2}
repeatable = True
embedding_size = 64
hidden_size = 128
num_layers = 1
dropout_prob = 0.3
loss_type = BPR
MODEL_TYPE = ModelType.SEQUENTIAL
MODEL_INPUT_TYPE = InputType.PAIRWISE
eval_type = EvaluatorType.RANKING
device = cuda
train_neg_sample_args = {'strategy': 'by', 'by': 2, 'distribution': 'uniform'}
eval_neg_sample_args = {'strategy': 'full', 'distribution': 'uniform'}
Here is my log with GRU4Rec. Did you use a sequential model?
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 1883
state = INFO
reproducibility = True
data_path = ../data/diginetica
show_progress = True
save_dataset = False
save_dataloaders = False
benchmark_filename = ['train', 'valid', 'test']
Training Hyper Parameters:
checkpoint_dir = saved
epochs = 300
train_batch_size = 512
learner = adam
learning_rate = 0.001
eval_step = 1
stopping_step = 5
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4
Evaluation Hyper Parameters:
eval_args = {'split': {'LS': 'valid_and_test'}, 'order': 'TO', 'mode': 'full', 'group_by': 'user'}
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [10, 20]
valid_metric = MRR@20
valid_metric_bigger = True
eval_batch_size = 4096
metric_decimal_place = 4
Dataset Hyper Parameters:
field_separator =
seq_separator =
USER_ID_FIELD = session_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = timestamp
seq_len = None
LABEL_FIELD = label
threshold = None
NEG_PREFIX = neg_
load_col = {'inter': ['session_id', 'item_id', 'timestamp', 'item_id_list'], 'item': ['item_id', 'category_id', 'name_tokens', 'pricelog2']}
unload_col = None
unused_col = None
additional_feat_suffix = None
rm_dup_inter = None
val_interval = None
filter_inter_by_user_or_item = True
user_inter_num_interval = [0,inf)
item_inter_num_interval = [0,inf)
alias_of_user_id = None
alias_of_item_id = ['item_id_list']
alias_of_entity_id = None
alias_of_relation_id = None
preload_weight = None
normalize_field = None
normalize_all = None
ITEM_LIST_LENGTH_FIELD = item_length
LIST_SUFFIX = _list
MAX_ITEM_LIST_LENGTH = 50
POSITION_FIELD = position_id
HEAD_ENTITY_ID_FIELD = head_id
TAIL_ENTITY_ID_FIELD = tail_id
RELATION_ID_FIELD = relation_id
ENTITY_ID_FIELD = entity_id
Other Hyper Parameters:
neg_sampling = {'uniform': 2}
repeatable = True
embedding_size = 128
hidden_size = 128
num_layers = 1
dropout_prob = 0.3
loss_type = BPR
MODEL_TYPE = ModelType.SEQUENTIAL
n_layers = 2
reg_weight = 1e-05
MODEL_INPUT_TYPE = InputType.PAIRWISE
eval_type = EvaluatorType.RANKING
device = cpu
train_neg_sample_args = {'strategy': 'by', 'by': 2, 'distribution': 'uniform'}
eval_neg_sample_args = {'strategy': 'full', 'distribution': 'uniform'}
07 Apr 13:43 INFO diginetica
The number of users: 6389
Average actions of users: 1.0
The number of items: 6471
Average actions of items: 1.3505285412262156
The number of inters: 6388
The sparsity of the dataset: 99.9845488567303%
Remain Fields: ['session_id', 'item_id', 'timestamp', 'item_id_list', 'category_id', 'name_tokens', 'pricelog2', 'item_length']
07 Apr 13:43 INFO GRU4Rec(
(item_embedding): Embedding(6471, 128, padding_idx=0)
(emb_dropout): Dropout(p=0.3, inplace=False)
(gru_layers): GRU(128, 128, bias=False, batch_first=True)
(dense): Linear(in_features=128, out_features=128, bias=True)
(loss_fct): BPRLoss()
)
@pintonos Yes, I tried it, and there was no problem. Perhaps something went wrong in your Python environment when you modified something, so I suggest installing the latest version of RecBole from source. Here is my reproduction environment. Dataset: diginetica (you can split it into train, valid and test files after adding the field 'item_id_list:token_seq'). Yaml:
model: GRU4Rec
benchmark_filename: ['train', 'valid', 'test']
alias_of_item_id: [item_id_list]
USER_ID_FIELD: session_id
ITEM_ID_FIELD: item_id
TIME_FIELD: timestamp
load_col:
  inter: [session_id, item_id, timestamp, item_id_list]
  item: [item_id, item_category, item_name, item_priceLog2]
neg_sampling: {'uniform': 2}
loss_type: BPR
And command: python run_recbole.py --model=GRU4Rec --dataset=diginetica --config_files=test.yaml
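The thread does not say how the dataset was split into the three benchmark files, so the following is only one possible sketch. It assumes a simple in-order 80/10/10 split of the rows, repeats the header in every part, and assumes the output files are named dataset.part.inter to match the entries of benchmark_filename. The function names and the toy rows are made up for illustration.

```python
from pathlib import Path
from typing import Dict, List, Sequence


def split_atomic_lines(lines: List[str],
                       ratios: Sequence[float] = (0.8, 0.1, 0.1),
                       names: Sequence[str] = ("train", "valid", "test"),
                       ) -> Dict[str, List[str]]:
    """Split header + rows into three parts, keeping the original row order."""
    header, rows = lines[0], lines[1:]
    n_rows = len(rows)
    # Sizes are rounded down; any leftover rows go to the last part.
    n_train = int(n_rows * ratios[0])
    n_valid = int(n_rows * ratios[1])
    bounds = [0, n_train, n_train + n_valid, n_rows]
    return {
        name: [header] + rows[bounds[i]:bounds[i + 1]]
        for i, name in enumerate(names)
    }


def write_benchmark_files(parts: Dict[str, List[str]], out_dir: str,
                          dataset: str) -> List[Path]:
    """Write each part to <out_dir>/<dataset>.<part>.inter (assumed naming)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, part_lines in parts.items():
        path = out / f"{dataset}.{name}.inter"
        path.write_text("\n".join(part_lines) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Toy data in the same column layout as the post (not real Diginetica rows).
header = "session_id:token\titem_id:token\ttimestamp:float\titem_id_list:token_seq"
toy_rows = [f"{i}\t{i + 1}\t{1462098192 + i}\t{i + 2}" for i in range(10)]
parts = split_atomic_lines([header] + toy_rows)
```

Calling write_benchmark_files(parts, "../data/diginetica", "diginetica") would then produce the three files next to the item file, under the naming assumption stated above.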
Hi, I am trying to apply LightGCN to sequential recommendation using given benchmark dataset files. I thought the following setting would do the trick:
config['MODEL_TYPE'] = ModelType.SEQUENTIAL
Since I have to use BPR and negative sampling for this model (as recommended), the dataloader tries to sample negative items, but the sampler is None. Do you know how to use LightGCN correctly as a sequential model? Thanks!
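For context on what the sampler is expected to provide, here is a minimal sketch of uniform negative sampling paired with a pairwise BPR loss, written in plain Python. It is not RecBole's implementation. The function names are hypothetical, item id 0 is assumed to be reserved for padding (consistent with padding_idx=0 in the model printout above), and the scores passed to the loss are made-up numbers.

```python
import math
import random
from typing import Dict, List, Set


def sample_uniform_negatives(user_ids: List[int],
                             interacted: Dict[int, Set[int]],
                             n_items: int,
                             num_neg: int,
                             rng: random.Random) -> List[int]:
    """Draw num_neg items per user, uniformly from items the user has not seen."""
    negatives = []
    for user in user_ids:
        seen = interacted.get(user, set())
        # Valid item ids are 1..n_items-1; id 0 is assumed to be padding.
        if len(seen) >= n_items - 1:
            raise ValueError(f"user {user} has no unseen items to sample")
        for _ in range(num_neg):
            candidate = rng.randrange(1, n_items)
            while candidate in seen:
                # Rejection sampling: redraw until the item is unseen.
                candidate = rng.randrange(1, n_items)
            negatives.append(candidate)
    return negatives


def bpr_loss(pos_score: float, neg_score: float) -> float:
    """Pairwise BPR loss, -log(sigmoid(pos - neg)), in a numerically stable form."""
    diff = pos_score - neg_score
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


rng = random.Random(0)
history = {1: {1, 2}, 2: {3}}
negs = sample_uniform_negatives([1, 2], history, n_items=5, num_neg=2, rng=rng)

# Equal scores give log(2); a higher positive score gives a smaller loss.
loss_equal = bpr_loss(0.0, 0.0)
loss_better = bpr_loss(2.0, 0.0)
```

The loss only needs one positive and one negative score per pair, which is why a pairwise model cannot train when the dataloader has no sampler to supply the negative items.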