RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

BERT4REC and MIND #1014

Closed deeplearningnrs closed 2 years ago

deeplearningnrs commented 3 years ago

I am trying to run BERT4Rec on the MIND dataset. I generated the .inter file for MIND following the instructions on recbole.io. These are my parameters:

topk : [5,10,20,50]
epochs : 5
loss_type : BPR
metrics: ["Recall", "Precision","Hit", "MRR", "NDCG",  "GiniIndex"]
valid_metric: MRR@5
load_col:
  inter:[user_id,item_id,timestamp]

but I am getting this error:


 File "/content/RecBole/recbole/data/dataset/dataset.py", line 364, in _get_load_and_unload_col
    elif self.config['load_col'][source] == '*':
TypeError: string indices must be integers

I also get this error:

ValueError: [timestamp] is not exist in interaction [The batch_size of interaction: 5843444 user_id, torch.Size([5843444]), cpu, torch.int64 item_id, torch.Size([5843444]), cpu, torch.int64

I do not get this error with ml-100k

The main issue is the memory crash:

tcmalloc: large alloc 9269518336 bytes == 0x55ed84b08000 @ 0x7f9d61868b6b 0x7f9d61888379 0x7f9cf3723b4a 0x7f9cf37255fa 0x7f9cf5a5578a 0x7f9cf5c9e30b 0x7f9cf5ce5b37 0x7f9cf5a560b0 0x7f9cf5a5fd95 0x7f9cf5d99973 0x7f9cf5ddd709 0x7f9d3e7ccf93 0x7f9d3e5da303 0x55ed61fe0544 0x55ed61fe0240 0x55ed62054627 0x55ed61fe1afa 0x55ed6204fc0d 0x55ed6204eced 0x55ed61fe1bda 0x55ed6204fc0d 0x55ed6204eced 0x55ed61fe1bda 0x55ed62053d00 0x55ed6204eced 0x55ed61fe1bda 0x55ed6204fc0d 0x55ed6204e9ee 0x55ed61fe1bda 0x55ed6204f915 0x55ed6204e9ee
tcmalloc: large alloc 9269518336 bytes == 0x55efadb56000 @ 0x7f9d61868b6b 0x7f9d61888379 0x7f9cf3723b4a 0x7f9cf37255fa 0x7f9cf5a5578a 0x7f9cf5c9e30b 0x7f9cf5ce5b37 0x7f9cf5a560b0 0x7f9cf5a5fd95 0x7f9cf5d99973 0x7f9cf5ddd709 0x7f9d3e7ccf93 0x7f9d3e5da303 0x55ed61fe0544 0x55ed61fe0240 0x55ed62054627 0x55ed61fe1afa 0x55ed6204fc0d 0x55ed6204eced 0x55ed61fe1bda 0x55ed6204fc0d 0x55ed6204eced 0x55ed61fe1bda 0x55ed62053d00 0x55ed6204eced 0x55ed61fe1bda 0x55ed6204fc0d 0x55ed6204e9ee 0x55ed61fe1bda 0x55ed6204f915 0x55ed6204e9ee
tcmalloc: large alloc 9269518336 bytes == 0x55f1d6b70000 @ 0x7f9d61868b6b 0x7f9d61888379 0x7f9cf3723b4a 0x7f9cf37255fa 0x7f9cf5a5578a 0x7f9cf5c9e30b 0x7f9cf5ce5b37 0x7f9cf5a761f0 0x7f9cf5a77158 0x7f9cf5a7cca5 0x7f9cf58685b8 0x7f9cf5dd859a 0x7f9cf5ddd063 0x7f9cf79f7d5a 0x7f9cf5ddd063 0x7f9d3e9cdbc1 0x7f9d3e9cad96 0x55ed620c8409 0x55ed6204fe7a 0x55ed61fe1afa 0x55ed62053d00 0x55ed6204e9ee 0x55ed61fe1bda 0x55ed62050737 0x55ed6204e9ee 0x55ed61fe1bda 0x55ed62050737 0x55ed6204eced 0x55ed61fe1bda 0x55ed62053d00 0x55ed6204eced
^

Any help?

deeplearningnrs commented 3 years ago

I tried with ML-100k and now get this error, any help?


field_separator : "\t"
seq_separator : " "
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
TIME_FIELD: timestamp
NEG_PREFIX: neg_
ITEM_LIST_LENGTH_FIELD : item_length
LIST_SUFFIX: _list
MAX_ITEM_LIST_LENGTH: 200
POSITION_FIELD: position_id
load_col : {'inter': ['user_id', 'item_id', 'timestamp']}
gpu_id : 3
min_user_inter_num: 5
min_item_inter_num: 5
gpu_id: 3

epochs: 5
train_window : 100
dupe_number : 1
learning_rate : 0.001
train_batch_size: 256
eval_batch_size: 256
valid_metric: NDCG@10
topk: [5,10,20,50]
eval_setting: TO_LS, pop100
training_neg_sample_num: 0
neg_sampling: ~

!python run_recbole.py --model='BERT4Rec' --dataset='ml-100k' --config_files=test.yaml

RuntimeError: Expected object of scalar type Int but got scalar type Long for sequence element 1 in sequence argument at position #1 'tensors'

Ethan-TZ commented 3 years ago

@deeplearningnrs Hello, thanks for your attention to RecBole!

  1. For the first question, you should add a space between inter: and [user_id,item_id,timestamp]. For example:
    topk : [5,10,20,50]
    epochs : 5
    loss_type : BPR
    metrics: ["Recall", "Precision","Hit", "MRR", "NDCG",  "GiniIndex"]
    valid_metric: MRR@5
    load_col:
      inter: [user_id,item_id,timestamp]
  2. For the tcmalloc error, this is because the loaded dataset is too large and exceeds the available memory. If you run on platforms such as Colab, this problem is likely to occur. You can refer to this issue.
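
As an illustration of the first point in the list above: the traceback shows config['load_col'] being indexed with a string key, which fails if the value is a plain string instead of a mapping. The sketch below uses hypothetical stand-in values (it does not call RecBole or a YAML parser) and assumes that, without the space after the colon, the whole line is read as one string.

```python
# Hypothetical stand-ins for the parsed config values; this is not RecBole code.
# Assumption: without a space after "inter:", the line is read as one plain string.
broken_load_col = "inter:[user_id,item_id,timestamp]"
# With the space, the value is a mapping from source name to a list of columns.
fixed_load_col = {"inter": ["user_id", "item_id", "timestamp"]}


def lookup(load_col, source="inter"):
    """Mimic the indexing on the failing line: load_col[source]."""
    try:
        return load_col[source]
    except TypeError as exc:
        # Indexing a str with a str key raises the TypeError seen in the report.
        return f"TypeError: {exc}"


broken_result = lookup(broken_load_col)
fixed_result = lookup(fixed_load_col)
print(broken_result)
print(fixed_result)
```

The string case reproduces the "string indices must be integers" message, while the mapping case returns the column list.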

I ran with the same configuration as you, but did not encounter the 'ValueError' or the 'RuntimeError'. Can you provide more error information?
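
On the tcmalloc point, it can help to convert the reported allocation size into a more readable unit. The snippet below is only a unit conversion of the number printed in the log (the helper name bytes_to_gib is made up for this sketch); it says nothing about where in RecBole the allocation happens.

```python
def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


# Size reported by each "tcmalloc: large alloc" line in the log above.
alloc_bytes = 9269518336
alloc_gib = bytes_to_gib(alloc_bytes)
print(f"{alloc_gib:.2f} GiB per allocation")
```

Each of the three logged requests is therefore roughly 8.6 GiB, which is consistent with the dataset being too large for the available memory.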

deeplearningnrs commented 3 years ago

@chenyuwuxin thanks for the reply. I am now testing on ml-100k, and there are two things. First, it does not let me use loss_type: CE. It says negative sampling should be 0, which I set using training_neg_sample_num: 0 (is this correct?).

Anyway, for now I switched to BPR, and now I get this error:

Remain Fields: ['user_id', 'item_id', 'rating', 'timestamp']
23 Oct 12:58 INFO [Training]: train_batch_size = [2048] negative sampling: [{'uniform': 1}]
23 Oct 12:58 INFO [Evaluation]: eval_batch_size = [4096] eval_args: [{'split': {'LS': 'valid_and_test'}, 'order': 'TO', 'mode': 'full', 'group_by': 'user'}]
23 Oct 12:58 INFO BERT4Rec(
  (item_embedding): Embedding(1684, 64, padding_idx=0)
  (position_embedding): Embedding(51, 64)
  (trm_encoder): TransformerEncoder(
    (layer): ModuleList(
      (0): TransformerLayer(
        (multi_head_attention): MultiHeadAttention(
          (query): Linear(in_features=64, out_features=64, bias=True)
          (key): Linear(in_features=64, out_features=64, bias=True)
          (value): Linear(in_features=64, out_features=64, bias=True)
          (attn_dropout): Dropout(p=0.5, inplace=False)
          (dense): Linear(in_features=64, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (out_dropout): Dropout(p=0.5, inplace=False)
        )
        (feed_forward): FeedForward(
          (dense_1): Linear(in_features=64, out_features=256, bias=True)
          (dense_2): Linear(in_features=256, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.5, inplace=False)
        )
      )
      (1): TransformerLayer(
        (multi_head_attention): MultiHeadAttention(
          (query): Linear(in_features=64, out_features=64, bias=True)
          (key): Linear(in_features=64, out_features=64, bias=True)
          (value): Linear(in_features=64, out_features=64, bias=True)
          (attn_dropout): Dropout(p=0.5, inplace=False)
          (dense): Linear(in_features=64, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (out_dropout): Dropout(p=0.5, inplace=False)
        )
        (feed_forward): FeedForward(
          (dense_1): Linear(in_features=64, out_features=256, bias=True)
          (dense_2): Linear(in_features=256, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.5, inplace=False)
        )
      )
    )
  )
  (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.5, inplace=False)
)
Trainable parameters: 211136
Train 0: 100%|█████████████████████████| 48/48 [00:09<00:00, 5.23it/s, GPU RAM: 2.04 G/15.90 G]
23 Oct 12:58 INFO epoch 0 training [time: 9.18s, train loss: 22.3820]
Evaluate : 0%| | 0/1 [00:00<?, ?it/s, GPU RAM: 2.04 G/15.90 G]
Traceback (most recent call last):
  File "run_recbole.py", line 25, in <module>
    run_recbole(model=args.model, dataset=args.dataset, config_file_list=config_file_list)
  File "/content/RecBole/recbole/quick_start/quick_start.py", line 60, in run_recbole
    train_data, valid_data, saved=saved, show_progress=config['show_progress']
  File "/content/RecBole/recbole/trainer/trainer.py", line 334, in fit
    valid_score, valid_result = self._valid_epoch(valid_data, show_progress=show_progress)
  File "/content/RecBole/recbole/trainer/trainer.py", line 196, in _valid_epoch
    valid_result = self.evaluate(valid_data, load_best_model=False, show_progress=show_progress)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "/content/RecBole/recbole/trainer/trainer.py", line 462, in evaluate
    self.eval_collector.eval_batch_collect(scores, interaction, positive_u, positive_i)
  File "/content/RecBole/recbole/evaluator/collector.py", line 152, in eval_batch_collect
    result = torch.cat((pos_idx, pos_len_list), dim=1)
RuntimeError: Expected object of scalar type Int but got scalar type Long for sequence element 1 in sequence argument at position #1 'tensors'

I used these parameters:

n_layers: 2
n_heads: 2
hidden_size: 64
inner_size: 256
hidden_dropout_prob: 0.5
attn_dropout_prob: 0.5
hidden_act: 'gelu'
layer_norm_eps: 1e-12
initializer_range: 0.02
mask_ratio: 0.2
training_neg_sample_num: 0
loss_type: 'BPR'
epochs: 5

help.

Ethan-TZ commented 3 years ago

@deeplearningnrs If loss_type is set to 'CE', the training task is treated as a multi-class classification task with the target item as the ground truth, so negative sampling is not needed. If loss_type is set to 'BPR', the model is optimized in a pairwise way, maximizing the difference between the positive item and the negative item. In that case negative sampling is required, for example by setting --neg_sampling="{'uniform': 1}".

I ran with your configuration but did not find any problems. This may be due to an old version of PyTorch. Can you provide the complete yaml file and your PyTorch version? Our latest version only supports PyTorch above 1.7.0.
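
The difference between the two loss types can be sketched in plain Python. This is a toy illustration of the general CE and BPR formulas, not RecBole's implementation; the function names and the score values are made up for the example.

```python
import math


def cross_entropy_loss(scores, target_index):
    """Softmax cross-entropy over all candidate items.

    Every non-target item already acts as a competitor, so no sampled
    negative is required.
    """
    max_score = max(scores)  # subtract the max for numerical stability
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum_exp - scores[target_index]


def bpr_loss(pos_score, neg_score):
    """Pairwise BPR loss, -log(sigmoid(pos - neg)).

    It needs the score of one sampled negative item to compare against.
    """
    diff = pos_score - neg_score
    # -log(sigmoid(d)) is equal to log(1 + exp(-d))
    return math.log1p(math.exp(-diff))


scores = [2.0, 0.5, -1.0, 0.1]  # toy scores for four items; item 0 is the target
ce = cross_entropy_loss(scores, target_index=0)
bpr = bpr_loss(pos_score=scores[0], neg_score=scores[2])  # item 2 plays the sampled negative
print(f"CE loss: {ce:.4f}, BPR loss: {bpr:.4f}")
```

The CE function takes only the full score list and the target index, while the BPR function cannot be evaluated without a negative score, which is why the two loss types need different negative sampling settings.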