Closed: deeplearningnrs closed this issue 2 years ago.
I tried with ML-100k and now I get this error. Any help?
field_separator: "\t"
seq_separator: " "
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
TIME_FIELD: timestamp
NEG_PREFIX: neg_
ITEM_LIST_LENGTH_FIELD: item_length
LIST_SUFFIX: _list
MAX_ITEM_LIST_LENGTH: 200
POSITION_FIELD: position_id
load_col: {'inter': ['user_id', 'item_id', 'timestamp']}
gpu_id: 3
min_user_inter_num: 5
min_item_inter_num: 5
epochs: 5
train_window: 100
dupe_number: 1
learning_rate: 0.001
train_batch_size: 256
eval_batch_size: 256
valid_metric: NDCG@10
topk: [5,10,20,50]
eval_setting: TO_LS, pop100
training_neg_sample_num: 0
neg_sampling: ~
!python run_recbole.py --model='BERT4Rec' --dataset='ml-100k' --config_files=test.yaml
RuntimeError: Expected object of scalar type Int but got scalar type Long for sequence element 1 in sequence argument at position #1 'tensors'
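(For context, this RuntimeError is what older PyTorch releases raise when integer tensors of different widths, Int vs. Long, are passed together to torch.cat or torch.stack; newer releases promote the dtypes instead. A minimal sketch of the same failure mode, not taken from RecBole's code:)

# Minimal sketch of the dtype mismatch behind this error (not RecBole code).
import torch

a = torch.tensor([1, 2, 3], dtype=torch.int32)   # Int
b = torch.tensor([4, 5, 6], dtype=torch.int64)   # Long

try:
    print(torch.cat([a, b]))      # newer PyTorch promotes both to int64
except RuntimeError as err:       # older PyTorch raises the error shown above
    print(err)

# Explicitly casting to a common dtype avoids the mismatch on any version:
print(torch.cat([a.long(), b]))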
@deeplearningnrs Hello, thanks for your attention to RecBole!
You can set load_col with inter: and [user_id, item_id, timestamp]. For example:

topk: [5,10,20,50]
epochs: 5
loss_type: BPR
metrics: ["Recall", "Precision", "Hit", "MRR", "NDCG", "GiniIndex"]
valid_metric: MRR@5
load_col:
  inter: [user_id, item_id, timestamp]
I ran with the same configuration as you, but the 'ValueError' and 'RuntimeError' were not encountered. Can you provide more error information?
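(For reproduction, a minimal sketch of running the same setup programmatically; it assumes RecBole's recbole.quick_start.run_recbole entry point and a local test.yaml containing the settings above.)

# Minimal sketch, assuming RecBole is installed and test.yaml holds the
# configuration shown above; equivalent to the run_recbole.py command line.
from recbole.quick_start import run_recbole

run_recbole(
    model='BERT4Rec',
    dataset='ml-100k',
    config_file_list=['test.yaml'],  # hypothetical local config path
)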
@chenyuwuxin thanks for the reply. So now I am testing on ml-100k, and there are two things. First, it does not let me use loss_type: CE; it says negative sampling should be 0, which I set using training_neg_sample_num: 0 (is this correct?)
Anyway, for now I switched to BPR, and now I get this error:
Remain Fields: ['user_id', 'item_id', 'rating', 'timestamp']
23 Oct 12:58 INFO [Training]: train_batch_size = [2048] negative sampling: [{'uniform': 1}]
23 Oct 12:58 INFO [Evaluation]: eval_batch_size = [4096] eval_args: [{'split': {'LS': 'valid_and_test'}, 'order': 'TO', 'mode': 'full', 'group_by': 'user'}]
23 Oct 12:58 INFO BERT4Rec(
  (item_embedding): Embedding(1684, 64, padding_idx=0)
  (position_embedding): Embedding(51, 64)
  (trm_encoder): TransformerEncoder(
    (layer): ModuleList(
      (0): TransformerLayer(
        (multi_head_attention): MultiHeadAttention(
          (query): Linear(in_features=64, out_features=64, bias=True)
          (key): Linear(in_features=64, out_features=64, bias=True)
          (value): Linear(in_features=64, out_features=64, bias=True)
          (attn_dropout): Dropout(p=0.5, inplace=False)
          (dense): Linear(in_features=64, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (out_dropout): Dropout(p=0.5, inplace=False)
        )
        (feed_forward): FeedForward(
          (dense_1): Linear(in_features=64, out_features=256, bias=True)
          (dense_2): Linear(in_features=256, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.5, inplace=False)
        )
      )
      (1): TransformerLayer(
        (multi_head_attention): MultiHeadAttention(
          (query): Linear(in_features=64, out_features=64, bias=True)
          (key): Linear(in_features=64, out_features=64, bias=True)
          (value): Linear(in_features=64, out_features=64, bias=True)
          (attn_dropout): Dropout(p=0.5, inplace=False)
          (dense): Linear(in_features=64, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (out_dropout): Dropout(p=0.5, inplace=False)
        )
        (feed_forward): FeedForward(
          (dense_1): Linear(in_features=64, out_features=256, bias=True)
          (dense_2): Linear(in_features=256, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.5, inplace=False)
        )
      )
    )
  )
  (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.5, inplace=False)
)
Trainable parameters: 211136
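(As a sanity check, the reported parameter count can be recomputed from the module sizes printed above; a short arithmetic sketch:)

# Recomputing "Trainable parameters: 211136" from the printed module shapes.
item_emb = 1684 * 64                     # item_embedding
pos_emb = 51 * 64                        # position_embedding
linear_64_64 = 64 * 64 + 64              # Linear(64, 64) with bias
layer_norm = 2 * 64                      # LayerNorm((64,)) weight + bias
mha = 4 * linear_64_64 + layer_norm      # query, key, value, dense + LayerNorm
ffn = (64 * 256 + 256) + (256 * 64 + 64) + layer_norm  # dense_1, dense_2 + LayerNorm
total = item_emb + pos_emb + 2 * (mha + ffn) + layer_norm  # 2 layers + final LayerNorm
print(total)  # 211136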
Train 0: 100%|█████████████████████████| 48/48 [00:09<00:00, 5.23it/s, GPU RAM: 2.04 G/15.90 G]
23 Oct 12:58 INFO epoch 0 training [time: 9.18s, train loss: 22.3820]
Evaluate : 0%| | 0/1 [00:00<?, ?it/s, GPU RAM: 2.04 G/15.90 G]
Traceback (most recent call last):
File "run_recbole.py", line 25, in
I used these parameters:

n_layers: 2
n_heads: 2
hidden_size: 64
inner_size: 256
hidden_dropout_prob: 0.5
attn_dropout_prob: 0.5
hidden_act: 'gelu'
layer_norm_eps: 1e-12
initializer_range: 0.02
mask_ratio: 0.2
training_neg_sample_num: 0
loss_type: 'BPR'
epochs: 5
Any help?
@deeplearningnrs If loss_type is set to 'CE', the training task is treated as a multi-class classification task with the target item as the ground truth; in that case, negative sampling is not needed. If loss_type is set to 'BPR', training is optimized in a pair-wise way that maximizes the difference between the positive item and the negative item; in that case, negative sampling is necessary, for example by setting --neg_sampling="{'uniform': 1}".
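(To illustrate the difference between the two objectives, a minimal PyTorch sketch; the tensor names and shapes are illustrative, not RecBole's internals.)

import torch
import torch.nn.functional as F

batch, n_items = 4, 1684
scores = torch.randn(batch, n_items)          # model scores for every item
target = torch.randint(0, n_items, (batch,))  # ground-truth item per sequence

# 'CE': multi-class classification over the whole item set, no negative sampling.
ce_loss = F.cross_entropy(scores, target)

# 'BPR': pair-wise objective over one sampled negative per positive.
neg = torch.randint(0, n_items, (batch,))     # uniformly sampled negatives
pos_score = scores.gather(1, target.unsqueeze(1)).squeeze(1)
neg_score = scores.gather(1, neg.unsqueeze(1)).squeeze(1)
bpr_loss = -F.logsigmoid(pos_score - neg_score).mean()

print(ce_loss.item(), bpr_loss.item())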
I ran according to your configuration, but I didn't find any problems. This may be due to a low PyTorch version. Can you provide the complete yaml file and your PyTorch version? Our latest version only supports PyTorch 1.7.0 and above.
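(For checking the installed version, e.g.:)

# Print the installed PyTorch version; RecBole's latest release expects >= 1.7.0.
import torch
print(torch.__version__)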
I am trying to run BERT4Rec on MIND; I got the .inter file for MIND using the instructions on recbole.io. These are my parameters,
but I am getting an error.
I also get this error:
ValueError: [timestamp] is not exist in interaction [The batch_size of interaction: 5843444 user_id, torch.Size([5843444]), cpu, torch.int64 item_id, torch.Size([5843444]), cpu, torch.int64
I do not get this error with ml-100k.
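(The ValueError suggests the timestamp column was not loaded from the MIND .inter file. A minimal sketch for inspecting the header, assuming RecBole's tab-separated atomic-file format and a hypothetical path dataset/mind/mind.inter:)

# Print the header of the atomic .inter file; each column should appear as
# name:type, e.g. user_id:token, item_id:token, timestamp:float.
# The path below is hypothetical; adjust it to your data_path/dataset layout.
with open('dataset/mind/mind.inter', 'r', encoding='utf-8') as f:
    header = f.readline().strip().split('\t')
print(header)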
The main issue is the memory crash:
tcmalloc: large alloc 9269518336 bytes == 0x55ed84b08000 @ 0x7f9d61868b6b 0x7f9d61888379 0x7f9cf3723b4a 0x7f9cf37255fa 0x7f9cf5a5578a 0x7f9cf5c9e30b 0x7f9cf5ce5b37 0x7f9cf5a560b0 0x7f9cf5a5fd95 0x7f9cf5d99973 0x7f9cf5ddd709 0x7f9d3e7ccf93 0x7f9d3e5da303 0x55ed61fe0544 0x55ed61fe0240 0x55ed62054627 0x55ed61fe1afa 0x55ed6204fc0d 0x55ed6204eced 0x55ed61fe1bda 0x55ed6204fc0d 0x55ed6204eced 0x55ed61fe1bda 0x55ed62053d00 0x55ed6204eced 0x55ed61fe1bda 0x55ed6204fc0d 0x55ed6204e9ee 0x55ed61fe1bda 0x55ed6204f915 0x55ed6204e9ee
tcmalloc: large alloc 9269518336 bytes == 0x55efadb56000 @ 0x7f9d61868b6b 0x7f9d61888379 0x7f9cf3723b4a 0x7f9cf37255fa 0x7f9cf5a5578a 0x7f9cf5c9e30b 0x7f9cf5ce5b37 0x7f9cf5a560b0 0x7f9cf5a5fd95 0x7f9cf5d99973 0x7f9cf5ddd709 0x7f9d3e7ccf93 0x7f9d3e5da303 0x55ed61fe0544 0x55ed61fe0240 0x55ed62054627 0x55ed61fe1afa 0x55ed6204fc0d 0x55ed6204eced 0x55ed61fe1bda 0x55ed6204fc0d 0x55ed6204eced 0x55ed61fe1bda 0x55ed62053d00 0x55ed6204eced 0x55ed61fe1bda 0x55ed6204fc0d 0x55ed6204e9ee 0x55ed61fe1bda 0x55ed6204f915 0x55ed6204e9ee
tcmalloc: large alloc 9269518336 bytes == 0x55f1d6b70000 @ 0x7f9d61868b6b 0x7f9d61888379 0x7f9cf3723b4a 0x7f9cf37255fa 0x7f9cf5a5578a 0x7f9cf5c9e30b 0x7f9cf5ce5b37 0x7f9cf5a761f0 0x7f9cf5a77158 0x7f9cf5a7cca5 0x7f9cf58685b8 0x7f9cf5dd859a 0x7f9cf5ddd063 0x7f9cf79f7d5a 0x7f9cf5ddd063 0x7f9d3e9cdbc1 0x7f9d3e9cad96 0x55ed620c8409 0x55ed6204fe7a 0x55ed61fe1afa 0x55ed62053d00 0x55ed6204e9ee 0x55ed61fe1bda 0x55ed62050737 0x55ed6204e9ee 0x55ed61fe1bda 0x55ed62050737 0x55ed6204eced 0x55ed61fe1bda 0x55ed62053d00 0x55ed6204eced
Any help?