RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

[🐛BUG] Errors when using BERT4Rec on custom datasets #1824

Closed MasterEndless closed 1 year ago

MasterEndless commented 1 year ago

Describe the bug
I want to run BERT4Rec on my custom dataset, and I got the following error:

Traceback (most recent call last):
  File "run_recbole.py", line 54, in <module>
    run_recbole(
  File "/home/ubuntu/qhanliu/disk2/RecBole/recbole/quick_start/quick_start.py", line 91, in run_recbole
    best_valid_score, best_valid_result = trainer.fit(
  File "/home/ubuntu/qhanliu/disk2/RecBole/recbole/trainer/trainer.py", line 437, in fit
    train_loss = self._train_epoch(
  File "/home/ubuntu/qhanliu/disk2/RecBole/recbole/trainer/trainer.py", line 243, in _train_epoch
    losses = loss_func(interaction)
  File "/home/ubuntu/qhanliu/disk2/RecBole/recbole/model/sequential_recommender/bert4rec.py", line 160, in calculate_loss
    seq_output = self.forward(masked_item_seq)
  File "/home/ubuntu/qhanliu/disk2/RecBole/recbole/model/sequential_recommender/bert4rec.py", line 115, in forward
    item_seq.size(1), dtype=torch.long, device=item_seq.device
  File "/home/ubuntu/qhanliu/disk2/RecBole/recbole/data/interaction.py", line 133, in __getattr__
    raise AttributeError(f"'Interaction' object has no attribute '{item}'")
AttributeError: 'Interaction' object has no attribute 'size'

To Reproduce
Steps to reproduce the behavior. YAML config file:

Atomic File Format

field_separator: "\t"    # (str) Separator of different columns in atomic files.
seq_separator: " "       # (str) Separator inside the sequence features.

Dataset Config: Sequential Recommendation

gpu_id: -1
benchmark_filename: ['train', 'test', 'test']
USER_ID_FIELD: session_id
ITEM_ID_FIELD: item_id
load_col:
    inter: [session_id, item_id, item_id_list]
alias_of_item_id: ['item_id_list']
ITEM_LIST_LENGTH_FIELD: item_length
LIST_SUFFIX: _list
MAX_ITEM_LIST_LENGTH: 27

Model Config

n_layers: 2                # (int) The number of transformer layers in the transformer encoder.
n_heads: 2                 # (int) The number of attention heads for the multi-head attention layer.
hidden_size: 64            # (int) The number of features in the hidden state.
inner_size: 256            # (int) The inner hidden size in the feed-forward layer.
hidden_dropout_prob: 0.2   # (float) The probability of an element to be zeroed.
attn_dropout_prob: 0.2     # (float) The probability of an attention score to be zeroed.
hidden_act: 'gelu'         # (str) The activation function in the feed-forward layer.
layer_norm_eps: 1e-12      # (float) A value added to the denominator for numerical stability.
initializer_range: 0.02    # (float) The standard deviation for normal initialization.
mask_ratio: 0.2            # (float) The probability of an item being replaced by the MASK token.
loss_type: 'CE'            # (str) The type of loss function.
transform: mask_itemseq    # (str) The transform operation for batch data processing.
ft_ratio: 0.5              # (float) The probability of generating fine-tuning samples.

Training Settings

epochs: 1500               # (int) The number of training epochs.
train_batch_size: 2048     # (int) The training batch size.
learner: adam              # (str) The name of the used optimizer.
learning_rate: 0.003       # (float) Learning rate.
train_neg_sample_args: ~
eval_step: 1               # (int) The number of training epochs before an evaluation on the valid dataset.
stopping_step: 10          # (int) The threshold for validation-based early stopping.
clip_grad_norm: ~          # (dict) The args of clip_grad_norm, which will clip the gradient norm of the model.
weight_decay: 0.0          # (float) The weight decay value (L2 penalty) for optimizers.
loss_decimal_place: 4      # (int) The decimal place of training loss.
require_pow: False         # (bool) Whether or not to perform power operation in EmbLoss.
enable_amp: False          # (bool) Whether or not to use mixed precision training.
enable_scaler: False       # (bool) Whether or not to use GradScaler in mixed precision training.
transform: ~

My data looks like this (tab-separated columns; spaces separate the items inside item_id_list):

session_id:token   item_id_list:token_seq   item_id:token
1   2 3 4 5 6 7 8 9 10 1 11 1 12 13 14 15   5915
2   16 17 18 19 20 21 22 23 24   50022

script for running:

python run_recbole.py --model=BERT4Rec --dataset=custom_data --config_files custom_config.yaml

Expected behavior
I also ran GRU4Rec, SASRec, and CORE on the same dataset, and all of them work fine.

Sherry-XLL commented 1 year ago

Hello @MasterEndless, thanks for your attention to RecBole!

Most existing works extend the original models with specific sequence transformations or augmentation techniques. Considering this, we added a new variable transform in the dataloader to perform useful transformations for sequential models, independent of the model implementation.

MaskItemSequence is the token-masking operation proposed in natural language processing, which has also been adopted by recommender systems such as BERT4Rec. For each user's historical sequence, it replaces a proportion of the items with a special symbol [mask] to generate a masked sequence.
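For intuition, here is a minimal sketch of the masking idea in plain Python. It is illustrative only, not RecBole's actual MaskItemSequence implementation; the function name and defaults are made up for the example:

import random

def mask_item_sequence(item_seq, mask_token, mask_ratio=0.2, seed=None):
    # Replace each item with `mask_token` at rate `mask_ratio`, and record
    # the masked positions and their original items as prediction targets.
    rng = random.Random(seed)
    masked_seq, positions, targets = list(item_seq), [], []
    for pos, item in enumerate(item_seq):
        if rng.random() < mask_ratio:
            masked_seq[pos] = mask_token
            positions.append(pos)
            targets.append(item)
    return masked_seq, positions, targets

# e.g. [2, 3, 4, 5, 6] might become [2, 0, 4, 5, 0] with targets [3, 6]
print(mask_item_sequence([2, 3, 4, 5, 6], mask_token=0, seed=7))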

Based on your error report, it is possible that something went wrong during the mask conversion of the data, causing BERT4Rec to fail while other sequential models work well. You can print the interaction data to check whether the converted sequence looks normal, or provide a small test dataset for us to reproduce your problem.
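As a concrete (hedged) way to do that check without touching the library code, you can build the dataloaders the same way run_recbole.py does and print the first training batch. The model, dataset, and config names below follow your report, and depending on the RecBole version the transform is applied when batches are produced:

from recbole.config import Config
from recbole.data import create_dataset, data_preparation

# build the same config and dataloaders that run_recbole.py would build
config = Config(model="BERT4Rec", dataset="custom_data",
                config_file_list=["custom_config.yaml"])
dataset = create_dataset(config)
train_data, valid_data, test_data = data_preparation(config, dataset)

# print the first training batch; if the mask transform is active, the
# masked item-sequence fields should show up in this Interaction
print(next(iter(train_data)))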

Thanks again for your suggestions and feedback.

MasterEndless commented 1 year ago

Sure, here is a small dataset you can use for testing:

test.inter:

session_id:token   item_id_list:token_seq   item_id:token
1   2 3 4 5 6 7 8 9 10 1 11 1 12 13 14 15   5915
2   16 17 18 19 20 21 22 23 24   50022
3   45 46 47 48 49 50 51 52 53 50023 50024 50025 50026 50027 50028 50029 50030 50031 50032 50033 50034 54 54 50035 50036   50037
4   61 62 61 59   60
5   63 64 65 66 67 68 68 69 70 71   50038
6   72 78 74 75 76 77 78 72   73
7   80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105   50039
8   106 107 108 109 111 111   110
9   358 113 113   112
10   50040 50041 50042 50043 50044 50045 50046 114 50047 50048 50049 50050 50051 50052 50053 50054 114   18226

train.inter:

session_id:token   item_id_list:token_seq   item_id:token
2   2 3 4 5 6 7 8 9 10 1 11 1 12 13 14   15
3   16 17 18 19 20 21 22 23   24
4   25 26 27 28 29 30   31
5   32 33 31 31 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53   54
6   55 56 57 58 59 60   61
7   61 62 61   59
8   63 64 65 66 67   68
9   63 64 65 66 67 68 68   69
10   63 64 65 66 67 68 68 69   70
11   63 64 65 66 67 68 68 69 70   71

Thanks very much for your great help!

Sherry-XLL commented 1 year ago

Hello @MasterEndless, thanks for your timely feedback!

I created a dataset named test_bert4rec from the information you provided, and a configuration file named test_bert4rec.yaml as follows:

# dataset config
field_separator: "\t"
seq_separator: " "
USER_ID_FIELD: session_id
ITEM_ID_FIELD: item_id
RATING_FIELD: rating
TIME_FIELD: timestamp
NEG_PREFIX: neg_
ITEM_LIST_LENGTH_FIELD: item_length
LIST_SUFFIX: _list
MAX_ITEM_LIST_LENGTH: 50
POSITION_FIELD: position_id
load_col:
    inter: [session_id, item_id_list, item_id]

benchmark_filename: [train, valid, test]
alias_of_item_id: [item_id_list]

# training and evaluation
epochs: 500
train_batch_size: 4096
eval_batch_size: 40960000
valid_metric: NDCG@10
eval_args:
    split: {'LS': 'valid_and_test'}
    mode: full
    order: TO

# disable negative sampling
train_neg_sample_args: ~

The command I used is python run_recbole.py --model=BERT4Rec --dataset=test_bert4rec --config_files=test_bert4rec.yaml, and BERT4Rec worked well on this demo dataset.

[screenshot: BERT4Rec training and evaluation log on the demo dataset]

I notice that you have set transform: ~ in your configuration. However, for BERT4Rec we set the default parameter transform: mask_itemseq in recbole/properties/model/BERT4Rec.yaml. Overriding it with ~ disables the masking step, and this improper configuration leads to the AttributeError you mentioned.
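In other words, the fix is a one-line change to your training settings: delete the override, or set it back to the model default explicitly:

# in your custom config: remove `transform: ~`, or keep BERT4Rec's default
transform: mask_itemseq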

Please make sure that you have installed the latest version of RecBole (e.g., pip install --upgrade recbole) and that your configuration is consistent with the model defaults.

Thanks for your attention to RecBole, and feel free to contact us if you have further questions.

MasterEndless commented 1 year ago

Thanks for your great help! Problem solved. Before closing this issue, I have one further question. The table below shows the test performance on my own dataset (the format is the same as the demo dataset I provided above). CORE achieves much higher performance than the other methods (roughly 10x). Do you have any idea why the difference is so large? Looking forward to your reply! Thanks again~

[screenshot: test metrics of CORE and the other models on the custom dataset]

Sherry-XLL commented 1 year ago

Hello @MasterEndless, I'm glad to see that your last problem has been solved.

In terms of model performance, the reasons for such differences are complicated.

In summary, there are many possible adjustments for aligning the performance of different models, such as tuning hyper-parameters and pre-processing the dataset. I hope my answer helps you.
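For the hyper-parameter part, RecBole provides run_hyper.py. A typical setup looks like the following; the search space in hyper.test is just an illustration, not a recommendation:

# hyper.test — one parameter and its search space per line
learning_rate loguniform -8,0
train_batch_size choice [512, 1024, 2048]

python run_hyper.py --model=BERT4Rec --dataset=test_bert4rec --config_files=test_bert4rec.yaml --params_file=hyper.test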

Thanks again for your attention to RecBole!

MasterEndless commented 1 year ago

Thanks very much for your professional answers! After running run_hyper.py to tune the hyper-parameters, the other models still perform poorly; I guess the possible reason is that our data is too sparse. I will try other methods as you suggested. Thanks again for your prompt replies. Hope you have a nice day!