RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

Clarification on BERT4Rec #1530

Open AoiDragon opened 1 year ago

AoiDragon commented 1 year ago

Several issues (#1132, #1443) reported that the previous version of BERT4Rec in RecBole did not match the original results: the test scores were much lower than those reported in the paper and in other developers' implementations.

After careful checking, we found that the previous version of BERT4Rec didn't append [MASK] at the end of sequences during the training process, which works as a 'fine-tuning' step for sequential recommendation. The original paper describes this part as follows:

As described above, we create a mismatch between the training and the final sequential recommendation task since the Cloze objective is to predict the current masked items while sequential recommendation aims to predict the future. To address this, we append the special token "[mask]" to the end of user's behavior sequence, and then predict the next item based on the final hidden representation of this token. To better match the sequential recommendation task (i.e., predict the last item), we also produce samples that only mask the last item in the input sequences during training. It works like fine-tuning for sequential recommendation and can further improve the recommendation performances.
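The inference-time trick quoted above can be illustrated with a short sketch (a hypothetical illustration, not RecBole's actual code; `MASK_ID`, the padding convention, and the function name are assumptions):

```python
# Hypothetical sketch of the trick described above: append the special [mask]
# token to the end of the user's behavior sequence, then predict the next item
# from the hidden representation at that final position.
MASK_ID = 0   # placeholder id for the special [mask] token (assumption)
PAD_ID = -1   # placeholder id for padding (assumption)

def build_inference_input(item_seq, max_len, mask_id=MASK_ID, pad_id=PAD_ID):
    """Append [mask], keep the most recent max_len tokens, and left-pad."""
    seq = list(item_seq) + [mask_id]
    seq = seq[-max_len:]                        # truncate to the most recent items
    seq = [pad_id] * (max_len - len(seq)) + seq
    mask_pos = len(seq) - 1                     # the model predicts from this position
    return seq, mask_pos
```

The model's output at `mask_pos` is then scored against the candidate items, exactly as in the Cloze objective used during training.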

We have fixed this bug in #1522 and observed a large performance gain with the modified version. Our implementation now completely follows the original paper, but slight numerical differences may remain due to different parameter configurations and hardware.

We also noticed that the paper mentioned in #1443 reported that replacing BERT with RoBERTa can bring better results. However, we are currently focused on reproducing the original method, so we decided not to incorporate the other possible improvements suggested in "A Systematic Review and Replicability Study of BERT4Rec for Sequential Recommendation".

We are sorry for any inconvenience caused by this issue and will continue to improve RecBole.

asash commented 1 year ago

Hi, I am the author of the reproducibility paper mentioned in #1443. It is great that you have improved your implementation of BERT4Rec.

However, on inspection of your code and the original code, I see that the implementations don't match.

In the original code, they mask items randomly, but they also add 10% fine-tuning examples in which ONLY the last item is masked (corresponding code: https://github.com/FeiSun/BERT4Rec/blob/615eaf2004abecda487a38d5b0c72f3dcfcae5b3/gen_data_fin.py#L284). So in fact they have two types of training samples: randomly masked and last-only masked.
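The two sample types could be sketched as follows (a hypothetical illustration of the scheme, not the original repo's or RecBole's actual code; `MASK_ID` and the parameter names are assumptions):

```python
import random

# Hypothetical sketch of the original sample generation: with probability
# last_only_ratio (10% in the original code) produce a "fine-tuning" sample
# that masks ONLY the last item; otherwise produce a regular Cloze sample
# that masks each position independently. Illustrative only.
MASK_ID = 0  # placeholder id for the [mask] token (assumption)

def make_training_sample(item_seq, mask_prob=0.15, last_only_ratio=0.1, rng=random):
    seq = list(item_seq)
    labels = [None] * len(seq)   # label = original item at each masked position
    if rng.random() < last_only_ratio:
        # fine-tuning sample: mask only the final item
        labels[-1] = seq[-1]
        seq[-1] = MASK_ID
    else:
        # Cloze sample: mask positions at random
        for i, item in enumerate(seq):
            if rng.random() < mask_prob:
                labels[i] = item
                seq[i] = MASK_ID
    return seq, labels
```

Always masking random items plus the last item (as RecBole did) collapses these two sample types into one, which is the mismatch described above.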

In your code you always mask random items AND the last item, which may be a viable approach, but it doesn't exactly match the original paper.

In my experiments for the paper, I found that for at least 3 of their 4 test datasets I was able to reproduce their results without these special samples. For the fourth dataset (Beauty), I was not able to reproduce their results even with their original implementation.

Ethan-TZ commented 1 year ago

@asash Hello, thank you for pointing out the issue. After our inspection, we found that the position embedding of the last item in the previous test sequence was not correctly trained, which lowered the model's performance. Although we solved this problem by always masking the last item, there was still a difference from the original paper's implementation. After a series of improvements, our implementation is now largely consistent with the original paper, as shown in #1639. In our experiments, our implementation achieved relatively good performance even without those fine-tuning examples. Furthermore, we also support arbitrary adjustment of the proportion of these samples to further enhance model performance.

asash commented 1 year ago

Thanks for fixing this! Your library is very popular in academic research, so it is really important to keep it aligned with the original implementations.

AbdElrahmanMostafaRifaat1432 commented 1 year ago

Can anyone here help me with my problem in #1670? Any help would be appreciated.

RAzDva1 commented 1 year ago

@AoiDragon, @chenyuwuxinyou Hello, your implementation uses the sequential Dataset for BERT4Rec, which applies data_augmentation that wasn't used in the original paper. It also creates many additional samples, which may be one of the reasons for the long training time and the inconsistencies with the estimates in the original paper.
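For intuition, prefix-style augmentation can be sketched like this (a simplified illustration, not RecBole's actual data_augmentation code): a sequence of length n is expanded into n - 1 training samples, which is why the sample count and training time grow so much.

```python
# Simplified sketch of prefix-style data augmentation: one interaction
# sequence is expanded into every (prefix, next-item) pair, so a sequence
# of length n produces n - 1 training samples. Illustrative only.
def augment(item_seq):
    return [(item_seq[:i], item_seq[i]) for i in range(1, len(item_seq))]
```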

night18 commented 1 year ago

Is there any way to avoid data_augmentation for BERT4Rec? BERT4Rec frames training as a Cloze task precisely to avoid data augmentation, which makes the dataset too large.

I tried to use benchmark_filename to provide custom training data. I have read #1624, but I still don't understand how to mask the dataset and train on it.

Sherry-XLL commented 1 year ago

Hello @night18, you can refer to https://github.com/RUCAIBox/RecBole/issues/1809#issuecomment-1675802504 and https://github.com/RUCAIBox/RecBole/issues/1824#issuecomment-1675796564 for more details about using benchmark_filename to train BERT4Rec without data augmentation. Thanks for your attention to RecBole!
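For reference, a minimal sketch of the benchmark_filename route (an assumption based on the linked comments: it expects pre-split atomic files such as `<dataset>.train.inter`; verify the exact file naming and keys against the RecBole docs):

```python
# Hypothetical config sketch: pointing RecBole at pre-split files so the
# sequential Dataset skips its own data_augmentation. The key below follows
# the usage discussed in the linked issues; verify against the RecBole docs.
config_dict = {
    "benchmark_filename": ["train", "valid", "test"],  # loads <dataset>.train.inter, etc.
}

# The dict would then be passed to RecBole's quick-start entry point, e.g.:
# from recbole.quick_start import run_recbole
# run_recbole(model="BERT4Rec", dataset="my_dataset", config_dict=config_dict)
```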