NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

[BUG] Incorrect scores for evaluation #746

Open · SPP3000 opened this issue 10 months ago

SPP3000 commented 10 months ago

Bug Description

It appears that a significant issue affects XLNet with CLM, and potentially other models as well. When using the trainer's evaluate method, the NDCG and MRR scores are near-perfect even after a single training epoch. Upon inspecting the evaluation process, it seems the model is able to predict the held-out item_id almost perfectly, most likely due to information leakage.

This bug affects the trainer.evaluate method and, consequently, every eval_steps evaluation during training, so the automatic best-model-saving procedure selects checkpoints based on incorrect scores.

Steps/Code to Reproduce the Bug

To replicate this issue, you can refer to the code provided here, which is based on the Yoochoose e-commerce dataset example.

In 01-ETL-with-NVTabular, the dataset is randomly split into a training and a validation set. The validation set is then duplicated and transformed into a test set that contains the same sessions as the validation set, but with the last item removed from each sequence (a sketch of this transformation is shown below). The transformation is simple because item_id is the only input feature of the transformer being trained.
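For reference, a minimal sketch of that test-set construction. This is not the exact notebook code; the parquet paths and the `item_id-list` column name are assumptions based on the Yoochoose example:

```python
# Hypothetical sketch: copy the validation set and drop the last item from
# every session sequence. Paths and the list-column name are assumptions.
import pandas as pd

valid = pd.read_parquet("preproc_sessions/valid.parquet")

test = valid.copy()
test["item_id-list"] = test["item_id-list"].apply(lambda seq: list(seq)[:-1])

# Drop sessions that become empty after removing the last item
test = test[test["item_id-list"].map(len) > 0]
test.to_parquet("preproc_sessions/test.parquet")
```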

In 02-End-to-End-Session-Based-with-Evaluation, an XLNet model is trained for a next-item prediction task. According to your PR, the last item in the sequence is the one to be predicted during evaluation. After training and calling trainer.evaluate, the results show exceptionally high accuracy scores (MRR > 0.95).
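A condensed sketch of the training and evaluation setup, roughly following the example notebook; hyperparameters, paths, and the schema loading are assumptions rather than the exact repro code:

```python
# Hypothetical, condensed repro sketch: XLNet with causal (CLM) masking for
# next-item prediction, then trainer.evaluate() on the validation set.
import transformers4rec.torch as tr
from transformers4rec.config.trainer import T4RecTrainingArguments
from transformers4rec.torch import Trainer
from merlin.io import Dataset

# Schema produced by the NVTabular workflow (path is an assumption)
schema = Dataset("preproc_sessions/train.parquet").schema

# Input block with causal language modeling (CLM) masking
inputs = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=20,
    masking="clm",
)

# Small XLNet body plus a next-item prediction head
transformer_config = tr.XLNetConfig.build(
    d_model=64, n_head=4, n_layer=2, total_seq_length=20
)
model = transformer_config.to_torch_model(
    inputs, tr.NextItemPredictionTask(weight_tying=True)
)

training_args = T4RecTrainingArguments(
    output_dir="./checkpoints",
    max_sequence_length=20,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=1,
)
trainer = Trainer(
    model=model,
    args=training_args,
    schema=schema,
    compute_metrics=True,
)

trainer.train_dataset_or_path = "preproc_sessions/train.parquet"
trainer.eval_dataset_or_path = "preproc_sessions/valid.parquet"
trainer.train()

# Even after a single epoch, these NDCG/MRR values come back close to 1.0,
# which is the suspiciously high result reported above.
print(trainer.evaluate(metric_key_prefix="eval"))
```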

To rule out the possibility that the validation scores are inflated by similar or identical entries across the splits, the trainer class is used to make predictions on the test set (which, as described above, mirrors the validation set but with the last item removed from each sequence). Computing the MRR manually from these predictions against the removed items yields a far more plausible score of MRR ≈ 0.2.
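A sketch of that manual check, again hedged: it assumes trainer.predict accepts the test parquet and returns one full-catalogue score vector per session in `.predictions`, with rows aligned to the validation set (newer versions may instead return only top-k scores):

```python
# Hypothetical sketch of the manual MRR computation described above.
import numpy as np

pred_output = trainer.predict("preproc_sessions/test.parquet")  # test = valid minus last item
scores = pred_output.predictions                                # assumed shape: (num_sessions, num_items)

# Ground truth: the item that was removed from each sequence when building the test set
true_items = np.array([list(seq)[-1] for seq in valid["item_id-list"]])

# Rank of the true item among all items (1 = best), then reciprocal-rank average
true_scores = scores[np.arange(len(true_items)), true_items]
ranks = (scores > true_scores[:, None]).sum(axis=1) + 1
mrr = np.mean(1.0 / ranks)
print(f"Manual MRR: {mrr:.3f}")  # comes out around 0.2 here vs. >0.95 from trainer.evaluate
```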

Environment Details

This bug persists across different versions and has been observed in the following environment:

Additional Context

I hope this issue is simply due to a coding mistake on my part. However, if it turns out to be a genuine bug, I recommend treating it as a high-priority matter. Thank you for your amazing support!