YuhanZhen / WSDM23-DGNNs--for-Session-based-Recommendation


Some problems about metrics #3

Open xxijian opened 1 year ago

xxijian commented 1 year ago

Dear author,

My research direction is session-based recommendation. After reading your paper, I have a few small questions that I hope you can help me answer. Thank you so much in advance:

In the code, the evaluation metric is MRR, but it is output as "MMR." This is an output error in both the SR-GNN and GCE-GNN codes, and should be corrected to MRR, correct?

In the GCE-GNN code, after calculating the average hit rate, it is multiplied by 100, which should represent HR (hit ratio), right? However, the model's print output labels it "Recall". Additionally, the papers for both SR-GNN and GCE-GNN use P (precision) as the evaluation metric, while your paper uses HR. I would like to know how you obtained your HR metric and whether it is simply not computed in the code. Do you also know how they derived the P metric in their papers?

I'm really puzzled by these metrics, and have been struggling with them for a long time. Your paper has been very helpful to me in my work, and I would greatly appreciate it if you could help me answer these questions. Thank you very much!

YuhanZhen commented 1 year ago

Thank you for your questions. I greatly appreciate you pointing out the typo in both our work and SR-GNN. Yes, you are right: the correct evaluation metric is MRR, not MMR. I will correct it right away. As for the second question, in my view precision, recall, and hit rate are mathematically different evaluation metrics. However, many papers on next-item recommendation do not strictly distinguish them; for predicting a single next item, they actually express the same thing. I am sorry for the confusion, and I hope this answers your questions.
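For later readers, here is a minimal sketch of how these metrics are commonly computed in an SR-GNN-style evaluation loop for next-item recommendation. It is not the repository's exact code, and the names `scores` and `targets` are only illustrative (one score vector over all items per test session, and one ground-truth next item per session). With a single ground-truth item per session, Hit@K, Recall@K, and the "P@K" reported in those papers all reduce to the fraction of sessions whose true next item appears in the top-K list, which appears to be why they are used interchangeably:

```python
import numpy as np

def evaluate(scores, targets, k=20):
    """Hypothetical Hit@K / MRR@K computation for next-item recommendation."""
    hits, mrrs = [], []
    for score, target in zip(scores, targets):
        top_k = np.argsort(score)[::-1][:k]   # indices of the K highest-scored items
        if target in top_k:
            hits.append(1.0)                  # true next item is in the top-K
            rank = int(np.where(top_k == target)[0][0]) + 1
            mrrs.append(1.0 / rank)           # reciprocal rank of the true item
        else:
            hits.append(0.0)
            mrrs.append(0.0)
    # Reported as percentages, as in the training logs quoted in this thread.
    return 100.0 * np.mean(hits), 100.0 * np.mean(mrrs)

# Toy usage: two sessions, four candidate items.
scores = np.array([[0.1, 0.9, 0.3, 0.5],
                   [0.8, 0.2, 0.7, 0.1]])   # one score row per session
targets = [1, 2]                            # true next item per session
print(evaluate(scores, targets, k=2))       # -> (100.0, 75.0): both hit, ranks 1 and 2
```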

xxijian commented 1 year ago

First of all, thank you for your timely response to my question about the evaluation metrics. I understand the issue now. However, I have encountered some problems while reproducing the model:

  1. I ran your code twice, on two different servers, without changing any hyperparameters, and the results do not look normal. Below are my reproduction results.

     For the Diginetica dataset:
     First run:
     2023/05/11 06:59:21 - main - INFO - 163 - main - Best Result:
     2023/05/11 06:59:21 - main - INFO - 164 - main - Recall@20: 74.5342 MRR@20: 50.7897 Epoch: 6, 8
     Second run:
     2023/05/11 00:34:51 - main - INFO - 165 - main - Best Result:
     2023/05/11 00:34:51 - main - INFO - 166 - main - Recall@20: 74.2713 MRR@20: 50.6816 Epoch: 8, 11

     For the Gowalla dataset:
     First run:
     2023/05/10 19:07:27 - main - INFO - 165 - main - Best Result:
     2023/05/10 19:07:27 - main - INFO - 166 - main - Recall@20: 70.6783 MRR@20: 54.5618 Epoch: 5, 5
     Second run:
     2023/05/11 06:12:20 - main - INFO - 163 - main - Best Result:
     2023/05/11 06:12:20 - main - INFO - 164 - main - Recall@20: 69.8523 MRR@20: 52.9815 Epoch: 5, 8

     For the yoochoose1_64 dataset:
     First run:
     2023/05/11 03:13:01 - main - INFO - 165 - main - Best Result:
     2023/05/11 03:13:01 - main - INFO - 166 - main - Recall@20: 18.9917 MRR@20: 7.7932 Epoch: 31, 31
     Second run:
     2023/05/11 11:28:09 - main - INFO - 163 - main - Best Result:
     2023/05/11 11:28:09 - main - INFO - 164 - main - Recall@20: 24.7343 MRR@20: 10.2891 Epoch: 43, 46

     The performance on the first two datasets is much higher than all of the baseline algorithms, and your paper does not report such high numbers, even though the figures it does report are already relatively high. Why didn't you include these metrics in your paper? Additionally, the training results you posted on GitHub for the yoochoose1_64 dataset are very different from mine: my results did not even reach one-third of some baseline models, while your posted results are much higher than both the other baseline algorithms and the metrics in your paper.

  2. You built your code on top of SR-GNN, but SR-GNN shuffles the slices randomly when obtaining them, whereas in your source code that part is commented out, so there is no random shuffling. I believe this is the main reason for the high performance (see the sketch at the end of this comment). If a session like [1,2,3,4,5] is split into sub-sessions [1,2], [1,2,3], etc. without shuffling, this causes problems during training and prediction: for example, if the user's clicks are [1,2,3,4,5] and we want to predict their sixth click, the same data-augmentation operation is also performed at prediction time. If the order is not shuffled, data leakage occurs when predicting on the sub-sessions after splitting; that is, when we want to predict [1,2] -> "?", the model easily predicts 3 because the sub-session was split from [1,2,3,4,5], resulting in very high recommendation performance.

     Below are my results when I shuffle the data in the same way as the SR-GNN code (i.e., not commenting out np.random.shuffle(shuffled_arg) on line 58 of utils.py):

     diginetica:
     2023/05/11 16:15:19 - main - INFO - 165 - main - Best Result:
     2023/05/11 16:15:19 - main - INFO - 166 - main - Recall@20: 50.0986 MRR@20: 16.5251 Epoch: 8, 8
     Gowalla:
     2023/05/11 13:46:22 - main - INFO - 165 - main - Best Result:
     2023/05/11 13:46:22 - main - INFO - 166 - main - Recall@20: 50.5189 MRR@20: 25.1143 Epoch: 8, 8
     yoochoose1_64:
     2023/05/11 13:08:45 - main - INFO - 165 - main - Best Result:
     2023/05/11 13:08:45 - main - INFO - 166 - main - Recall@20: 69.7807 MRR@20: 30.2592 Epoch: 5, 6

     With shuffling, the model roughly matches the performance of SR-GNN, and is even worse in some cases. Please explain these issues, because this paper was published at WSDM and will have a significant impact on follow-up research in this field.
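To make point 2 concrete, here is a minimal, hypothetical sketch of the prefix-splitting ("data augmentation") step and the shuffle step I am referring to. The function and variable names are illustrative and this is not the exact code of either repository; only the np.random.shuffle(shuffled_arg) call mirrors the line in SR-GNN's utils.py:

```python
import numpy as np

def augment(sessions):
    """Split every session [i1, ..., in] into (prefix, target) pairs:
    ([i1], i2), ([i1, i2], i3), ..., ([i1, ..., i(n-1)], in)."""
    inputs, targets = [], []
    for seq in sessions:
        for cut in range(1, len(seq)):
            inputs.append(seq[:cut])
            targets.append(seq[cut])
    return inputs, targets

inputs, targets = augment([[1, 2, 3, 4, 5], [6, 7, 8]])
# Without shuffling, the sub-sessions split from one original session sit next
# to each other in `inputs`, e.g. [1], [1,2], [1,2,3], [1,2,3,4], then [6], [6,7].

# The shuffle step referred to above (line 58 of utils.py in SR-GNN), which is
# commented out in the repository under discussion:
shuffled_arg = np.arange(len(inputs))
np.random.shuffle(shuffled_arg)
inputs = [inputs[i] for i in shuffled_arg]
targets = [targets[i] for i in shuffled_arg]
```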

xxijian commented 1 year ago

Please explain these issues. Thanks.

YuhanZhen commented 1 year ago

Sorry for the late reply, and thanks for the notification. I will check it.