Closed petrov826 closed 2 years ago
A super kind Kaggle user told me that the MAP of RecBole and that of the competition are different. I'll define a custom MAP and check the gap again.
If you have other information, please let me know.
I checked the formula of MAP in the competition, as shown below. Could you explain the difference between `n` and `12`?
Thank you for your reply @guijiql!
For most cases, `n` is 12. In this competition, we are asked to recommend the top 12 items per customer. We CAN recommend only the top 3 items if we want, but we would lose the chance of getting a higher score, so no one does that.
This is the official comment from competition organizer.
There is never a penalty for using the full 12 predictions for a customer that ordered fewer than 12 items; thus, it's advantageous to make 12 predictions for each customer.
Thanks for your explanation. If `n` is 12, the formula for MAP in the competition is exactly the same as the calculation in RecBole. There is a typo in the documentation, i.e. `min(|\hat R(u)|, K)` should be `min(|R(u)|, K)`. I don't think your problem is caused by a difference in the MAP formula.
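To make the equivalence concrete, here is a minimal, self-contained sketch of MAP@K as the competition defines it (the names `apk`/`mapk` are illustrative helpers, not part of RecBole's API):

```python
def apk(actual, predicted, k=12):
    """Average precision at k:
    AP@k = (1 / min(|R(u)|, k)) * sum over hit positions i of (hits so far / i).
    `actual` is the set of relevant items, `predicted` the ranked recommendations."""
    if not actual:
        return 0.0
    predicted = predicted[:k]
    hits, score, seen = 0, 0.0, set()
    for i, p in enumerate(predicted, start=1):
        if p in actual and p not in seen:  # count each relevant item once
            seen.add(p)
            hits += 1
            score += hits / i              # precision at this hit position
    return score / min(len(actual), k)     # note: |R(u)|, not |\hat R(u)|

def mapk(actuals, predictions, k=12):
    """Mean of AP@k over all users."""
    return sum(apk(a, p, k) for a, p in zip(actuals, predictions)) / len(actuals)
```

With `k=12` this matches the competition metric; the only subtlety is the denominator `min(|R(u)|, k)`, which is why padding every customer out to 12 predictions can never hurt.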
Thank you too @guijiql
No difference between the MAP formulas is great news! With MAP@k calculated correctly, my model is good enough for this competition!
Now I'm wondering whether my usage of `full_sort_topk()` might be wrong. Is there a way to get the top-k items using `model.predict()` or something similar? If I use the same prediction functions that are used to evaluate my model, the gap might disappear.
The competition ended and I got almost the same score as the LB one. It seems that both the RecBole score and Kaggle's LB score are correct.
I still don't know why there was such a huge gap between them, but I guess it comes from domain shift or something similar, because fashion trends change rapidly all the time.
Anyway, we should investigate this issue further using a more widely used dataset like MovieLens.
**Describe the bug**
I got MAP@12 = 0.148 on the eval set by running `trainer.evaluate(test_data)`, but the LB score was only 0.0124.

**To Reproduce**
submission.csv
**Expected behavior**
I'd get a much better score. The MAP@12 is 0.148. My model learned 450,255 users' interactions, and there are 1,371,980 users in the submission file. If I'm correct, the LB score should be about 0.0485 (= 0.148 * 450255 / 1371980).
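That back-of-the-envelope estimate can be checked directly. It assumes the 921,725 users outside the training set contribute a score of zero, so the covered users' MAP is simply diluted by the coverage ratio:

```python
eval_map = 0.148            # MAP@12 reported by trainer.evaluate(test_data)
covered_users = 450_255     # users whose interactions the model learned
total_users = 1_371_980     # users required in the submission file

# If uncovered users score 0, the overall LB estimate is the
# covered-user MAP weighted by the fraction of covered users.
expected_lb = eval_map * covered_users / total_users
print(f"{expected_lb:.4f}")
```

This prints roughly 0.0486, consistent with the "about 0.0485" figure in the report, and still far above the observed 0.0124.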
**My guess**
This gap may be coming from the following:
- `full_sort_topk`
**Additional information**
I've already opened a discussion on Kaggle. Here's my original post.
If "details" tags are not recommended here, please let me know. I'll fix it.
It's long. Please click to expand
Hello everyone. I made a [notebook](https://www.kaggle.com/code/peterpetrov826/fork-of-using-recbole/notebook) which uses RecBole. Surprisingly, I got MAP@12 = 0.148 on the eval set by running `trainer.evaluate(test_data)`, but the LB score was only 0.0124. Do you have any idea where this huge gap comes from?

As you may know, [RecBole](https://recbole.io/) is an open-source recommendation library. It's a kind of wrapper around PyTorch, and you can easily build about 80 models with it.

Let me share my strategy. It's difficult to make recommendations for users who rarely shop, so I extracted users who have bought more than 2 times and used them to train my model. For the other users, I recommend popular products. In general, popular products are more likely to be bought, and unpopular ones are not, so I extracted products which have been bought more than 50 times and used them to train my model. The other products I don't recommend at all. (Sorry, sewing geniuses.)

The MAP@12 is 0.148. My model learned 450,255 users' interactions, and there are 1,371,980 users in the submission file. If I'm correct, the LB score should be about 0.0485 (= 0.148 * 450255 / 1371980). I must have made a mistake somewhere.

I'm wondering whether there are some problems in my "making recommendations" section. I found [this awesome notebook](https://www.kaggle.com/code/astrung?scriptVersionId=91596049&cellId=35) which may improve my score, but there would still be a huge gap…

Another guess is that, due to my strategy, the evaluation process was done in "super easy mode", while the submission process is "extremely hard mode". Thanks for reading and taking the time!
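The filtering strategy described above can be sketched in plain Python. The thresholds (more than 2 purchases per user, more than 50 per product) are from the post; the tiny `transactions` list and the lowered item threshold are stand-ins so the toy example actually filters something:

```python
from collections import Counter

# Toy stand-in for the H&M transactions table: (customer_id, article_id) pairs.
transactions = [
    ("u1", "i1"), ("u1", "i2"), ("u1", "i1"),
    ("u2", "i1"),
    ("u3", "i2"), ("u3", "i1"), ("u3", "i3"), ("u3", "i1"),
]

user_counts = Counter(u for u, _ in transactions)
item_counts = Counter(i for _, i in transactions)

# Keep only active users (> 2 purchases) and popular items
# (> 50 purchases in the real data; > 2 here for the toy example).
active_users = {u for u, c in user_counts.items() if c > 2}
popular_items = {i for i, c in item_counts.items() if c > 2}

# Interactions used to train the model; everything else falls back
# to a popularity recommendation (or no recommendation for rare items).
train = [(u, i) for u, i in transactions
         if u in active_users and i in popular_items]
```

This is exactly what makes offline evaluation "super easy mode": the eval set only contains active users and popular items, while the submission covers all 1,371,980 customers.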