RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

[🐛BUG] Different default run_recbole results for v0.1.2 and v0.2.0 #699

Closed jlbezares closed 3 years ago

jlbezares commented 3 years ago

Describe the bug We are getting different score results using the default run_recbole for the two RecBole versions (v0.1.2 and v0.2.0).

To Reproduce With RecBole v0.1.2 and v0.2.0, run:

from recbole.quick_start import run_recbole
run_recbole(model='Pop', dataset='ml-100k')

Expected behavior I understand that the default score results should be equal (or at least very similar) across RecBole versions.

Screenshots Please find attached an image with the default score results for Pop on ml-100k using v0.1.2 and v0.2.0. It shows the run_recbole results for different devices and RecBole versions (RecBole_Different_Score_RBVersions.jpg).


tsotfsk commented 3 years ago

Hi @jlbezares. I am sorry that I didn't explain this change in detail in the 0.2.0 release, which has caused trouble for your experiment. In fact, in 0.2.0 we changed the method of ranking items that share the same score, so that users can more easily spot errors in their code and make better use of our HypOpt module. You can see the details in PR #658 and issue #622. When we submitted the PR, we selectively tested several kinds of neural network models, but unfortunately we overlooked Pop.

As is well known, Pop is a rule-based, non-personalized recommendation algorithm. For the ml-100k dataset, there are only 32 distinct item-frequency values. This means that, unlike a deep neural network model, Pop assigns the same score to a large number of items, so the method used to rank these tied items matters. In 0.1.2 we used the min-rank method, which assigns every tied item the minimum rank of the group, while in 0.2.0 we switched to the max-rank method, which evaluates the model more reasonably.
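To make the difference concrete, here is a minimal sketch (illustrative only, not RecBole's actual implementation) of how min-rank and max-rank assign ranks to a group of tied items:

```python
import numpy as np

def rank_with_ties(scores, method="max"):
    """Rank items by descending score; tied items share the min or max rank of their group."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")   # indices sorted by descending score
    ranks = np.empty(len(scores), dtype=int)
    pos = 1
    i = 0
    while i < len(order):
        # collect the block of items tied at the current score
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        group = order[i:j + 1]
        rank = pos if method == "min" else pos + len(group) - 1
        ranks[group] = rank
        pos += len(group)
        i = j + 1
    return ranks

print(rank_with_ties([0, 0, 0, 0, 0], method="min"))  # [1 1 1 1 1]  (v0.1.2 behaviour)
print(rank_with_ties([0, 0, 0, 0, 0], method="max"))  # [5 5 5 5 5]  (v0.2.0 behaviour)
```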

For example, if five items all have scores [0, 0, 0, 0, 0], in 0.1.2 they all rank 1st, but in 0.2.0 they all rank 5th. In this case the latter is clearly more reasonable than the former. Likewise, if five items all have scores [1, 1, 1, 1, 1], they still rank 5th, because the model has not distinguished positive items from negative ones.
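This is also why the reported metrics differ between versions. A toy illustration (my own, not RecBole's evaluation code): if the ground-truth item is tied with four others, its rank is 1 under min-rank but 5 under max-rank, which directly changes per-user metrics such as Hit@1 and MRR:

```python
for name, rank in [("min-rank (v0.1.2)", 1), ("max-rank (v0.2.0)", 5)]:
    hit_at_1 = 1 if rank <= 1 else 0   # did the ground-truth item land in the top 1?
    mrr = 1.0 / rank                   # reciprocal rank of the ground-truth item
    print(f"{name}: Hit@1={hit_at_1}, MRR={mrr:.2f}")
```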

In fact, our team is actively discussing this issue, and we will consider further improvements in later versions so that users can choose the tie-breaking rank method they prefer, just like pandas.Series.rank.
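For reference, pandas.Series.rank already exposes the kind of configurable tie handling we have in mind (it ranks ascending by default, but with all-tied scores only the tie method matters here):

```python
import pandas as pd

scores = pd.Series([0, 0, 0, 0, 0])
print(scores.rank(method="min"))   # every item gets rank 1.0 (like v0.1.2)
print(scores.rank(method="max"))   # every item gets rank 5.0 (like v0.2.0)
```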

BTW, according to my personal tests, the results of neural network models such as BPR (MF) and NeuMF are identical between 0.1.2 and 0.2.0. Moreover, on large-scale datasets such as Yelp, even Pop gives identical results. Therefore, this problem only occurs when a large number of items share the same score, which is not common for neural network models.