casperhansen / NeuHash-CF

Content-aware Neural Hashing for Cold-start Recommendation. SIGIR 2020

How the datasets are generated? #1

Open wandli opened 3 years ago

wandli commented 3 years ago

Thanks for the wonderful work.

The code is clear and well-formatted. However, I wonder if you could explain more about how the datasets are processed.

Specifically, how is a dataset turned into a "cold" one?

From my understanding, a cold-start user or item should have very little rating information. However, the paper says that "We remove users who have rated fewer than 20 items, as well as items that have been rated by fewer than 20 users", and this was done for all datasets.

So how many ratings will the cold-start users/items have? If the number is as large as for non-cold-start users/items, that would be very strange.

It would be much appreciated if you could share the dataset processing code.

Thank you!

casperhansen commented 3 years ago

Hi,

Thanks for your question. I will update the repo as soon as possible, but in the meantime here is a link to the cold start processing code: https://www.dropbox.com/s/bctaremu0rhr44w/txt2mat-coldstart.py?dl=0

In the cold-start setting, we split the item ids such that no test items occur during training, whereas in the standard setting we split the ratings associated with each user. So how the data is split depends on the setting, but the train+val and test sets still contain the same number of ratings in both settings (splitting globally by 50% of items for testing, or locally per user by 50%, gives the same total number). Note that the removal of users/items with fewer than 20 items/users was done in both the cold-start and the standard setting.
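The two splitting strategies can be sketched as follows. This is a minimal illustration of the idea described above, not the repository's actual processing code; the toy `ratings` data, function names, and the 50% split fraction are assumptions for the example:

```python
import random

# Toy ratings: a list of (user_id, item_id) interactions.
rng = random.Random(0)
ratings = [(u, i) for u in range(10) for i in rng.sample(range(100), 30)]

def cold_start_split(ratings, test_frac=0.5, seed=0):
    """Cold-start setting: split globally by item id, so that
    no test item ever appears in the training set."""
    r = random.Random(seed)
    items = sorted({i for _, i in ratings})
    r.shuffle(items)
    test_items = set(items[: int(len(items) * test_frac)])
    train = [(u, i) for u, i in ratings if i not in test_items]
    test = [(u, i) for u, i in ratings if i in test_items]
    return train, test

def standard_split(ratings, test_frac=0.5, seed=0):
    """Standard setting: split locally per user, dividing each
    user's rated items between train and test."""
    r = random.Random(seed)
    by_user = {}
    for u, i in ratings:
        by_user.setdefault(u, []).append(i)
    train, test = [], []
    for u, items in by_user.items():
        r.shuffle(items)
        cut = int(len(items) * test_frac)
        test += [(u, i) for i in items[:cut]]
        train += [(u, i) for i in items[cut:]]
    return train, test

train_c, test_c = cold_start_split(ratings)
train_s, test_s = standard_split(ratings)

# In the cold-start split the train and test item sets are disjoint;
# in the standard split they typically overlap, since every user
# contributes ratings to both sides.
assert {i for _, i in train_c}.isdisjoint({i for _, i in test_c})
```

Under both strategies every original rating lands in exactly one of the two sides, which is why the train+val/test sizes come out the same in both settings.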

I hope this answers your question.