This is a repository of public data sources for Recommender Systems (RS).
All of these recommendation datasets can be converted to the atomic files defined in RecBole, a unified, comprehensive and efficient recommendation library.
After converting to atomic files, you can easily use RecBole to test the performance of different recommender models on these datasets. For more information about RecBole, please refer to RecBole.
To use RecBole, you need to convert the original datasets to atomic files, a data format defined by RecBole.
We provide two ways to convert these datasets into atomic files:
1. Download the raw dataset and process it with the conversion tools provided in this repository. Please refer to conversion tools.
2. Directly download the processed atomic files from Baidu Wangpan (Password: e272) or Google Drive.
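
As a rough illustration of what an interaction atomic file can look like, here is a minimal sketch. It assumes the RecBole convention of a tab-separated file whose header entries take the form `name:type`; the field names, the toy rows, and the helper `to_inter_text` are hypothetical and are not taken from this repository's conversion tools, which should be used for the real datasets.

```python
import csv
import io

# Hypothetical raw interactions: (user, item, rating, unix timestamp).
RAW_ROWS = [
    ("u1", "i1", 4.0, 978300760),
    ("u1", "i2", 3.5, 978302109),
    ("u2", "i1", 5.0, 978301968),
]

# Header in the assumed "name:type" style of an interaction atomic file.
HEADER = ["user_id:token", "item_id:token", "rating:float", "timestamp:float"]


def to_inter_text(rows):
    """Serialize interaction tuples as tab-separated text with a typed header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


inter_text = to_inter_text(RAW_ROWS)
print(inter_text)
```

Writing `inter_text` to a file with an `.inter` extension would give a file in the assumed layout; user and item side information would go into separate files with their own typed headers.
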
Criteo: This dataset was collected from Criteo and consists of a portion of Criteo's traffic over a period of several days.
Avazu: This dataset was used in the Avazu CTR prediction contest.
iPinYou: This dataset was provided by iPinYou and contains all training datasets and leaderboard testing datasets from the three seasons of the iPinYou Global RTB (Real-Time Bidding) Bidding Algorithm Competition.
AliEC: Ali_Display_Ad_Click is a click-through rate prediction dataset for display ads shown on the Taobao website. The dataset is provided by Alibaba.
Foursquare: This dataset contains check-ins in NYC and Tokyo collected over about 10 months. Each check-in is associated with its timestamp, its GPS coordinates and its semantic meaning.
Gowalla: This dataset is from a location-based social networking website where users share their locations by checking-in, and contains a total of 6,442,890 check-ins of these users over the period of Feb. 2009 - Oct. 2010.
songfacts.com and last.fm websites. Items are songs, which are described in terms of textual description extracted from songfacts.com, and tags from last.fm.
Freesound.org. Items are sounds, which are described in terms of textual description and tags created by the sound creator at uploading time.
Book-Crossing: This dataset was collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. It contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.
GoodReads: This dataset contains reviews from the Goodreads book review website, and a variety of attributes describing the items. Critically, the datasets have multiple levels of user interaction, ranging from adding to a shelf to rating and reading.
KDD2010: This dataset was released in the KDD Cup 2010 Educational Data Mining Challenge and contains records of students submitting exercises on the systems.
EndoMondo: This is a collection of workout logs from users of EndoMondo. Data includes multiple sources of sequential sensor data such as heart rate logs, speed, GPS, as well as sport type, gender and weather conditions.
Phishing Websites: This dataset contains 30 kinds of features of 11,055 websites and labels of whether they are phishing websites or not. The websites' features include 12 address-bar based features, 6 abnormal based features, 5 HTML-and-JavaScript based features and 7 domain based features.
Behance: This is a small, anonymized version of a larger proprietary dataset about likes and image data from the community art website Behance.
DianPing: This dataset contains user reviews as well as detailed business metadata crawled from the well-known Chinese online review website DianPing.com, including 3,605,300 reviews from 510,071 users of 209,132 businesses.
Food: These datasets contain recipe details and reviews from Food.com (formerly GeniusKitchen). Data includes cooking recipes and review texts.
SN | Dataset | #User | #Item | #Interaction | Sparsity | Interaction Type | TimeStamp | User Context | Item Context | Interaction Context |
---|---|---|---|---|---|---|---|---|---|---|
1 | MovieLens | - | - | - | - | Rating | √ | √ | √ | |
2 | Anime | 73,515 | 11,200 | 7,813,737 | 99.05% | Rating [-1, 1-10] | √ | |||
3 | Epinions | 116,260 | 41,269 | 188,478 | 99.99% | Rating [1-5] | √ | √ | ||
4 | Yelp (5 versions) | - | - | - | - | Rating [1-5] | √ | √ | √ | √ |
5 | Netflix | 480,189 | 17,770 | 100,480,507 | 98.82% | Rating [1-5] | √ | |||
6 | Book-Crossing | 105,284 | 340,557 | 1,149,780 | 99.99% | Rating [0-10] | √ | √ | ||
7 | Jester | 73,421 | 101 | 4,136,360 | 44.22% | Rating [-10, 10] | ||||
8 | Douban | 738,701 | 28 | 2,125,056 | 89.73% | Rating [0,5] | √ | √ | ||
9 | Yahoo Music | 1,948,882 | 98,211 | 11,557,943 | 99.99% | Rating [0, 100] | √ | |||
10 | KDD2010 | - | - | - | - | Rating | √ | |||
11 | Amazon (2014 & 2018) | - | - | - | - | Rating [0,5] | √ | √ | ||
12 | 55,187 | 9,911 | 1,445,622 | 99.74% | - | |||||
13 | Gowalla | 107,092 | 1,280,969 | 6,442,892 | 99.99% | Check-in | √ | √ | ||
14 | Last.FM | 1,892 | 17,632 | 92,834 | 99.72% | Click | √ | |||
15 | DIGINETICA | 204,789 | 184,047 | 993,483 | 99.99% | Click | √ | √ | ||
16 | Steam | 2,567,538 | 32,135 | 7,793,069 | 99.99% | Buy | √ | √ | √ | |
17 | Ta Feng | 32,266 | 23,812 | 817,741 | 99.89% | Click | √ | √ | √ | √ |
18 | Foursquare | - | - | - | - | Check-in | √ | √ | ||
19 | Tmall | 963,923 | 2,353,207 | 44,528,127 | 99.99% | Click/Buy | √ | √ | ||
20 | YOOCHOOSE | 9,249,729 | 52,739 | 34,154,697 | 99.99% | Click/Buy | √ | √ | ||
21 | Retailrocket | 1,407,580 | 247,085 | 2,756,101 | 99.99% | View/Addtocart/Transaction | √ | |||
22 | LFM-1b | 120,322 | 3,123,496 | 1,088,161,692 | 99.71% | Click | √ | √ | √ | √ |
23 | MIND | - | - | - | - | Click | √ | |||
24 | BeerAdvocate | 33,388 | 66,055 | 1,586,614 | 99.9281% | Rating [0,5] | √ | √ | ||
25 | Behance | 63,497 | 178,788 | 1,000,000 | 99.9912% | Likes | √ | √ | ||
26 | DianPing | 542,706 | 243,247 | 4,422,473 | 99.9967% | Rating [0,5] | √ | √ | √ | |
27 | EndoMondo | 1,104 | 253,020 | 253,020 | 99.9094% | Workout Logs | √ | √ | √ | |
28 | Food | 226,570 | 231,637 | 1,132,367 | 99.9978% | Rating [0,5] | √ | √ | ||
29 | GoodReads | 876,145 | 2,360,650 | 228,648,342 | 99.9889% | Rating [0,5] | √ | √ | ||
30 | KGRec | - | - | - | - | Click | √ | |||
31 | ModCloth | 47,958 | 1,378 | 82,790 | 99.8747% | Rating [0,5] | √ | √ | √ | |
32 | RateBeer | 29,265 | 110,369 | 2,924,163 | 99.9095% | Overall Rating [0,20] | √ | √ | √ | |
33 | RentTheRunway | 105,571 | 5,850 | 192,544 | 99.9688% | Rating [0,10] | √ | √ | √ | √ |
34 | Twitch | 15,524,309 | 6,161,666 | 474,676,929 | 99.9995% | Click | √ | |||
35 | Amazon_M2 | 3,606,349 | 1,410,675 | 15,306,183 | - | Click | √ | √ | ||
36 | Music4All-Onion | 119,140 | 109,269 | 252,984,396 | - | Click | √ | √ | √ |
SN | Dataset | #User | #Item | #Interaction | Sparsity | Interaction Type | TimeStamp | User Context | Item Context | Interaction Context |
---|---|---|---|---|---|---|---|---|---|---|
1 | Criteo | - | - | 45,850,617 | - | Click | √ | |||
2 | Avazu | - | - | 40,428,967 | - | Click [0, 1] | √ | √ | ||
3 | iPinYou | 19,731,660 | 163 | 24,637,657 | 99.23% | View/Click | √ | √ | √ | |
4 | Phishing websites | - | - | 11,055 | - | √ | ||||
5 | Adult | - | - | 32,561 | - | income>=50k [0, 1] | √ | |||
6 | Alibaba-iFashion | 3,569,112 | 4,463,302 | 191,394,393 | 99.9988% | Click | √ | |||
7 | AliEC | 491,647 | 240,130 | 1,366,056 | 99.9988% | Click | √ | √ | √ |
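
The Sparsity column in the tables above appears consistent with the usual definition, one minus the number of interactions divided by the product of the user and item counts. This definition is an assumption inferred from the listed numbers, not something the repository states. A minimal sketch that checks it against two rows of the first table:

```python
def sparsity(num_users, num_items, num_interactions):
    """Fraction of the user-item matrix with no recorded interaction."""
    return 1.0 - num_interactions / (num_users * num_items)


# Figures copied from the Jester and Douban rows of the first table.
jester = sparsity(73_421, 101, 4_136_360)
douban = sparsity(738_701, 28, 2_125_056)

print(f"Jester: {jester:.2%}")  # Jester: 44.22%
print(f"Douban: {douban:.2%}")  # Douban: 89.73%
```

Rows marked with `-` lack the counts needed for this calculation, so the function does not apply to them.
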
These knowledge-aware recommender datasets are based on KB4Rec, which associates items from recommender systems with entities from Freebase. Note that the Amazon-book dataset is the version released in 2014.
Raw datasets information
SN | Dataset | #Items | #Linked-Items | #Users | #Interactions |
---|---|---|---|---|---|
1 | MovieLens | 27,278 | 25,503 | 138,493 | 20,000,263 |
2 | Amazon-book | 2,370,605 | 108,515 | 8,026,324 | 22,507,155 |
3 | LFM-1b (tracks) | 31,634,450 | 1,254,923 | 120,322 | 319,951,294 |
After filtering by 5-core (and filtering out tracks that are listened to fewer than 10 times in LFM-1b)
SN | Dataset | #Items | #Linked-Items | #Users | #Interactions |
---|---|---|---|---|---|
1 | MovieLens | 18,345 | 18,057 | 138,493 | 19,984,024 |
2 | Amazon-book | 367,982 | 34,476 | 603,668 | 8,898,041 |
3 | LFM-1b (tracks) | 615,823 | 337,349 | 79,133 | 15,765,756 |
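
The 5-core filtering mentioned above can be sketched as follows, assuming it means that every remaining user and item keeps at least five interactions. The function `k_core_filter` and the toy data are hypothetical; the exact procedure used to produce the table may differ, and the extra LFM-1b threshold on track plays is not modeled here.

```python
from collections import Counter


def k_core_filter(interactions, k=5):
    """Repeatedly drop users and items with fewer than k interactions."""
    current = list(interactions)
    while True:
        user_counts = Counter(user for user, _ in current)
        item_counts = Counter(item for _, item in current)
        kept = [
            (user, item)
            for user, item in current
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        # Stop once a full pass removes nothing.
        if len(kept) == len(current):
            return kept
        current = kept


# Toy example with k=2: item "c" is too rare, and removing it
# leaves user "u3" with a single interaction, so "u3" goes too.
toy = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "a"), ("u3", "c"),
]
filtered = k_core_filter(toy, k=2)
print(filtered)
```

The loop is needed because removing a sparse item can push a user below the threshold, and vice versa, so a single pass is generally not enough.
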