NicolasHug / Surprise

A Python scikit for building and analyzing recommender systems
http://surpriselib.com
BSD 3-Clause "New" or "Revised" License
6.28k stars, 1k forks

Wrong mapping of the raw IDs to the internal IDs #465

Open benhaf opened 1 year ago

benhaf commented 1 year ago

Hi,

Description

The mapping of the users' raw IDs to the internal IDs is not correct when the dataset contains more than 25000 rows. I tried reading the ratings both from a file and from a dataframe, and the user ID mapping is wrong in both cases. I tested several datasets. In the code below, I save the training set with its internal IDs to SupriseTrainingSet.csv and then compare Train.txt against SupriseTrainingSet.csv.

Steps/Code to Reproduce

from surprise import Dataset, KNNBasic, Reader
import pandas as pd
import csv

# files_dir and folder are defined elsewhere (not shown in this report)
train_file = files_dir + folder + "Train.txt"

reader = Reader(line_format="user item rating", sep="\t")

data = Dataset.load_from_file(train_file, reader=reader)

trainset = data.build_full_trainset()  # creates the training set from the whole dataset

with open(files_dir + folder + "SupriseTrainingSet.csv", "w", newline="") as file:
    writer = csv.writer(file)
    # write each row of data to the CSV file
    for row in trainset.all_ratings():
        writer.writerow(row)

algo = KNNBasic()
algo.fit(trainset)
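Comparing the two files is easier when each exported row also carries the raw IDs it came from. The following is a minimal sketch of that idea, relying only on the Trainset methods all_ratings(), to_raw_uid() and to_raw_iid(). The helper names rows_with_raw_ids and write_mapping_csv, the ToyTrainset stand-in, and its toy IDs are hypothetical and exist only so the sketch runs without Surprise or the original Train.txt.

```python
import csv
import io


def rows_with_raw_ids(trainset):
    """Yield (inner_uid, raw_uid, inner_iid, raw_iid, rating) for every rating."""
    for inner_uid, inner_iid, rating in trainset.all_ratings():
        yield (
            inner_uid,
            trainset.to_raw_uid(inner_uid),
            inner_iid,
            trainset.to_raw_iid(inner_iid),
            rating,
        )


def write_mapping_csv(trainset, file):
    """Write the inner/raw ID pairs and ratings to an open text file."""
    writer = csv.writer(file)
    writer.writerow(["inner_uid", "raw_uid", "inner_iid", "raw_iid", "rating"])
    writer.writerows(rows_with_raw_ids(trainset))


class ToyTrainset:
    """Hypothetical stand-in exposing the three methods used above."""

    def __init__(self):
        self._raw_users = ["userA", "userB"]
        self._raw_items = ["itemX", "itemY"]
        self._ratings = [(0, 0, 2.0), (1, 1, 4.0)]

    def all_ratings(self):
        return iter(self._ratings)

    def to_raw_uid(self, inner_uid):
        return self._raw_users[inner_uid]

    def to_raw_iid(self, inner_iid):
        return self._raw_items[inner_iid]


# With Surprise, pass the real trainset and a file opened with newline="".
buffer = io.StringIO()
write_mapping_csv(ToyTrainset(), buffer)
print(buffer.getvalue())
```

With the raw and inner IDs side by side, the check no longer depends on the two files listing the ratings in the same row order.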

Expected Results

Original dataset

User Item Rating
1 225 2
1 154 5
1 73 3
1 43 4
1 199 4
1 34 2
1 227 4
1 94 2
1 74 1
1 76 4
1 181 5
1 105 2
1 253 5
1 200 3
1 61 4
1 93 5
1 272 3
1 53 3
1 174 5
1 193 4
1 161 4
1 129 5
1 195 5
1 9 5
1 156 4
1 262 3
1 99 3
1 21 1
1 35 1
1 123 4
1 104 1
1 148 2
1 184 4
1 249 4
1 54 3
1 66 4
1 107 4
1 8 1
1 145 2
1 102 2
1 134 4
1 125 3
1 165 5
1 49 3
1 114 5
1 32 5
1 252 2
1 209 4
1 153 3
1 26 3
1 137 5
1 133 4
1 217 3
1 245 2
1 24 3
2 286 4
2 292 4
2 313 5
2 272 5
2 290 3
2 10 2
2 312 3
2 280 3
2 281 3
2 14 4
2 296 3
2 1 4
2 279 4
3 332 1
3 339 3
3 350 3
3 319 2
3 352 2
3 260 4
3 336 1
3 348 4
3 345 3
3 271 3
3 346 5
4 327 5
4 357 4
4 329 5
4 288 4
4 300 5
5 457 1
5 2 3

Internal IDs of surprise

User Item Rating
0 0 2
0 1 5
0 2 3
0 3 4
0 4 4
0 5 2
0 6 4
0 7 2
0 8 1
0 9 4
0 10 5
0 11 2
0 12 5
0 13 3
0 14 4
0 15 5
0 16 3
0 17 3
0 18 5
0 19 4
0 20 4
0 21 5
0 22 5
0 23 5
0 24 4
0 25 3
0 26 3
0 27 1
0 28 1
0 29 4
0 30 1
0 31 2
0 32 4
0 33 4
0 34 3
0 35 4
0 36 4
0 37 1
0 38 2
0 39 2
0 40 4
0 41 3
0 42 5
0 43 3
0 44 5
0 45 5
0 46 2
0 47 4
0 48 3
0 49 3
0 50 5
0 51 4
0 52 3
0 53 2
0 54 3
1 369 4
1 533 5
1 503 3
1 451 1
1 239 4
1 314 4
1 110 4
1 956 4
1 714 4
1 134 4
1 674 4
1 227 5
1 471 1
2 180 5
2 382 5
2 264 4
2 213 3
2 517 1
2 86 1
2 351 5
2 162 5
2 272 2
2 410 4
2 822 2
3 1328 1
3 401 5
3 807 3
3 84 3
3 1074 5
4 415 5
4 589 4

Actual Results

Original dataset

User Item Rating
1 225 2
1 154 5
1 73 3
1 43 4
1 199 4
1 34 2
1 227 4
1 94 2
1 74 1
1 76 4
1 181 5
1 105 2
1 253 5
1 200 3
1 61 4
1 93 5
1 272 3
1 53 3
1 174 5
1 193 4
1 161 4
1 129 5
1 195 5
1 9 5
1 156 4
1 262 3
1 99 3
1 21 1
1 35 1
1 123 4
1 104 1
1 148 2
1 184 4
1 249 4
1 54 3
1 66 4
1 107 4
1 8 1
1 145 2
1 102 2
1 134 4
1 125 3
1 165 5
1 49 3
1 114 5
1 32 5
1 252 2
1 209 4
1 153 3
1 26 3
1 137 5
1 133 4
1 217 3
1 245 2
1 24 3
2 286 4
2 292 4
2 313 5
2 272 5
2 290 3
2 10 2
2 312 3
2 280 3
2 281 3
2 14 4
2 296 3
2 1 4
2 279 4
3 332 1
3 339 3
3 350 3
3 319 2
3 352 2
3 260 4
3 336 1
3 348 4
3 345 3
3 271 3
3 346 5
4 327 5
4 357 4
4 329 5
4 288 4
4 300 5
5 457 1
5 2 3

Internal IDs of surprise

User Item Rating
0 0 2
0 1 5
0 2 3
0 3 4
0 4 4
0 5 2
0 6 4
0 7 2
0 8 1
0 9 4
0 10 5
0 11 2
0 12 5
0 13 3
0 14 4
0 15 5
0 16 3
0 17 3
0 18 5
0 19 4
0 20 4
0 21 5
0 22 5
0 23 5
0 24 4
0 25 3
0 26 3
0 27 1
0 28 1
0 29 4
0 30 1
0 31 2
0 32 4
0 33 4
0 34 3
0 35 4
0 36 4
0 37 1
0 38 2
0 39 2
0 40 4
0 41 3
0 42 5
0 43 3
0 44 5
0 45 5
0 46 2
0 47 4
0 48 3
0 49 3
0 50 5
0 51 4
0 52 3
0 53 2
0 54 3
0 369 4
0 533 5
0 503 3
0 451 1
0 239 4
0 314 4
0 110 4
0 956 4
0 714 4
0 134 4
0 674 4
0 227 5
0 471 1
0 180 5
0 382 5
0 264 4
0 213 3
0 517 1
0 86 1
0 351 5
0 162 5
0 272 2
0 410 4
0 822 2
0 1328 1
0 401 5
0 807 3
0 84 3
0 1074 5
0 415 5
0 589 4
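When reading the two tables above side by side, one thing worth ruling out is row order. The sketch below is a toy model, not Surprise's code: it assumes inner IDs are handed out in order of first appearance and that the export lists ratings grouped by inner user ID. Neither assumption is confirmed in this report, and the function name build_inner_ratings and the toy rows are hypothetical. Under those assumptions, a file that is not sorted by user produces an export whose rows do not line up with the file's rows, even though every raw ID maps to exactly one inner ID.

```python
from collections import defaultdict


def build_inner_ratings(raw_rows):
    """Assign inner IDs by first appearance, then list ratings grouped by user.

    ASSUMPTION: this only models the behaviour described in the lead-in;
    it is not taken from the Surprise source.
    """
    raw_to_inner_user = {}
    raw_to_inner_item = {}
    ratings_by_user = defaultdict(list)

    for raw_uid, raw_iid, rating in raw_rows:
        # A raw ID seen for the first time gets the next free inner ID.
        inner_uid = raw_to_inner_user.setdefault(raw_uid, len(raw_to_inner_user))
        inner_iid = raw_to_inner_item.setdefault(raw_iid, len(raw_to_inner_item))
        ratings_by_user[inner_uid].append((inner_iid, rating))

    # Emit all ratings of inner user 0, then inner user 1, and so on.
    return [
        (inner_uid, inner_iid, rating)
        for inner_uid, user_ratings in ratings_by_user.items()
        for inner_iid, rating in user_ratings
    ]


# Toy rows (hypothetical): userA rates again after userB's first rating.
toy_rows = [
    ("userA", "itemX", 2.0),
    ("userB", "itemY", 4.0),
    ("userA", "itemZ", 4.0),
]

print(build_inner_ratings(toy_rows))
# [(0, 0, 2.0), (0, 2, 4.0), (1, 1, 4.0)]
```

In the toy output the second exported row belongs to userA while the second input row belongs to userB, so a row-by-row comparison would report a mismatch that is not a mapping error. Whether this applies to Train.txt depends on whether that file is sorted by user beyond the rows shown here.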


Versions

Windows-10-10.0.22621-SP0
Python 3.8.3 (default, Jul 2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
surprise 1.1.3