Hi,
Description
The mapping of raw user IDs to internal IDs is incorrect when the dataset contains more than 25,000 rows. I tried loading the ratings both from a file and from a DataFrame, and the user ID mapping is wrong in both cases. I tested several datasets. In the code below, after saving the training set with the internal IDs to SupriseTrainingSet.csv, I compare Train.txt to SupriseTrainingSet.csv.
Steps/Code to Reproduce
from surprise import Dataset, KNNBasic, Reader
import pandas as pd
import csv

# files_dir and folder are path prefixes defined outside this snippet
train_file = files_dir + folder + "Train.txt"
reader = Reader(line_format="user item rating", sep="\t")
data = Dataset.load_from_file(train_file, reader=reader)
trainset = data.build_full_trainset()  # creates the training set from the whole dataset

with open(files_dir + folder + "SupriseTrainingSet.csv", 'w', newline='') as file:
    writer = csv.writer(file)
    # write each row of data to the CSV file
    # (the loop is missing from the post; assumed here to iterate over trainset.all_ratings())
    for uid, iid, rating in trainset.all_ratings():
        writer.writerow([uid, iid, rating])

algo = KNNBasic()
algo.fit(trainset)
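A line-by-line comparison of Train.txt and the exported file only works if both files list the ratings in the same order. The sketch below is one possible order-independent check, written in plain Python so it runs without the library. It assumes that Surprise assigns internal IDs in order of first appearance and that the export may group rows by user; the helper names (`first_seen_ids`, `to_inner_rows`, `same_ratings`) and the toy rows are hypothetical and not part of Surprise.

```python
from collections import Counter


def first_seen_ids(values):
    """Map each raw ID to 0, 1, 2, ... in order of first appearance (assumed scheme)."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return mapping


def to_inner_rows(raw_rows):
    """Translate (raw_user, raw_item, rating) rows into inner-ID rows."""
    users = first_seen_ids(u for u, _, _ in raw_rows)
    items = first_seen_ids(i for _, i, _ in raw_rows)
    return [(users[u], items[i], r) for u, i, r in raw_rows]


def same_ratings(raw_rows, exported_rows):
    """Compare both row collections as multisets, ignoring row order."""
    return Counter(to_inner_rows(raw_rows)) == Counter(exported_rows)


# Toy data, made up for illustration: user "1" has a rating after user "2".
raw = [("1", "225", 2.0), ("2", "286", 4.0), ("1", "154", 5.0)]
# The same ratings, exported grouped by user instead of in file order.
exported = [(0, 0, 2.0), (0, 2, 5.0), (1, 1, 4.0)]

print(same_ratings(raw, exported))
```

If such a check passes on the real files, the two files contain the same ratings and differ only in row order; if it fails, the mismatching rows point at the IDs that were mapped differently.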
Expected Results
Original dataset
User Item Rating
1 225 2
1 154 5
1 73 3
1 43 4
1 199 4
1 34 2
1 227 4
1 94 2
1 74 1
1 76 4
1 181 5
1 105 2
1 253 5
1 200 3
1 61 4
1 93 5
1 272 3
1 53 3
1 174 5
1 193 4
1 161 4
1 129 5
1 195 5
1 9 5
1 156 4
1 262 3
1 99 3
1 21 1
1 35 1
1 123 4
1 104 1
1 148 2
1 184 4
1 249 4
1 54 3
1 66 4
1 107 4
1 8 1
1 145 2
1 102 2
1 134 4
1 125 3
1 165 5
1 49 3
1 114 5
1 32 5
1 252 2
1 209 4
1 153 3
1 26 3
1 137 5
1 133 4
1 217 3
1 245 2
1 24 3
2 286 4
2 292 4
2 313 5
2 272 5
2 290 3
2 10 2
2 312 3
2 280 3
2 281 3
2 14 4
2 296 3
2 1 4
2 279 4
3 332 1
3 339 3
3 350 3
3 319 2
3 352 2
3 260 4
3 336 1
3 348 4
3 345 3
3 271 3
3 346 5
4 327 5
4 357 4
4 329 5
4 288 4
4 300 5
5 457 1
5 2 3
Internal IDs of surprise
User Item Rating
0 0 2
0 1 5
0 2 3
0 3 4
0 4 4
0 5 2
0 6 4
0 7 2
0 8 1
0 9 4
0 10 5
0 11 2
0 12 5
0 13 3
0 14 4
0 15 5
0 16 3
0 17 3
0 18 5
0 19 4
0 20 4
0 21 5
0 22 5
0 23 5
0 24 4
0 25 3
0 26 3
0 27 1
0 28 1
0 29 4
0 30 1
0 31 2
0 32 4
0 33 4
0 34 3
0 35 4
0 36 4
0 37 1
0 38 2
0 39 2
0 40 4
0 41 3
0 42 5
0 43 3
0 44 5
0 45 5
0 46 2
0 47 4
0 48 3
0 49 3
0 50 5
0 51 4
0 52 3
0 53 2
0 54 3
1 369 4
1 533 5
1 503 3
1 451 1
1 239 4
1 314 4
1 110 4
1 956 4
1 714 4
1 134 4
1 674 4
1 227 5
1 471 1
2 180 5
2 382 5
2 264 4
2 213 3
2 517 1
2 86 1
2 351 5
2 162 5
2 272 2
2 410 4
2 822 2
3 1328 1
3 401 5
3 807 3
3 84 3
3 1074 5
4 415 5
4 589 4
Actual Results
Original dataset
User Item Rating
1 225 2
1 154 5
1 73 3
1 43 4
1 199 4
1 34 2
1 227 4
1 94 2
1 74 1
1 76 4
1 181 5
1 105 2
1 253 5
1 200 3
1 61 4
1 93 5
1 272 3
1 53 3
1 174 5
1 193 4
1 161 4
1 129 5
1 195 5
1 9 5
1 156 4
1 262 3
1 99 3
1 21 1
1 35 1
1 123 4
1 104 1
1 148 2
1 184 4
1 249 4
1 54 3
1 66 4
1 107 4
1 8 1
1 145 2
1 102 2
1 134 4
1 125 3
1 165 5
1 49 3
1 114 5
1 32 5
1 252 2
1 209 4
1 153 3
1 26 3
1 137 5
1 133 4
1 217 3
1 245 2
1 24 3
2 286 4
2 292 4
2 313 5
2 272 5
2 290 3
2 10 2
2 312 3
2 280 3
2 281 3
2 14 4
2 296 3
2 1 4
2 279 4
3 332 1
3 339 3
3 350 3
3 319 2
3 352 2
3 260 4
3 336 1
3 348 4
3 345 3
3 271 3
3 346 5
4 327 5
4 357 4
4 329 5
4 288 4
4 300 5
5 457 1
5 2 3
Internal IDs of surprise
User Item Rating
0 0 2
0 1 5
0 2 3
0 3 4
0 4 4
0 5 2
0 6 4
0 7 2
0 8 1
0 9 4
0 10 5
0 11 2
0 12 5
0 13 3
0 14 4
0 15 5
0 16 3
0 17 3
0 18 5
0 19 4
0 20 4
0 21 5
0 22 5
0 23 5
0 24 4
0 25 3
0 26 3
0 27 1
0 28 1
0 29 4
0 30 1
0 31 2
0 32 4
0 33 4
0 34 3
0 35 4
0 36 4
0 37 1
0 38 2
0 39 2
0 40 4
0 41 3
0 42 5
0 43 3
0 44 5
0 45 5
0 46 2
0 47 4
0 48 3
0 49 3
0 50 5
0 51 4
0 52 3
0 53 2
0 54 3
0 369 4
0 533 5
0 503 3
0 451 1
0 239 4
0 314 4
0 110 4
0 956 4
0 714 4
0 134 4
0 674 4
0 227 5
0 471 1
0 180 5
0 382 5
0 264 4
0 213 3
0 517 1
0 86 1
0 351 5
0 162 5
0 272 2
0 410 4
0 822 2
0 1328 1
0 401 5
0 807 3
0 84 3
0 1074 5
0 415 5
0 589 4
Versions
Windows-10-10.0.22621-SP0
Python 3.8.3 (default, Jul 2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
surprise 1.1.3