PacktPublishing / Learn-Amazon-SageMaker

Learn Amazon SageMaker
MIT License

training a recommender: train-test split the users rather than the ratings? #10

Open dipetkov opened 3 years ago

dipetkov commented 3 years ago

Section "Training factorization machines with pipe mode" in ch. 9 demonstrates how to train a recommender with pipe mode.

The train-test split is completely random: the unit of randomization is the individual rating, not the user. This means that most users will contribute ratings to both the training and the test subsets.

Isn't it best practice to split on users instead, so that each user is either completely in the training subset or completely in the test subset?

This is the relevant code snippet:

X, Y = loadDataset('ml-25m/ratings.csv', num_ratings, num_features)

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.05, random_state=59)
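
For comparison, here is a minimal sketch of what a user-level split could look like. It is not code from the book: the `split_by_user` helper and the `user_ids` list (one user ID per rating row) are hypothetical, and it uses only the standard library so it runs on its own. scikit-learn's `GroupShuffleSplit` offers the same kind of group-wise split when the user IDs are passed as `groups`.

```python
import random

def split_by_user(user_ids, test_size=0.05, seed=59):
    """Return (train_idx, test_idx) row indices so that every user
    falls entirely in one subset. Hypothetical helper, not from the book."""
    users = sorted(set(user_ids))
    rng = random.Random(seed)
    rng.shuffle(users)
    # test_size is a fraction of users here, not of ratings
    n_test = max(1, round(len(users) * test_size))
    held_out = set(users[:n_test])
    train_idx = [i for i, u in enumerate(user_ids) if u not in held_out]
    test_idx = [i for i, u in enumerate(user_ids) if u in held_out]
    return train_idx, test_idx

# Toy data (assumed): the user ID attached to each rating row
user_ids = [1, 1, 2, 3, 3, 3, 4, 5, 5, 6]
train_idx, test_idx = split_by_user(user_ids, test_size=0.3)

train_users = {user_ids[i] for i in train_idx}
test_users = {user_ids[i] for i in test_idx}
```

The returned indices would then be used to select rows of `X` and `Y`. Note that `test_size` is now a fraction of users, so the fraction of ratings in the test subset depends on how many ratings the held-out users have.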