NicolasHug / Surprise

A Python scikit for building and analyzing recommender systems
http://surpriselib.com
BSD 3-Clause "New" or "Revised" License

Unexpected RMSE Differences in SVD Models with almost the same Training Data #472

Open Gsj49 opened 7 months ago

Gsj49 commented 7 months ago

Description

Issue Summary

I am encountering significantly different RMSE values when evaluating two SVD models built with the Surprise library. The two models are nearly identical in configuration and training data: model_full is trained on the entire dataset, while model_cv is trained on the same dataset minus a single sample.

Steps to Reproduce

  1. Generate artificial datasets train_ratings and test_ratings using a function generate_dataset. The function uses the prediction formula of surprise.prediction_algorithms.SVD to generate the ratings: $r_{ui}=\mu+b_u+b_i+q_i^T p_u$ (a sketch of this helper follows the list).
  2. Train two SVD models:
    • model_full on the entire train_ratings.
    • model_cv on train_ratings minus one sample.
  3. Evaluate both models on test_ratings.
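
Since generate_dataset is my own helper rather than part of Surprise, here is a minimal sketch of what it does; the bias/factor scales and the clipping step are implementation details assumed for illustration:

import numpy as np
import pandas as pd

def generate_dataset(num_users, num_items, num_factors, global_mean,
                     upper_bound, lower_bound, sparsity_ratio, seed):
    """Generate r_ui = mu + b_u + b_i + q_i^T p_u, clipped to the rating bounds."""
    rng = np.random.default_rng(seed)
    b_u = rng.normal(0, 0.5, num_users)                 # user biases
    b_i = rng.normal(0, 0.5, num_items)                 # item biases
    p = rng.normal(0, 0.5, (num_users, num_factors))    # user factors
    q = rng.normal(0, 0.5, (num_items, num_factors))    # item factors
    ratings = global_mean + b_u[:, None] + b_i[None, :] + p @ q.T
    ratings = np.clip(ratings, lower_bound, upper_bound)
    df = pd.DataFrame(
        [(u, i, ratings[u, i]) for u in range(num_users) for i in range(num_items)],
        columns=['user_id', 'item_id', 'rating'],
    )
    # sparsity_ratio is the fraction of all ratings held out for testing.
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_train = int(len(df) * (1 - sparsity_ratio))
    return df.iloc[:n_train], df.iloc[n_train:], (b_u, b_i, p, q)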

python code

import surprise
from surprise import Dataset, Reader, SVD, accuracy

# sparsity_ratio=0.8: train_ratings gets 400*400*0.2 of the user-item
# ratings and test_ratings gets the remaining 400*400*0.8.
train_ratings, test_ratings, _ = generate_dataset(num_users=400,
                                                  num_items=400,
                                                  num_factors=7,
                                                  global_mean=3.5,
                                                  upper_bound=5,
                                                  lower_bound=1,
                                                  sparsity_ratio=0.8,
                                                  seed=0)

# train_ratings, test_ratings are both dataframes that consist of 3 columns: 'user_id', 'item_id', and 'rating'.

# test() expects a list of (user_id, item_id, rating) tuples
testset = [tuple(row) for row in test_ratings.itertuples(index=False)]

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(train_ratings, reader)
# With test_size this small, valset_cv contains a single sample.
trainset_cv, valset_cv = surprise.model_selection.train_test_split(data, test_size=0.0000001)
trainset_full = data.build_full_trainset()

model_cv = SVD(n_factors=7, random_state=0, reg_all=0)
model_cv.fit(trainset_cv)
pred_by_cvmodel = model_cv.test(testset)
accuracy.rmse(pred_by_cvmodel, verbose=True)

model_full = SVD(n_factors=7, random_state=0, reg_all=0)
model_full.fit(trainset_full)
pred_by_fullmodel = model_full.test(testset)
accuracy.rmse(pred_by_fullmodel, verbose=True)

output

RMSE: 1.2256
RMSE: 0.6395
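
To confirm that the two trainsets really do differ by only the single held-out rating, they can be compared directly. This is a quick sketch using Trainset's public n_ratings / all_ratings / to_raw_uid / to_raw_iid; the expected counts are assumptions that follow from the split sizes:

# 400*400*0.2 = 32000 training ratings; train_test_split holds out one.
print(trainset_full.n_ratings)  # expected: 32000
print(trainset_cv.n_ratings)    # expected: 31999

# Compare contents via raw ids (all_ratings() yields inner ids).
full = {(trainset_full.to_raw_uid(u), trainset_full.to_raw_iid(i), r)
        for (u, i, r) in trainset_full.all_ratings()}
cv = {(trainset_cv.to_raw_uid(u), trainset_cv.to_raw_iid(i), r)
      for (u, i, r) in trainset_cv.all_ratings()}
print(len(full - cv))  # expected: 1, just the held-out sample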

The RMSE values are significantly different, and I cannot figure out why. I have tried other cross-validation iterators such as surprise.model_selection.KFold and observed the same behavior (see the sketch below). Could there be a problem with the way the cross-validation iterators handle the training data?
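
For reference, this is roughly how I ran the KFold variant (a sketch; the fold count is arbitrary, and each fold removes a full 20% of the ratings rather than a single sample):

from surprise.model_selection import KFold

kf = KFold(n_splits=5, random_state=0)
for trainset_fold, _ in kf.split(data):
    model = SVD(n_factors=7, random_state=0, reg_all=0)
    model.fit(trainset_fold)
    # Each fold scores like model_cv (~1.2 on the simulated data),
    # never like model_full.
    accuracy.rmse(model.test(testset), verbose=True)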

This issue can also be reproduced with the MovieLens 100K dataset instead of simulated data, although the RMSE difference is not as large.

python code

import pandas as pd
from sklearn.model_selection import train_test_split  # note: sklearn's split, not Surprise's

data_file_path = './data/ml-100k/u.data'
ratings = pd.read_csv(data_file_path, sep='\t',
                      names=['user_id', 'item_id', 'rating', 'timestamp'])

train_ratings, test_ratings = train_test_split(ratings.iloc[:, :3], test_size=0.2, random_state=0)

testset = [tuple(row) for row in test_ratings.itertuples(index=False)]

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(train_ratings, reader)
# Again, valset_cv holds a single rating.
trainset_cv, valset_cv = surprise.model_selection.train_test_split(data, test_size=0.000001)
trainset_full = data.build_full_trainset()

model_cv = SVD(n_factors=100, random_state=0, reg_all=0)
model_cv.fit(trainset_cv)
pred_by_cvmodel = model_cv.test(testset)
accuracy.rmse(pred_by_cvmodel, verbose=True)

model_full = SVD(n_factors=100, random_state=0, reg_all=0)
model_full.fit(trainset_full)
pred_by_fullmodel = model_full.test(testset)
accuracy.rmse(pred_by_fullmodel, verbose=True)

output

RMSE: 0.9550
RMSE: 0.9516

Any suggestions or explanations for this behavior would be greatly appreciated!