NicolasHug / Surprise

A Python scikit for building and analyzing recommender systems
http://surpriselib.com
BSD 3-Clause "New" or "Revised" License

Accuracy estimation (doc instructions vs train_test_split) #360

Closed cjjohanson closed 4 years ago

cjjohanson commented 4 years ago

Description

I'm looking for info about how to implement a training set and a test set but I'm a little confused/curious. Oh, let me preface this with "I am a data science student and am fully aware that I could be a total moron by asking this."

...anyway, according to this link (https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-save-some-data-for-unbiased-accuracy-estimation), you use the following bit of code to shuffle and then split your data:

```python
# shuffle ratings if you want
random.shuffle(raw_ratings)

# A = 90% of the data, B = 10% of the data
threshold = int(.9 * len(raw_ratings))
A_raw_ratings = raw_ratings[:threshold]
B_raw_ratings = raw_ratings[threshold:]
```

In my limited experience, I've done that with `train_test_split` in sklearn, which it looks like you also have in the package at this link (https://surprise.readthedocs.io/en/stable/model_selection.html?highlight=train%20test#surprise.model_selection.split.train_test_split).
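For reference, here's the kind of thing I mean, a minimal sketch using the package's `train_test_split` (the built-in ml-100k dataset and SVD are just placeholders):

```python
from surprise import Dataset, SVD, accuracy
from surprise.model_selection import train_test_split

data = Dataset.load_builtin('ml-100k')

# Hold out 25% of the ratings as a test set.
trainset, testset = train_test_split(data, test_size=0.25, random_state=42)

algo = SVD()
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)
```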

What is the difference between using the code in the snippet and using the function that I provided the link to? They seem to be the same, but my gut tells me that they could very well not be since you have these two different methods in the docs.

Again, apologies if that wastes anyone's time, just trying to figure out the best way to use the library.

Thanks in advance, CJ

cjjohanson commented 4 years ago

Oh, and just so you know, I did actually look this up beforehand; the following function:

```
surprise.model_selection.validation.cross_validate(algo, data, measures=[u'rmse', u'mae'], cv=None, return_train_measures=False, n_jobs=1, pre_dispatch=u'2*n_jobs', verbose=False)
```

does mention `return_train_measures`. I'm looking to test on the test set at the very end, similar to what you all did in your "How to Save Some Data for Unbiased Accuracy Estimation." I don't know how `return_train_measures` would solve that, according to the docs, if at all.

NicolasHug commented 4 years ago

A common rule for data splitting is that you should have separate data for model training, model selection, and model evaluation. Usually, people use `train_data, test_data = train_test_split(...)` to either do:

- training + model evaluation (no model selection), or
- training + model selection (e.g. with cross-validation on `train_data`).

If you want to do all 3, you will need to either:

- split the data into 3 disjoint sets (train, val, test), or
- split it into 2 sets and do cross-validation on the training set for model selection, keeping the test set untouched for the final evaluation.

(Note: sometimes the usage of the terms "test" and "val" is reversed, it's not super important, what matters is that they're disjoint sets)

To have 3 separate datasets in sklearn, you can just call `train_test_split` 2 times. Unfortunately that's not possible in surprise, because what `train_test_split` returns isn't the same data structure that it accepts (things are easier in sklearn, where everything is a numpy array, but not in surprise). Other than that, they do the same thing.
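For example, the double split in sklearn would look roughly like this (toy arrays, just to illustrate the idea):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # toy features
y = np.arange(50)                  # toy targets

# First split: hold out 20% as the final test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Second split: carve a validation set out of the remaining data.
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
```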

The entry in the FAQ that you mention is kind of a hack so that one can have 3 different sets: train, test, and val (again, the way to do this in sklearn would be to just call `train_test_split` 2 times, as in the sketch above).
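To make the "hack" concrete, here's a rough sketch of the full workflow from that FAQ entry. `A_raw_ratings` and `B_raw_ratings` are the ones from the snippet you quoted, `data` is the `Dataset` object the raw ratings came from, and the SVD algorithm and its parameter grid are just placeholders:

```python
from surprise import SVD, accuracy
from surprise.model_selection import GridSearchCV

# Model selection: cross-validation on set A only.
data.raw_ratings = A_raw_ratings  # data is now set A
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
grid_search = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
grid_search.fit(data)
algo = grid_search.best_estimator['rmse']

# Retrain the selected model on the whole of A.
trainset = data.build_full_trainset()
algo.fit(trainset)

# Final, unbiased accuracy estimate on the held-out set B.
testset = data.construct_testset(B_raw_ratings)
predictions = algo.test(testset)
accuracy.rmse(predictions)
```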

> I don't know how `return_train_measures` would solve that, according to the docs, if at all.

You should never use the training scores as an indication of model performance on new data, and you should not decide which model is best based on their training score either (this is why we need 3 different sets (or 2 + CV)). The training scores are useful to inspect how the model is behaving, but that's pretty much it.
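For completeness, `return_train_measures` just reports the per-fold training scores alongside the test scores, e.g. (a sketch, with ml-100k and SVD again as placeholders):

```python
from surprise import Dataset, SVD
from surprise.model_selection import cross_validate

data = Dataset.load_builtin('ml-100k')
results = cross_validate(SVD(), data, measures=['RMSE', 'MAE'], cv=5,
                         return_train_measures=True, verbose=True)

# One value per fold; useful for inspecting over/under-fitting,
# not for estimating performance on new data.
print(results['train_rmse'])
print(results['test_rmse'])
```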

Hope this is clear!

cjjohanson commented 4 years ago

You! Rock!

That is super clear and thank you for the insanely fast response!

Have a great day!