CH02: sampling test_set

SiddharthChillale commented 5 years ago

I have a slight doubt about why we are sampling the test set and not the training set. Wasn't the point of sampling to get a training set that is representative of the cases we want to generalize to? (book page 24)

What does it matter if the test_set is sampled? After all, we are only testing our model on it. (Please correct me if I'm wrong.)

Basically, in this code, why are we using the sampled test_set and not the sampled train_set for the comparison in compare_props?

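import pandas as pd
from sklearn.model_selection import train_test_split

# housing and strat_test_set are assumed to be defined earlier in the notebook.
def income_cat_proportions(data):
    # Fraction of rows that fall into each income category
    return data["income_cat"].value_counts() / len(data)

# Purely random split, for comparison with the stratified one
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

# Income-category proportions: full dataset vs. stratified vs. random test set
compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()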

ageron commented 5 years ago

Hi @SiddharthChillale ,

Thanks for your question, and sorry for the late response (I was on vacation).

Here's the purpose of the different sets:

training set: train the best model possible, i.e., the model that will generalize best.
validation set (also called the dev set): evaluate different models to choose the one that is most likely to generalize well.
test set: evaluate the final model to estimate its generalization error.

Both the validation set and the test set should be as close as possible to the data that the model will see in production. You can think of them as setting the target: if they are not well aligned with the production data, then it is unlikely that your model will perform well when it's launched to production.

Of course, if the training data is not well aligned with the production data, then it's unlikely that your model will manage to get good performance on the validation set and the test set. So the training data should also be as close as possible to the production data. However, it's slightly less important, as you can always tweak your model and/or the training data many times until you get a satisfying model (as evaluated on the dev set and test set).
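As a rough illustration of the three sets described above, here is a minimal sketch that carves a dataset into training, validation and test sets with two calls to Scikit-Learn's train_test_split. The arrays X and y are placeholder data and the 60/20/20 ratio is only an assumption for the example; neither comes from the housing notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption for illustration): 1,000 samples, 1 feature.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# First hold out the test set: it is only used once, to estimate the
# generalization error of the final model.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split the remainder into a training set and a validation (dev) set,
# which is used to compare candidate models. 0.25 of 80% is 20% of the total.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

The test set is split off first so that nothing done during model selection can touch it.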

If you want to learn more about the discrepancy between the training data and the production data, and how you can measure it (using a new held-out set called the train-dev set), I encourage you to check out this deeplearning.ai video by Andrew Ng (and the following videos in the series): https://www.youtube.com/watch?v=1waHlpKiNyY

Coming back to your questions:

it is most important to ensure that the dev set and test set are well aligned with the production data, so in the housing example this may mean using stratified sampling for some features. For example, if we assume that the median income is very important for predictions, then we may want to use stratified sampling to ensure that the test set has a distribution of incomes very close to the full dataset's distribution (and hopefully very close to the production data's distribution).
ensuring that the training set is similar to the production data is also important, but somewhat less so. If it is not well aligned, we will see lower performance on the validation set and the test set, so we will know that we need to fix something. But if the dev and test sets are not representative of the production data, then we may not even know we have a problem.
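To make the comparison from the question concrete, here is a minimal, self-contained sketch. It assumes a synthetic stand-in for the housing data (random log-normal incomes, not the real dataset) and bin edges chosen only for illustration. It reuses the income_cat_proportions helper from the question and uses the stratify argument of train_test_split to build the stratified test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (an assumption, not the real dataset).
rng = np.random.default_rng(42)
housing = pd.DataFrame(
    {"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=20_000)}
)

# Bucket incomes into five categories (bin edges are illustrative).
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

def income_cat_proportions(data):
    # Fraction of rows that fall into each income category
    return data["income_cat"].value_counts() / len(data)

# Purely random split.
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

# Stratified split: each income category keeps (almost exactly) its overall share.
strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, random_state=42, stratify=housing["income_cat"]
)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()

# Absolute deviation of each test set from the overall proportions.
compare_props["Strat. abs error"] = (
    compare_props["Stratified"] - compare_props["Overall"]
).abs()
compare_props["Rand. abs error"] = (
    compare_props["Random"] - compare_props["Overall"]
).abs()

print(compare_props)
```

With this setup the stratified test set should track the overall proportions to within a sample or so per category, while the random test set is subject to ordinary sampling noise. That is why the comparison is made on the test set: it is the set whose alignment with the full data we most want to verify.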

I hope this is clear?

SiddharthChillale commented 4 years ago

The point was made clear. Thank you. I read your answer a month after it was posted but forgot to reply. Closing the issue.

ageron / handson-ml

CH02: sampling test_set #459