ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0

CH02: sampling test_set #459

SiddharthChillale closed this issue 4 years ago

SiddharthChillale commented 5 years ago

I have a slight doubt about why we are sampling the test set and not the training set. Wasn't the point of sampling to get a training set that is representative of the cases we want to generalize to? (book page 24)

Basically, in this code, why are we using the sampled test_set and not the sampled train_set for the comparison in compare_props?

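import pandas as pd
from sklearn.model_selection import train_test_split

# `housing` and `strat_test_set` are defined earlier in the chapter's notebook.

def income_cat_proportions(data):
    # Fraction of rows that fall in each income category
    return data["income_cat"].value_counts() / len(data)

# Purely random split, for comparison with the stratified one
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()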

ageron commented 5 years ago

Hi @SiddharthChillale,

Thanks for your question, and sorry for the late response (I was on vacation).

Here's the purpose of the different sets:

Both the validation set and the test set should be as close as possible to the data that the model will see in production. You can think of them as setting the target: if they are not well aligned with the production data, then it is unlikely that your model will perform well when it's launched to production.

Of course, if the training data is not well aligned with the production data, then it's unlikely that your model will manage to get good performance on the validation set and the test set. So the training data should also be as close as possible to the production data. However, this is slightly less important, as you can always tweak your model and/or the training data many times until you get a satisfactory model (as evaluated on the validation set, also called the dev set, and on the test set).
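To make this concrete, here is a minimal, dependency-free sketch (not the book's code) of a stratified split on synthetic data. The category sizes are made up, and `proportions` and `stratified_split` are hypothetical helpers standing in for `income_cat_proportions` and scikit-learn's stratified splitting. It suggests that when the split is stratified, both the training set and the test set keep roughly the overall category proportions, so checking one of them does not leave the other unrepresentative.

```python
import random
from collections import Counter, defaultdict


def proportions(labels):
    """Share of each category, analogous to value_counts() / len(data)."""
    counts = Counter(labels)
    return {cat: counts[cat] / len(labels) for cat in sorted(counts)}


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx), drawing `test_size` of every category."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for i, cat in enumerate(labels):
        by_cat[cat].append(i)

    train_idx, test_idx = [], []
    for cat in sorted(by_cat):
        idx = by_cat[cat]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


# Synthetic stand-in for housing["income_cat"] (made-up category sizes).
income_cat = [1] * 400 + [2] * 3200 + [3] * 3500 + [4] * 1800 + [5] * 1100

train_idx, test_idx = stratified_split(income_cat, test_size=0.2, seed=42)

overall = proportions(income_cat)
strat_train = proportions([income_cat[i] for i in train_idx])
strat_test = proportions([income_cat[i] for i in test_idx])

# One row per category: overall, stratified train, stratified test.
for cat in overall:
    print(cat, round(overall[cat], 3),
          round(strat_train[cat], 3), round(strat_test[cat], 3))
```

With real data, the same check could be run on both `strat_train_set` and `strat_test_set` by passing each to `income_cat_proportions`.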

If you want to learn more about the discrepancy between the training data and the production data, and how you can measure it (using a new held-out set called the train-dev set), I encourage you to check out this deeplearning.ai video by Andrew Ng (and the following videos in the series): https://www.youtube.com/watch?v=1waHlpKiNyY
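As a rough illustration of how a train-dev set is typically used: it is held out from the training data (same distribution, but not trained on), so comparing error rates across the training, train-dev, and dev sets separates overfitting from data mismatch. The sketch below assumes that framing; `diagnose` is a hypothetical helper and the error rates are invented numbers, not results from the book.

```python
def diagnose(train_err, train_dev_err, dev_err):
    """Split the train-to-dev error gap into two parts.

    variance_gap:      train -> train-dev (same distribution, unseen data)
    data_mismatch_gap: train-dev -> dev (different distribution)
    """
    return {
        "variance_gap": train_dev_err - train_err,
        "data_mismatch_gap": dev_err - train_dev_err,
    }


# Hypothetical error rates, for illustration only.
gaps = diagnose(train_err=0.01, train_dev_err=0.015, dev_err=0.10)

# The larger gap hints at where the main problem lies.
main_issue = max(gaps, key=gaps.get)
print(gaps, main_issue)
```

In this made-up case the larger gap is between the train-dev and dev sets, which would point to a mismatch between the training data and the production-like data rather than to overfitting.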

Coming back to your questions:

I hope this is clear?

SiddharthChillale commented 4 years ago

The point was made clear. Thank you. I read your answer a month after it was posted but forgot to reply. Closing the issue.