Chapter 2 Clarification needed on "Create a Test Set"

VeenaKGit commented 4 years ago

@ageron The book states on page 55: "Well, this works, but it is not perfect: if you run the program again, it will generate a different test set! Over time, you (or your Machine Learning algorithms) will get to see the whole dataset, which is what you want to avoid."

Question: Does a Machine Learning algorithm retain what it learned from a previous execution? Follow-up questions: If yes, where is it retained, and how can we access or erase it? If no, then how will Machine Learning algorithms see the whole dataset over time? Please elaborate.

Thank you.

VeenaKGit commented 4 years ago

Hi Aurélien (or anyone who is seeing this message). I'm a beginner trying to understand the concept. Any help is appreciated.

Praful932 commented 4 years ago

Hello @VeenaKGit, let me try to answer your question. When you build a machine learning model, you often want to divide your data into three parts: a train set, a validation set, and a test set (you will come across this later). Following the author's example, think of the test set as a stand-in for the data the model would encounter in production. So we want a dataset that the model never sees during training, and after finalizing the model we test it on this test set.

As for your question: yes and no. Yes: the model learns over time by adjusting its weights to detect patterns in the data, and you can access its weights simply by inspecting the weight matrix it has learned. No: you might have read about something called online learning, where the model tweaks its patterns/weights on the fly. If your model has the ability to learn on the fly, it can see the whole dataset over time.

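Also, regarding the statement "if you run the program again, it will generate a different test set": you can fix the random seed to get the same permutation each time, for example by calling np.random.seed(some_number) before np.random.permutation, or by passing random_state=some_number to scikit-learn's train_test_split.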
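Here is a minimal sketch of that idea using NumPy only. The helper name split_indices and the sizes are made up for illustration (this is not the book's exact function), and it uses np.random.default_rng rather than the global np.random.seed, which is an assumption on my part:

```python
import numpy as np

def split_indices(n_samples, test_ratio, seed=None):
    """Return (train_indices, test_indices) for a dataset with n_samples rows.

    Hypothetical helper for illustration: seed=None draws fresh randomness on
    every call, while a fixed integer seed reproduces the same shuffle.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    test_size = int(n_samples * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]

# Unseeded: two runs will generally select different test rows.
train_a, test_a = split_indices(1000, 0.2)
train_b, test_b = split_indices(1000, 0.2)

# Seeded: every run selects exactly the same test rows.
train_c, test_c = split_indices(1000, 0.2, seed=42)
train_d, test_d = split_indices(1000, 0.2, seed=42)

print(np.array_equal(test_c, test_d))  # True
```

Note that a fixed seed only keeps the split stable as long as the dataset itself does not change in size or order.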

You'll gain more clarity as you read, implement and go through the exercises :)

VeenaKGit commented 4 years ago

Thank you @Praful932, I appreciate your response. I understand and totally agree with dividing the data into train, validation, and test sets. In Chapter 2 the author is talking about a batch learning example. I'm confused about how the algorithm can see the whole dataset over time when using a random train_test_split in batch learning. I totally agree with your point that over time the algorithm can see the whole dataset in online learning. Thanks for pointing that out.

I will assume that the author is talking in general about the problem with a random train_test_split in online learning and move on with the book.

I will still keep the issue open to find out more.

ageron / handson-ml2

Chapter 2 Clarification needed on "Create a Test Set" #294