ageron / handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
Apache License 2.0

[QUESTION] Why are duplicates encouraged in ensemble learning? #503

Closed Kirushikesh closed 2 years ago

Kirushikesh commented 2 years ago

In chapter 7 (Ensemble Learning), in the bagging section, the bootstrap option enables sampling with replacement, i.e., the same instance can appear multiple times in a predictor's training set. So when bootstrap=True a predictor's training set may contain duplicates, but when bootstrap=False it doesn't contain any duplicate instances. In general, machine learning assumes that the training data points are i.i.d. (independent and identically distributed), so as far as I know, during preprocessing one should always remove duplicates from the dataset. But bootstrapping introduces duplicate instances intentionally, right? Is that correct? The author also states that bagging often works better than pasting, which seems contradictory to me. Am I wrong? Please give clarifications, @ageron.

ageron commented 2 years ago

Hi @Kirushikesh ,

Thanks for your message, that's a great question!

Indeed, the training data should be IID, and it's a good idea to remove duplicates. But although bootstrapping will cause duplicates in the training data of each individual tree, different trees will not have exactly the same duplicates. If there are many trees, then every instance will be duplicated roughly the same number of times across the ensemble. So across the whole Random Forest, it's not an issue. Basically, bootstrapping increases the variance (i.e., individual trees are slightly more likely to produce different results), but it does not increase the bias (i.e., on average the trees won't favor any particular instance). The increased variance can often be cancelled out by increasing the number of trees. Overall, bootstrapping often performs better than pasting, and it has the advantage of being faster, since sampling with replacement is much faster than sampling without replacement (unless you sample close to 100% of the data).
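
For concreteness, here is a minimal sketch of the bagging-vs-pasting comparison using scikit-learn's BaggingClassifier (the dataset and hyperparameters below are just illustrative, not taken from the book; the only difference between the two runs is the bootstrap flag):

```python
# Minimal sketch (illustrative dataset and hyperparameters): bagging vs pasting
# differ only in whether each predictor's training subset is drawn with
# replacement (bootstrap=True) or without (bootstrap=False).
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1_000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for bootstrap in (True, False):  # True = bagging, False = pasting
    ensemble = BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=500, max_samples=100,
        bootstrap=bootstrap, n_jobs=-1, random_state=42)
    ensemble.fit(X_train, y_train)
    name = "bagging" if bootstrap else "pasting"
    print(name, ensemble.score(X_test, y_test))
```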

If instead we duplicated random instances in the original training data, then this would increase the bias because all trees would be more likely to be trained on these duplicated instances.

Is this clearer?

ageron commented 2 years ago

I wrote a quick notebook to count how many trees use each instance, assuming there are 500 trees and 10,000 instances and we're using bootstrapping. Here's the result:

[Histogram showing how often each instance is used across the 500 trees]

As you can see, the vast majority of instances are used by 450 to 550 trees, so they all have roughly the same weight. The most used instance is only used 1.4 times more than the least used, and that's the worst-case scenario.
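
In case you want to reproduce a count like this, here is a quick sketch (not the actual notebook; it counts the total number of times each instance is drawn across the 500 bootstrap samples, which gives numbers in the same ballpark as the figure above):

```python
# Sketch: count how many times each of 10,000 instances is drawn across
# 500 bootstrap samples, each of size 10,000 (sampling with replacement).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n_trees, n_instances = 500, 10_000

counts = np.zeros(n_instances, dtype=int)
for _ in range(n_trees):
    sample = rng.integers(0, n_instances, size=n_instances)
    counts += np.bincount(sample, minlength=n_instances)

print(counts.mean())                # ~500 draws per instance on average
print(counts.min(), counts.max())   # max/min ratio typically around 1.4

plt.hist(counts, bins=50)
plt.xlabel("Number of times an instance is sampled across all 500 trees")
plt.ylabel("Number of instances")
plt.show()
```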

Kirushikesh commented 2 years ago

Hello @ageron, thanks for the reply. I am new to these ensemble learning techniques. You said: "Basically, bootstrapping increases the variance (i.e., individual trees are slightly more likely to produce different results), but it does not increase the bias (i.e., on average the trees won't favor any particular instance)." That sentence feels a bit abstract to me: how can you say that, on average, the trees won't favor any particular instance?

For example, in linear regression, complex models tend to overfit, which is easy to observe and understand through an image like this:

[Image: regression fits illustrating overfitting]

There I can observe that the model has low bias but high variance. Likewise, could you give some more explanation or clarity on your statement?

ageron commented 2 years ago

Hi @Kirushikesh ,

Imagine if you pick 100 cats randomly out of 1000, and you give them one treat each. You've favored these 100 cats compared to the 900 others (assuming they all like treats). But if 500 other cat-lovers do the same as you, then, on average, all 1000 cats will get 50 treats each (100/1000 * 500). Some may get 47 and others 53, but that's not a big difference, all the cats will be happy.

That's what I meant by "on average the trees won't favor any particular instance": although each individual tree does favor some instances (since it totally ignores about 37% of the training instances), every instance is "cared for" by roughly the same number of trees in total.

For ensembles to work well, the predictors need to be diverse enough, so that they won't make the same mistakes at the same time. Bootstrapping increases the diversity of trees, which helps the ensemble perform better. But it requires increasing the number of trees to compensate for the increased variance.
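
As a side note, the "ignores about 37%" figure above comes from the fact that, when you draw n samples with replacement from n instances, each instance is left out with probability (1 - 1/n)^n ≈ 1/e ≈ 0.368. A quick numerical check (just a sketch):

```python
# Sketch: fraction of instances left out of a single bootstrap sample of size n.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
sample = rng.integers(0, n, size=n)    # one bootstrap sample (with replacement)
left_out = n - np.unique(sample).size  # instances that were never drawn
print(left_out / n)                    # ~0.368
print((1 - 1 / n) ** n)                # expected fraction: (1 - 1/n)^n -> 1/e
```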

Kirushikesh commented 2 years ago

Oh... Thanks @ageron, I understand it now. You are a legend.

hahampis commented 7 months ago

Hi @ageron ,

First of all, my utmost respect for really caring and answering in such detail. I am sorry for replying to an old, closed topic; I am reading the third edition of the book, and it says this about bagging vs. pasting:

"Bagging introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting; but the extra diversity also means that the predictors end up being less correlated, so the ensemble’s variance is reduced."

But in your answer above you mention

"Basically, bootstrapping increases the variance (i.e., individual trees are slightly more likely to produce different results), but it does not increase the bias (i.e., on average the trees won't favor any particular instance). "

So, now I am really confused about the concepts of bias and variance... :)

I suspect your answer in this issue is made complete with the statement that "it (i.e. the ensemble) requires increasing the number of trees to compensate for the increased variance". However, I am still struggling to intuitively understand how bagging compares with pasting in terms of bias and variance.

Thank you for your awesome work.