aimacode / aima-python

Python implementation of algorithms from Russell and Norvig's "Artificial Intelligence: A Modern Approach"
MIT License
8.06k stars 3.81k forks

[query/suggestion] Choosing sampled data points in Adaboost #1047

Open SiluPanda opened 5 years ago

SiluPanda commented 5 years ago

Are we using sampled data for hypothesis training in AdaBoost? If not, shouldn't we be? By sampled data I mean choosing a set of data points from the training set according to their weights, which increases the probability that a misclassified point is chosen by the next hypothesis. Here is my implementation of AdaBoost, in which I have implemented sampling of the data.

Let me know if this needs to be done.
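The weight-proportional sampling described above could be sketched as follows. This is a minimal illustration, not the linked implementation: `weighted_sample` and its parameters are hypothetical names, and it simply resamples the training set with replacement according to the current boosting weights.

```python
import random

def weighted_sample(examples, weights, k):
    """Draw k examples with replacement, where each example's chance of
    being picked is proportional to its current boosting weight.
    Misclassified points carry higher weights after each round, so they
    are more likely to appear in the next hypothesis's training set."""
    return random.choices(examples, weights=weights, k=k)
```

For example, with weights `[0.1, 0.1, 0.8]` the third example would be drawn roughly 80% of the time, so a hypothesis trained on the resampled set concentrates on the points the previous hypothesis got wrong.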

antmarakis commented 5 years ago

Hmm, this looks very interesting, but I am not sure if the AdaBoost implementation is needed at the moment. For that, only @norvig can respond, but he is very busy at the moment. If you are doing this for GSoC, you can add in your proposal that you will do some work on neural networks. I think that would be a really interesting idea, but as a PR, it would be too large for me to merge.

SiluPanda commented 5 years ago

Yes, thank you, I'll add that to my proposal. Another small query: what exactly is size here? I am looking to patch the infinite loop. Thanks a lot for the response!

antmarakis commented 5 years ago

To be honest, I don't know. The cross-validation pseudocode is not up to date and we don't know what to do.

SiluPanda commented 5 years ago

This is what size means from the book:

In this section we explain how to select among models that are parameterized by size. For example, with polynomials we have size = 1 for linear functions, size = 2 for quadratics, and so on. For decision trees, the size could be the number of nodes in the tree. In all cases we want to find the value of the size parameter that best balances underfitting and overfitting to give the best test set accuracy.

Clearly, size should not go to infinity (as it does in the pseudocode), and its upper limit is model-specific. I guess the best option is to wait for an update to the pseudocode from @norvig sir.
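A bounded version of the size search could look like this. It is only a hedged sketch of what the book's pseudocode seems to intend: `err`, `learner`, `dataset`, and `max_size` are hypothetical stand-ins, with `max_size` replacing the unbounded loop that causes the infinite-loop problem discussed above.

```python
def best_size(learner, dataset, max_size, err):
    """Try size = 1 .. max_size instead of looping forever, and return
    the size whose estimated error (e.g. from cross-validation) is
    lowest. `err(learner, size, dataset)` is a hypothetical routine
    that trains a model of the given size and returns its error."""
    best, best_err = None, float('inf')
    for size in range(1, max_size + 1):
        e = err(learner, size, dataset)
        if e < best_err:
            best, best_err = size, e
    return best
```

The cap `max_size` would be set per model family, for instance the maximum polynomial degree or the maximum number of decision-tree nodes worth considering, which matches the point that the upper limit is model-specific.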