Tanvi141 opened this issue 4 years ago
- For two-class adaboost, which algorithm should we be following?
We can start with the basic Algorithm 1.
- We are having some trouble understanding the paper, especially this line in Algorithm 4:
Will get back to you on this. Until then if your doubt is resolved let me know.
You can propose the API for algorithm 1 here in this issue.
Suppose each data point has n features. Both will use templates. The weights will be initialized to 1/n using the fill method. This is how we plan to store each T, but then each T will not be the same. So the question is: how can we store all the T's together? Also, can we use the STL, like sets and pairs?
Can you please share some fake function calls? For example, how would the end user call the function to run the AdaBoost algorithm, and what would the output be in return?
For the data points: the main function will take as input the dataset x and the classes y. Suppose each x(i) has n features. As we mentioned, x can either be a collection of n vectors plus one more vector y, or it can be a single vector of a struct data type, where the struct contains all the features and the output for a particular data point.
Output (storing all the stumps together): as we mentioned, the output is a collection of stumps, and each may involve a different data type, so we are not sure how to store them all together. We believe each stump can be represented as a struct, but the data types of the elements may differ from stump to stump, since each feature may not be of the same data type.
Hence the fake function call will be like so:
<collection of stumps> compute_adaboost(<collection of all data points in input format>)
Then there can be another function which knows how to interpret the <collection of stumps> and can classify any incoming new data point. The call for this will be:
<class type> predict_class (<collection of stumps>, <the single data point>)
which returns the predicted class.
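To make the two representations and the fake calls above concrete, here is a rough sketch (DataPoint, DatasetColumns and the placeholder stump are just illustrative names, not a settled design):

#include <vector>

// Option 1: a collection of n feature vectors plus one class vector y.
struct DatasetColumns {
    std::vector<std::vector<float>> features; // one vector per feature
    std::vector<int> classes;                 // the output vector y
};

// Option 2: a single vector of structs, one struct per data point.
struct DataPoint {
    std::vector<float> features; // all n features of this example
    int label;                   // its class y
};

// Placeholder for the stump struct discussed below.
struct stump { int feature; float threshold; };

// Hypothetical signatures matching the fake calls above (Option 2 input).
std::vector<stump> compute_adaboost(const std::vector<DataPoint>& data);
int predict_class(const std::vector<stump>& stumps, const DataPoint& example);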
Further discussion of representing the stump: this link states that for a particular weak classifier stump, we need only find the threshold value: https://towardsdatascience.com/understanding-adaboost-2f94f22d5bfe. Consider the following proposed struct for representing the stump.
template <class data_type_feature>
struct stump {
    int feature; // to store what feature the stump was built on
    data_type_feature threshold_value;
};
As I mentioned above, the issue with storing all the stumps together is that each stump is made using a particular feature. For any categorical feature (such as the color of an object) we can use one-hot encoding (https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/). We will clearly need to remove all string type variables and make them numerical, so some kind of pre-processing of the data is required. But even among numerical values there are various data types like float, int etc. So how will we deal with this issue?
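As a rough illustration of that pre-processing step, one-hot encoding a categorical (string) feature into numeric 0/1 columns could look like this (one_hot_encode is a made-up helper, not an existing function):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Expand one categorical column (e.g. color of object) into one 0/1 column
// per distinct category, so that every feature becomes numeric.
std::vector<std::vector<float>> one_hot_encode(const std::vector<std::string>& column)
{
    std::unordered_map<std::string, std::size_t> index;
    for (const auto& value : column)
        index.emplace(value, index.size()); // new categories get the next index

    std::vector<std::vector<float>> encoded(
        column.size(), std::vector<float>(index.size(), 0.0f));
    for (std::size_t i = 0; i < column.size(); ++i)
        encoded[i][index[column[i]]] = 1.0f;
    return encoded;
}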
Why can't we use adaboost::core::Matrix for the dataset and adaboost::core::Vector for the classes? Features will be some floating point numbers, right?
Ok great that solves many of our doubts. We will be implementing Algorithm 1 from https://web.stanford.edu/~hastie/Papers/samme.pdf as follows:
Here is the proposed directory structure.
adaboost/
|_directory xxx/
|_ algorithm.hpp
|_ algorithm_impl.cpp
|_ utilities.hpp
|_ utilities_impl.cpp
Each stump (weak classifier) will be represented as a struct:
struct stump {
    unsigned feature; // the feature based on which it is classified
    float threshold;
    float alpha;      // as computed in 2(c)
};
algorithm.hpp will have two functions:
Vector<struct stump> two_class_adaboost(adaboost::core::Matrix data):
Input: a Matrix of floats. Suppose the input data has n features; then the Matrix will have n+1 columns. The last column will be the class type, i.e. the output variable y for each example.
Output: A vector of stumps, that is the list of all the weak classifiers.
float predict_class(Vector<struct stump> stumps, Vector<float> example)
Input: a vector which is the list of stumps, and the example which is to be classified.
Output: Predicted class for that example. Computed using point 3 from algorithm 1.
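A rough sketch of what predict_class could do internally, assuming the two classes are encoded as -1 and +1 (that encoding is an assumption on our part, not fixed yet) and written with std::vector in place of the adaboost::core types just to keep the sketch self-contained:

#include <vector>

struct stump {
    unsigned feature; // the feature on which the stump classifies
    float threshold;
    float alpha;      // as computed in 2(c)
};

// Point 3 of Algorithm 1: each stump votes -1 or +1 and the final class is
// the sign of the alpha-weighted sum of the votes.
float predict_class(const std::vector<stump>& stumps,
                    const std::vector<float>& example)
{
    float weighted_vote = 0.0f;
    for (const auto& s : stumps) {
        float vote = (example[s.feature] > s.threshold) ? 1.0f : -1.0f;
        weighted_vote += s.alpha * vote;
    }
    return (weighted_vote >= 0.0f) ? 1.0f : -1.0f;
}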
utilities.hpp will contain all the utility functions that two_class_adaboost and predict_class will call. They are:
struct stump generate_stump(unsigned feature, Matrix data, Vector weights)
Input: Feature upon which the stump is to be fit, the training data, and weights of each example.
Output: The optimal stump.
We believe the optimal stump can be found in O(n log n): by first sorting all the values of the feature, then making an O(n) pass to figure out the optimal threshold (a sketch of this is given after this list of functions).
struct stump find_optimal_stump(Matrix data, Vector weights)
Input: The data and weights.
Output: The optimal stump returned in 2(a).
This function simply loops over all the features and calls generate_stump for each feature. It finds the best one amongst these and returns it.
void update_weights(Matrix data, Vector weights, struct stump T)
Input: The training data, the weights and the selected stump.
It updates the weights as in 2(d).
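Here is the sketch of the O(n log n) idea referred to above for generate_stump: sort the examples of one feature, then do a single pass, updating a running error as the candidate threshold moves past each example. It is written with std::vector instead of the adaboost::core types, assumes labels are -1/+1, and scores thresholds by weighted misclassification error rather than the Gini index, just to keep the O(1) update simple.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

struct stump { unsigned feature; float threshold; float alpha; };

// Fit the best threshold for one feature in O(n log n): sort once, then one
// pass. The returned stump predicts +1 when value > threshold.
stump generate_stump(unsigned feature,
                     const std::vector<float>& values,  // this feature's column
                     const std::vector<int>& labels,    // -1 or +1
                     const std::vector<float>& weights)
{
    const std::size_t n = values.size();
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return values[a] < values[b]; });

    // Threshold below every value: everything is predicted +1, so the
    // weighted error is the total weight of the -1 examples.
    float error = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        if (labels[i] == -1) error += weights[i];

    float best_error = error;
    float best_threshold = values[order[0]] - 1.0f;

    // Moving the threshold past example i flips its prediction from +1 to -1,
    // which changes the weighted error by +/- its weight (ties between equal
    // feature values are ignored in this sketch).
    for (std::size_t k = 0; k < n; ++k) {
        const std::size_t i = order[k];
        error += (labels[i] == 1) ? weights[i] : -weights[i];
        if (error < best_error) {
            best_error = error;
            best_threshold = values[i];
        }
    }
    return stump{feature, best_threshold, 0.0f}; // alpha is filled in later, in 2(c)
}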
Please let us know what you think. We are ready to split this work amongst ourselves and begin coding it.
I think the pipeline of training the algorithm and then using it for inference should be made clearer. How will you fit the weak classifiers while training? The paper doesn't specify that clearly in the algorithm. For the rest I have a design in mind; I will comment here tomorrow.
begin coding it.
No hurries.
What I am thinking is that training, and predicting should require only the data from the user, and rest of the implementation details should be hidden.
template <class data_type>
class AdaBoost
{
    private:
        Matrix<data_type> data;
        Vector<unsigned> classes;
        Vector<Classifier> T;
        Vector<data_type> alpha;
        AdaBoost(Matrix<data_type> data, Vector<unsigned> classes);
    public:
        static AdaBoost* createModel(Matrix<data_type> data, Vector<unsigned> classes);
        void train();
        unsigned predict(Vector<data_type> input);
        Vector<unsigned> predict(Matrix<data_type> input);
        Vector<Classifier> get_classifier();
        Vector<data_type> get_weights();
};
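A hypothetical usage from the end user's side, just to illustrate the intent of the interface (the types are the adaboost::core ones proposed above; none of this is final):

// How training and prediction would look to the end user: only the data
// and classes are supplied, everything else stays hidden inside AdaBoost.
template <class data_type>
unsigned example_usage(Matrix<data_type>& data,
                       Vector<unsigned>& classes,
                       Vector<data_type>& new_example)
{
    AdaBoost<data_type>* model = AdaBoost<data_type>::createModel(data, classes);
    model->train();                      // fits the weak classifiers internally
    return model->predict(new_example);  // weighted vote of the trained classifiers
}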
Now, the thing to be decided upon is the interface of the classifier. The algorithm has a step,
Fit a classifier T(m)(x) to the training data using weights w_i.
How will we actually do the above step?
So in order to do this we will need to implement fitting a weak classifier. Here is our plan.
The input is some training samples with n features. We need to find a weak classifier based on one of these n features. So:
Pseudocode:
best_classifier = null
for i in n:
    find the weak classifier for feature i
    if it is a better classifier than best_classifier, update best_classifier (use the Gini index to decide which is better)
return best_classifier
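For concreteness, the loop could translate to something like this in C++ (fit_stump_for_feature and gini_of are stand-ins for the fitting and Gini routines discussed elsewhere in this thread, not existing functions):

#include <limits>

struct stump { unsigned feature; float threshold; float alpha; };

// Stand-ins for routines described elsewhere in this thread.
stump fit_stump_for_feature(unsigned feature /* , data, weights */);
float gini_of(const stump& candidate /* , data, weights */);

stump find_best_stump(unsigned n_features)
{
    stump best{};
    float best_gini = std::numeric_limits<float>::max();
    for (unsigned i = 0; i < n_features; ++i) {
        stump candidate = fit_stump_for_feature(i);
        float gini = gini_of(candidate);
        if (gini < best_gini) { // lower Gini impurity means a better split
            best_gini = gini;
            best = candidate;
        }
    }
    return best;
}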
How to find the weak classifier for feature i? We have the vector of weights.
There are two alternatives for this as stated in https://cseweb.ucsd.edu/~yfreund/papers/IntroToBoosting.pdf (page 2 last para)
Method 1: boosting by re-sampling, i.e. draw a new training set according to the current weights and fit the weak classifier on that sample.
Method 2: boosting by re-weighting, i.e. pass the weights directly to the weak learner.
This means no re-sampling: directly proceed with step two of the above algorithm. Only now, to find the best possible classifier, instead of using the Gini index we should use another measure which also takes the weights of the data points into account. I have not yet been able to find an appropriate formula which takes the weights of individual examples into account when measuring the performance of a decision tree, but if we are able to find one we can implement it.
Which method to choose?
As suggested in this paper, there are multiple ways to find a weak classifier that does the job of classifying examples better than random guessing. So, the first thing to take care of is to allow the end user to define their own weak learning algorithm and use it inside AdaBoost training. The interface of any weak classifier should therefore look as follows:
struct BaseProperties
{
};

template <class data_type>
class WeakClassifier
{
    private:
        Matrix<data_type>* data;
        Vector<unsigned>* classes;
        BaseProperties* classifier_information;
        WeakClassifier(Matrix<data_type> data, Vector<unsigned> classes);
    public:
        static WeakClassifier* createWeakClassifier(Matrix<data_type> data, Vector<unsigned> classes);
        void train(Vector<data_type> example_weights);
        unsigned predict(Vector<data_type> input);
        Vector<unsigned> predict(Matrix<data_type> input);
};
Now, inside the train method, the end user should be able to define their own way of training the weak classifier learning algorithm. The struct BaseProperties will contain the necessary information related to a weak classifier; for example, by inheriting the empty struct, the user can add their own properties specific to their classifier and use them inside the definitions of the train and predict methods.
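For example, a user writing a stump-based weak classifier might inherit BaseProperties like this (StumpProperties and predict_with_stump are made-up names, only there to show the mechanism):

#include <vector>

struct BaseProperties // the empty base struct proposed above
{
};

// User-defined properties for a decision-stump weak classifier.
struct StumpProperties : public BaseProperties
{
    unsigned feature; // which feature the stump splits on
    float threshold;  // the learned threshold
};

// Inside the user's predict(), the stored properties are recovered from the
// BaseProperties pointer and used for the split test.
unsigned predict_with_stump(const BaseProperties* classifier_information,
                            const std::vector<float>& input)
{
    const auto* props = static_cast<const StumpProperties*>(classifier_information);
    return (input[props->feature] > props->threshold) ? 1u : 0u;
}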
Now, coming to implementing a default weak classifier, I had thought of using a linear weak classifier, which simply uses a linearly weighted sum of the features of the inputs. While training the weak classifier, simple gradient descent can be applied on binary cross-entropy, and then after every update we can check whether it works better than random guessing (see the function on page 18, step 1 of http://people.csail.mit.edu/dsontag/courses/ml12/slides/lecture13.pdf); if it does, stop the training and return to the main algorithm.
I don't know if it will work but since the task of finding weak classifiers is pretty open, any method should be acceptable.
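A rough sketch of what one training update of that default linear weak classifier could look like: a single weighted gradient-descent pass over the data on binary cross-entropy (this is only one possible reading of the proposal; the function and parameter names are made up):

#include <cmath>
#include <cstddef>
#include <vector>

// One gradient-descent pass of a linear classifier w on binary cross-entropy,
// with each example's gradient scaled by its AdaBoost weight.
void gradient_step(std::vector<float>& w,
                   const std::vector<std::vector<float>>& x, // examples
                   const std::vector<unsigned>& y,           // labels in {0, 1}
                   const std::vector<float>& example_weights,
                   float learning_rate)
{
    for (std::size_t i = 0; i < x.size(); ++i) {
        // Prediction p = sigmoid(w . x_i).
        float z = 0.0f;
        for (std::size_t j = 0; j < w.size(); ++j)
            z += w[j] * x[i][j];
        float p = 1.0f / (1.0f + std::exp(-z));

        // d(BCE)/dw_j = (p - y_i) * x_ij, scaled by the example's weight.
        float scale = example_weights[i] * (p - static_cast<float>(y[i]));
        for (std::size_t j = 0; j < w.size(); ++j)
            w[j] -= learning_rate * scale * x[i][j];
    }
}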
Ok, we will go ahead with user-defined weak classifier then.
However, every source we looked at online mentions using decision stumps as the weak classifier. It seems this is the traditionally used weak classifier, so we feel it will be the best fit for the default weak classifier.
Well, on what loss function do these decision stumps optimise? Or in other words, how will it be ensured that these stumps are better than random guesses? Also, what will go inside the WeakClassifier.train of these stumps?
Well, on what loss function do these decision stumps optimise? Or in other words, how will it be ensured that these stumps are better than random guesses?
These stumps will definitely be better than random guesses. As we mentioned above, the method we would use to compare is the Gini Index, or Gini Impurity. There are some variations of how to calculate it; we have followed the one illustrated here: https://stats.stackexchange.com/questions/308885/a-simple-clear-explanation-of-the-gini-impurity
How to ensure that the stump will be better than a random classifier? For a random classifier each p_i = 1/2, since there are two classes. Plugging this into the formula gives the Gini Impurity of a random classifier as 1 - (1/4 + 1/4) = 0.5.
It can be shown through simple mathematics that the Gini Impurity is never greater than 0.5 for any values of p_i. This link illustrates the graph of the Gini Index for various values: https://towardsdatascience.com/gini-impurity-measure-dbd3878ead33
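A tiny sketch of the two-class Gini impurity used in this argument; with p = 1/2 it evaluates to exactly 0.5, its maximum, and it drops to 0 for a pure node:

#include <cassert>

// Two-class Gini impurity: 1 - p^2 - (1 - p)^2, where p is the (possibly
// weighted) fraction of one class in the node.
float gini_impurity(float p)
{
    return 1.0f - p * p - (1.0f - p) * (1.0f - p);
}

int main()
{
    assert(gini_impurity(0.5f) == 0.5f); // a random 50/50 split: worst case
    assert(gini_impurity(0.0f) == 0.0f); // a pure node: best case
    return 0;
}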
Also, what will go inside the WeakClassifier.train of these stumps?
Please see the explanation we gave in our initial design proposal: https://github.com/codezonediitj/adaboost/issues/33#issuecomment-668784933
I see. You can proceed with the implementation. Make a new folder, adaboost, under the project root and create hpp files for the interfaces and cpp files for implementing those interfaces. Follow the interfaces suggested above. WeakClassifier should be an abstract class and it should be inherited by something like GiniIndexWeakClassifier where you can implement your idea.
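So the class hierarchy suggested above would look roughly like this (the signatures are copied from the interface earlier in this thread and rely on the Matrix, Vector and BaseProperties types proposed there; GiniIndexWeakClassifier's internals are still to be written):

// WeakClassifier as an abstract base class: train and predict become pure
// virtual, and concrete weak classifiers implement them.
template <class data_type>
class WeakClassifier
{
    protected:
        Matrix<data_type>* data;
        Vector<unsigned>* classes;
        BaseProperties* classifier_information;
    public:
        virtual void train(Vector<data_type> example_weights) = 0;
        virtual unsigned predict(Vector<data_type> input) = 0;
        virtual ~WeakClassifier() {}
};

// The default: a decision stump chosen by the Gini index, as discussed above.
template <class data_type>
class GiniIndexWeakClassifier : public WeakClassifier<data_type>
{
    public:
        void train(Vector<data_type> example_weights) override;
        unsigned predict(Vector<data_type> input) override;
};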
Description of the problem
Task is to discuss implementation of both Multi-Class and Two-Class AdaBoost. Paper we have referred to is this: https://web.stanford.edu/~hastie/Papers/samme.pdf
Current Thoughts
SAMME.R
Initial Doubts
where the RHS is expanded as:
We are not able to understand what the f and the I symbols mean.