codezonediitj / adaboost

Implementation of AdaBoost Algorithm

[Discussion] Implementation of AdaBoost #33

Open Tanvi141 opened 4 years ago

Tanvi141 commented 4 years ago

Description of the problem

Task is to discuss implementation of both Multi-Class and Two-Class AdaBoost. Paper we have referred to is this: https://web.stanford.edu/~hastie/Papers/samme.pdf

Current Thoughts

Initial Doubts

  • For two-class adaboost, which algorithm should we be following?

  • We are having some trouble understanding the paper. Especially this line in Algorithm 4: [image from the paper], where the RHS is expanded as: [image from the paper]. We are not able to understand what the f and the I symbols mean.

czgdp1807 commented 4 years ago
  • For two-class adaboost, which algorithm should we be following?

We can start with the basic Algorithm 1.

  • We are having some trouble understanding the paper. Especially this line in Algorithm 4:

Will get back to you on this. Until then if your doubt is resolved let me know.

czgdp1807 commented 4 years ago

You can propose the API for algorithm 1 here in this issue.

fiza11 commented 4 years ago

This is how we plan to store each T, but then each T will not be the same. So the question is: how can we store all the T's together? Also, can we use STL containers like sets and pairs?

czgdp1807 commented 4 years ago

Can you please share some fake function calls? For example, how would the end user call the function to run the AdaBoost algorithm, and what would the output be in return?

Tanvi141 commented 4 years ago

For the data points: The main function will take as input the dataset x and the classes y. Suppose each x(i) has n features. As we mentioned, x can either be a collection of n feature vectors plus one more vector y, or it can be a single vector of structs, where each struct contains all the features and the output for a particular data point.

Storing all the stumps together: The output, as we mentioned, is a collection of stumps. Each stump may involve a different data type, so we are not sure how to store them all together. We believe each stump can be represented as a struct, but the data type of the elements of each struct may differ between stumps, since each feature may not be of the same data type.

Hence fake function call will be like so: <collection of stumps> compute_adaboost(<collection of all data points in input format>)

Then there can be another function which knows how to interpret the <collection of stumps> and can classify any new incoming data point. The call for this will be: <class type> predict_class(<collection of stumps>, <the single data point>), which returns the predicted class.
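Below is a rough C++ sketch of these two fake calls; every name in it (DataPoint, Stump, compute_adaboost, predict_class) is only a placeholder for illustration, not a decided API.

#include <vector>

struct DataPoint { /* the n features plus the class label y */ };
struct Stump     { /* feature index, threshold, ...         */ };

// Trains on the whole data set and returns the list of weak classifiers.
std::vector<Stump> compute_adaboost(const std::vector<DataPoint> &training_data);

// Combines the stumps to classify a single new data point.
unsigned predict_class(const std::vector<Stump> &stumps, const DataPoint &example);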

Further discussion of representing the stump: This link states that for a particular weak classifier stump, we only need to find the threshold value: https://towardsdatascience.com/understanding-adaboost-2f94f22d5bfe. Consider the following proposed struct for representing the stump.

template <class data_type_feature>
struct stump{
    int feature;                        // to store what feature the stump was built on
    data_type_feature threshold_value;  // split threshold for that feature
};

As I mentioned above, the issue with storing all the stumps together is that each stump is made using a particular feature. For any categorical feature (such as color of object) we can use one-hot encoding (https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/). We will clearly need to remove all string-type variables and make them numerical, so some kind of pre-processing of the data is required (a small illustration follows). But even among numerical values there are various data types like float, int, etc. So how will we deal with this issue?
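As a small illustration of that pre-processing (the color values and function name are made up for the example):

#include <array>
#include <string>

// Turns a categorical feature "color" into three numeric 0/1 columns,
// so that the whole row can be stored as floating point values.
std::array<float, 3> one_hot_color(const std::string &color)
{
    return { color == "red"   ? 1.0f : 0.0f,
             color == "green" ? 1.0f : 0.0f,
             color == "blue"  ? 1.0f : 0.0f };
}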

czgdp1807 commented 4 years ago

Why can't we use adaboost::core::Matrix for the dataset or adaboost::core::Vector for the classes? Features will be some floating point numbers right?

Tanvi141 commented 4 years ago

Ok great that solves many of our doubts. We will be implementing Algorithm 1 from https://web.stanford.edu/~hastie/Papers/samme.pdf as follows:

Here is the proposed directory structure.

adaboost/
|_directory xxx/
  |_ algorithm.hpp
  |_ algorithm_impl.cpp
  |_ utilities.hpp
  |_ utilities_impl.cpp

Each stump (weak classifier) will be represented as a struct:

struct stump{
    unsigned feature;   // index of the feature based on which it classifies
    float threshold;    // split threshold for that feature
    float alpha;        // classifier weight, as computed in 2(c)
};

algorithm.hpp will have two functions:

  1. Vector<struct stump> two_class_adaboost(adaboost::core::Matrix data):
    Input: A matrix of floats. Suppose the input data has n features; then the matrix will have n+1 columns, with the last column being the class, i.e. the output variable y for each example. Output: A vector of stumps, that is, the list of all the weak classifiers.

  2. float predict_class(Vector<struct stump> stumps, Vector<float> example) Input: A vector which is the list of stumps, and the example which is to be classified. Output: The predicted class for that example, computed using point 3 from Algorithm 1.

utilities.hpp will contain all the utility functions that two_class_adaboost and predict_class will call. They are:

  1. struct stump generate_stump(unsigned feature, Matrix data, Vector weights) Input: The feature upon which the stump is to be fit, the training data, and the weights of each example. Output: The optimal stump. We believe the optimal stump can be found in O(n log n): first sort all the values of the feature, then make an O(n) pass to figure out the optimal threshold (see the sketch after this list).

  2. struct stump find_optimal_stump(Matrix data, Vector weights) Input: The data and weights. Output: The optimal stump returned in 2(a). This function simply loops over all the features and calls generate_stump for each feature. It finds the best one amongst these and returns it.

  3. void update_weights(Matrix data, Vector weights, struct stump T) Input: The training data, the weights and the selected stump. It updates the weights as in 2(d).
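Here is a minimal sketch of the O(n log n) threshold search described in utility 1. It uses plain std::vector and +1/-1 labels purely for illustration (the real adaboost::core types and the exact error criterion are still to be decided), and it only checks the rule "x <= threshold is class -1"; a full stump would also try the reversed rule.

#include <algorithm>
#include <numeric>
#include <vector>

// Returns the threshold on a single feature that minimises the weighted
// misclassification error of the rule "x <= threshold -> -1, else +1".
float best_threshold(const std::vector<float> &feature,
                     const std::vector<int>   &labels,   // +1 or -1
                     const std::vector<float> &weights)
{
    const std::size_t n = feature.size();
    if (n == 0) return 0.0f;

    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return feature[a] < feature[b]; });

    // Threshold below every value: everything is predicted +1,
    // so the error is the total weight of the -1 examples.
    float error = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        if (labels[i] == -1) error += weights[i];

    float best_error = error;
    float threshold = feature[order[0]] - 1.0f;

    // Moving the threshold past example j flips its prediction to -1.
    for (std::size_t j = 0; j < n; ++j)
    {
        const std::size_t i = order[j];
        error += (labels[i] == 1) ? weights[i] : -weights[i];
        if (error < best_error)
        {
            best_error = error;
            threshold = feature[i];
        }
    }
    return threshold;
}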

Please let us know what you think. We are ready to split this work amongst ourselves and begin coding it.

czgdp1807 commented 4 years ago

I think the pipeline of training the algorithm and then using it for inference should be clearer. How will you fit the weak classifiers while training? The paper doesn't specify that clearly in the algorithm. For the rest, I have a design in mind and will comment here tomorrow.

czgdp1807 commented 4 years ago

begin coding it.

No hurries.

What I am thinking is that training and predicting should require only the data from the user, and the rest of the implementation details should be hidden.

template <class data_type>
class AdaBoost
{
    private:
        Matrix<data_type> data;
        Vector<unsigned> classes;
        Vector<Classifier> T;       // the trained weak classifiers
        Vector<data_type> alpha;    // weight of each weak classifier

        AdaBoost(Matrix<data_type> data, Vector<unsigned> classes);

    public:
        static AdaBoost* createModel(Matrix<data_type> data, Vector<unsigned> classes);

        void train();

        unsigned predict(Vector<data_type> input);

        Vector<unsigned> predict(Matrix<data_type> input);

        Vector<Classifier> get_classifier();

        Vector<data_type> get_weights();
};
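For the training/inference pipeline, a hypothetical usage of this interface could look like the snippet below; the load_* helpers, single_example and test_data are placeholders for however the user actually builds the Matrix and Vector objects.

// Hypothetical usage only; nothing here is final API.
Matrix<float> data = load_training_data();
Vector<unsigned> classes = load_training_classes();

AdaBoost<float>* model = AdaBoost<float>::createModel(data, classes);
model->train();                                        // fits the weak classifiers

unsigned label = model->predict(single_example);       // one data point
Vector<unsigned> labels = model->predict(test_data);   // whole matrix of inputs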

Now, the thing to be decided upon is how to decide the interface of the classifier. The algorithm has a step,

Fit a classifier T(m)(x) to the training data using weights w_i.

How will we actually do the above step?

Tanvi141 commented 4 years ago

Fit a classifier T(m)(x) to the training data using weights w_i.

How will we actually do the above step?

So in order to do this we will need to implement fitting a weak classifier. Here is our plan. The input is some training samples with n features. We need to find a weak classifier based on one of these n features. So:

pseudocode:

best_classifier = null
for i in 1..n:
    find the weak classifier for feature i
    if it is a better classifier than best_classifier, update best_classifier
        (use the Gini index to decide which is better)
return best_classifier

How to find the weak classifier for feature i? We have the vector of weights. There are two alternatives for this, as stated in https://cseweb.ucsd.edu/~yfreund/papers/IntroToBoosting.pdf (page 2, last paragraph).

Method 1:

  1. Sample a subset of the training examples using the cumulative vector of weights to obtain a new sample (see the resampling sketch after this list). This means we will need to update the data set as well. Data points with higher weights will enter the new data set with higher probability and may even appear multiple times. (After re-sampling, the vector of weights is reset so all values are equal, i.e. 1/number of samples.)
  2. Now to fit the weak classifier we need to find the threshold value. There are multiple possibilities. We try them all for best results. Sort all the possible threshold values (which are all values in feature i) and try each one by one.
  3. Thus there are multiple candidate threshold values. Compare them using the Gini index (http://www.learnbymarketing.com/481/decision-tree-flavors-gini-info-gain/#:~:text=Summary%3A%20The%20Gini%20Index%20is,2)%20of%20that%20class%20probability.) and return the best stump found.
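A minimal sketch of the re-sampling in step 1 (plain std::vector is used for illustration; the real adaboost::core types would be substituted):

#include <random>
#include <vector>

// Draws a new data set of the same size, picking each row with probability
// proportional to its weight; heavier examples may appear multiple times.
std::vector<std::vector<float>> resample(const std::vector<std::vector<float>> &data,
                                         const std::vector<float> &weights)
{
    std::mt19937 gen(std::random_device{}());
    // discrete_distribution builds the cumulative weight table internally
    std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());

    std::vector<std::vector<float>> resampled;
    resampled.reserve(data.size());
    for (std::size_t i = 0; i < data.size(); ++i)
        resampled.push_back(data[pick(gen)]);
    return resampled;
}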

Method 2:

This means no re-sampling: directly proceed with step two of the above algorithm. Only now, to find the best possible classifier, instead of using the Gini index we should use another measure which also takes the weights of the data points into account. I have not yet been able to find an appropriate formula which takes the weights of individual examples into account when measuring the performance of a decision tree, but if we are able to find one we can implement it.

Which method to choose?

czgdp1807 commented 4 years ago

As suggested in this paper, there are multiple ways to find a weak classifier that does the job of classifying examples better than random guessing. So, the first thing to take care of is to allow the end user to define their own weak learning algorithm and use it inside AdaBoost training. The interface of any weak classifier should therefore look as follows,

struct BaseProperties
{
};

template <class data_type>
class WeakClassifier
{
    private:
        Matrix<data_type>* data;
        Vector<unsigned>* classes;
        BaseProperties* classifier_information;

        WeakClassifier(Matrix<data_type> data, Vector<unsigned> classes);

    public:
        static WeakClassifier* createWeakClassifier(Matrix<data_type> data, Vector<unsigned> classes);

        void train(Vector<data_type> example_weights);

        unsigned predict(Vector<data_type> input);

        Vector<unsigned> predict(Matrix<data_type> input);
};

Now, inside the train method the end user should be able to define their own way of training the weak classifier learning algorithm. The struct BaseProperties will contain the necessary information related to a weak classifier; for example, by inheriting the empty struct, the user can add their own properties specific to their classifier and use them inside the definitions of the train and predict methods.
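For example, a user implementing a decision stump might extend it like this (the field names here are just an illustration):

// Hypothetical user-defined properties for a decision stump; train() would
// fill these fields and predict() would read them.
struct StumpProperties : public BaseProperties
{
    unsigned feature;   // index of the feature the stump splits on
    float threshold;    // split threshold used when predicting
};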

Now, coming to implementing a default weak classifier, I had thought of using a linear weak classifier, which simply uses a linearly weighted sum of the input features. While training the weak classifier, simple gradient descent can be applied on binary cross-entropy, and after every update we can check whether it works better than random guessing (see the function on page 18, step 1 of http://people.csail.mit.edu/dsontag/courses/ml12/slides/lecture13.pdf); if it does, stop the training and return to the main algorithm.

I don't know if it will work but since the task of finding weak classifiers is pretty open, any method should be acceptable.

Tanvi141 commented 4 years ago

Ok, we will go ahead with user-defined weak classifier then.

However, every source we looked at online mentions using decision stumps as the weak classifier. It seems this is the traditionally used choice, so we feel decision stumps would be the best fit for the default weak classifier.

czgdp1807 commented 4 years ago

Well, what loss function do these decision stumps optimise? Or in other words, how will it be ensured that these stumps are better than random guesses? Also, what will go inside WeakClassifier.train for these stumps?

Tanvi141 commented 4 years ago

Well, what loss function do these decision stumps optimise? Or in other words, how will it be ensured that these stumps are better than random guesses?

These stumps will definitely be better than random guesses. As we mentioned above, the method we would use to compare them is the Gini index (Gini impurity). There are some variations of how to calculate it; we have followed the one illustrated here: https://stats.stackexchange.com/questions/308885/a-simple-clear-explanation-of-the-gini-impurity

How do we ensure that the stump will be better than a random classifier? For a random classifier each class probability is pi = 1/2, since there are two classes. Plugging this into the formula gives a Gini impurity of 0.5 for a random classifier. It can be shown through simple mathematics that, for two classes, the Gini impurity is never greater than 0.5 for any values of pi (a small sketch of the computation follows). This link illustrates the graph of the Gini index for various values: https://towardsdatascience.com/gini-impurity-measure-dbd3878ead33
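A small sketch of that computation, using the usual 1 - sum(p_i^2) form of the Gini impurity:

#include <vector>

// Gini impurity = 1 - sum(p_i^2). For a two-class random guess,
// p = {0.5, 0.5} gives 1 - (0.25 + 0.25) = 0.5, the maximum for two classes.
float gini_impurity(const std::vector<float> &class_probabilities)
{
    float sum_of_squares = 0.0f;
    for (float p : class_probabilities)
        sum_of_squares += p * p;
    return 1.0f - sum_of_squares;
}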

Also, what will go inside WeakClassifier.train for these stumps?

Please see the explanation we gave in our initial design proposal: https://github.com/codezonediitj/adaboost/issues/33#issuecomment-668784933

czgdp1807 commented 4 years ago

I see. You can proceed with the implementation. Make a new folder, adaboost, under the project root and create hpp files for the interfaces and cpp files for implementing those interfaces. Follow the interfaces suggested above. WeakClassifier should be an abstract class and it should be inherited by something like GiniIndexWeakClassifier where you can implement your idea.
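A minimal sketch of that inheritance, with the method names taken from the interface above (making train and predict pure virtual is an assumption here):

template <class data_type>
class WeakClassifier
{
    public:
        virtual void train(Vector<data_type> example_weights) = 0;
        virtual unsigned predict(Vector<data_type> input) = 0;
        virtual ~WeakClassifier() {}
};

template <class data_type>
class GiniIndexWeakClassifier: public WeakClassifier<data_type>
{
    public:
        void train(Vector<data_type> example_weights) override;
        unsigned predict(Vector<data_type> input) override;
};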