daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0

decisiontree predicts only the majority class for imbalanced data. #686

Closed: bl1zzardx closed this issue 5 months ago

bl1zzardx commented 7 months ago

Hi,

I can now run the randomForest / decisionTree algorithms for my project. I am still using the daphne container available on Docker Hub.

The issue I face now is that the decision tree never predicts the positive class for my imbalanced data problem. Is there a way to change the weight of predictions, so that the model is encouraged to predict the positive class more often instead of always going with the majority class? To be fair, just predicting the majority class is a valid strategy in my case, but it is not what I want.

To recreate the problem, you would have to download x_valid.csv and y_valid.csv from here again and load them based on your setup. I have attached the respective csv.meta files here; you would have to remove the .json ending to use them.
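For reference (these are made-up dimensions, not the contents of the attached files): a DAPHNE .csv.meta file is a small JSON document placed next to the CSV it describes, roughly of the form

    {
        "numRows": 2000,
        "numCols": 105,
        "valueType": "f64"
    }

so, e.g., x_valid.csv is accompanied by x_valid.csv.meta. The script I use is the following: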

import "/mnt/daphne_work/daphne/scripts/algorithms/decisionTree_.daph";
import "/mnt/daphne_work/daphne/scripts/algorithms/decisionTreePredict_.daph";
import "/mnt/daphne_work/daphne/scripts/algorithms/randomForest_.daph";
import "/mnt/daphne_work/daphne/scripts/algorithms/randomForestPredict_.daph";

# helper functions
def shape(df) {
    print("shape (" + nrow(df) + ", " + ncol(df) + ")");
    return nrow(df), ncol(df);
} 

def display(df:matrix, start_at:si64, display_rows:si64, text:str) {
    print(concat(text, ": "), 0); shape(df);
    print(df[start_at:start_at+display_rows,], 1);
    print("---");
}

# execution of pipeline
def main(verbose) {
    t_start = now();
    print("Running machine learning pipeline for ion beam tuning prediction.");
    print("Reading in csv files.");

    ###########
    # load data
    ###########

    # read input data
    # x_train = readMatrix("/mnt/daphne_work/ifat_semiconductors/data/x_train.csv")[1:,1:];
    # y_train = readMatrix("/mnt/daphne_work/ifat_semiconductors/data/y_train.csv")[1:,1:];
    x_valid = readMatrix("/mnt/daphne_work/ifat_semiconductors/data/x_valid.csv")[1:,1:];
    y_valid = readMatrix("/mnt/daphne_work/ifat_semiconductors/data/y_valid.csv")[1:,1:];

    # subset train set due to out-of-memory
    x_train = x_valid[0:1000, :];
    y_train = y_valid[0:1000, :];
    x_valid = x_valid[1001:2000, :];
    y_valid = y_valid[1001:2000, :];

    # only take setup_result as a label 
    y_train = y_train[:, 0];
    y_valid = y_valid[:, 0];

    # display data
    if (verbose) {
        display(x_train, 0, 2, "x_train");
        display(x_valid, 0, 2, "x_valid");
        display(y_train, 0, 5, "y_train");
        display(y_valid, 0, 5, "y_valid");
    }

    ########################
    # perform classification
    ########################

    ###########
    # bin data
    ###########

    # inits new x_train_binned and x_valid_binned variables
    x_train_binned = bin(as.matrix(x_train[, 0]), 5);
    for (i in 1:104) {
        # starts from 1 on purpose, as 0 was used for init
        x_train_binned = cbind(x_train_binned, bin(as.matrix(x_train[, i]), 50));
    }

    x_valid_binned = bin(as.matrix(x_valid[, 0]), 5);
    for (i in 1:104) {
        # starts from 1 on purpose, as 0 was used for init
        x_valid_binned = cbind(x_valid_binned, bin(as.matrix(x_valid[, i]), 50));
    }

    # recodes y to y_binned
    y_train_recoded, y_dict = recode(as.matrix(y_train), true);
    y_valid_recoded, _ = recode(as.matrix(y_valid), true);

    # switch from DAPHNE's 0-based indexing to SystemDS's 1-based indexing.
    x_train_binned = x_train_binned + 1;
    x_valid_binned = x_valid_binned + 1;
    y_train_recoded = y_train_recoded + 1;
    y_valid_recoded = y_valid_recoded + 1;

    # define data types
    R = fill(1, 1, ncol(x_train_binned)+1); # needs to include y ctype at last position
    R[, ncol(R) - 1] = as.matrix(2); # overwrites y ctype to be categorical (=2) instead of ordinal (=1)

    if (verbose) {
        print("---");
        display(x_train_binned, 0, 2, "x_train_binned");
        display(x_valid_binned, 0, 2, "x_valid_binned");
        display(y_train_recoded, 0, 5, "y_train_recoded");
        display(y_valid_recoded, 0, 5, "y_valid_recoded");
        display(y_dict, 0, 2, "y_dict");

        # decoded = y_dict[y_train_recoded, ];
        # print(decoded);

        display(R, 0, 1, "R");
    }

    ###########
    # train model
    ###########

    # INPUT:
    # ------------------------------------------------------------------------------
    # X               Feature matrix in recoded/binned representation
    # y               Label matrix in recoded/binned representation
    # ctypes          Row-Vector of column types [1 scale/ordinal, 2 categorical]
    #                 of shape 1-by-(ncol(X)+1), where the last entry is the y type
    # max_depth       Maximum depth of the learned tree (stopping criterion)
    # min_leaf        Minimum number of samples in leaf nodes (stopping criterion),
    #                 odd number recommended to avoid 50/50 leaf label decisions
    # min_split       Minimum number of samples in leaf for attempting a split
    # max_features    Parameter controlling the number of features used as split
    #                 candidates at tree nodes: m = ceil(num_features^max_features)
    # max_values      Parameter controlling the number of values per feature used
    #                 as split candidates: nb = ceil(num_values^max_values)
    # impurity        Impurity measure: entropy, gini (default), rss (regression)
    # seed            Fixed seed for randomization of samples and split candidates
    # verbose         Flag indicating verbose debug output

    # cast to conform with the expected data types
    X = as.matrix<f64>(x_train_binned);
    y = as.matrix<f64>(y_train_recoded);
    Xv = as.matrix<f64>(x_valid_binned);
    yv = as.matrix<f64>(y_valid_recoded);
    R = as.matrix<f64>(R);

    yhat = fill(0.0, 0, 0);
    maxV = 1.0; # number of values per feature used as split candidates
    dt = 1; # num decision trees in random forest

    if( dt==1 ) {
    M = decisionTree_.decisionTree(
        /*X=*/X, /*y=*/y, /*ctypes=*/R,
        /*max_depth=*/10, /*min_leaf=*/4, /*min_split=*/10,
        /*max_features=*/1.0, /*max_values=*/maxV,
        /*impurity=*/"gini", /*seed=*/7, /*verbose=*/true
    );

    # INPUT:
    # ------------------------------------------------------------------------------
    # X               Feature matrix in recoded/binned representation
    # y               Label matrix in recoded/binned representation,
    #                 optional for accuracy evaluation
    # ctypes          Row-Vector of column types [1 scale/ordinal, 2 categorical]
    # M               Matrix M holding the learned tree in linearized form
    #                 see decisionTree() for the detailed tree representation.
    # strategy        Prediction strategy, can be one of ["GEMM", "TT", "PTT"],
    #                 referring to "Generic matrix multiplication",
    #                 "Tree traversal", and "Perfect tree traversal", respectively
    # verbose         Flag indicating verbose debug output
    # ------------------------------------------------------------------------------
    #
    # OUTPUT:
    # ------------------------------------------------------------------------------
    # yhat            Label vector of predictions
    # ------------------------------------------------------------------------------

    yhat = decisionTreePredict_.decisionTreePredict(
        /*X=*/Xv, /*ctypes=*/R, /*M=*/M,
        /*strategy=*/"TT", /*verbose=*/true
    );
    }
    else {

    # INPUT:
    # ------------------------------------------------------------------------------
    # X               Feature matrix in recoded/binned representation
    # y               Label matrix in recoded/binned representation
    # ctypes          Row-Vector of column types [1 scale/ordinal, 2 categorical]
    #                 of shape 1-by-(ncol(X)+1), where the last entry is the y type
    # num_trees       Number of trees to be learned in the random forest model
    # sample_frac     Sample fraction of examples for each tree in the forest
    # feature_frac    Sample fraction of features for each tree in the forest
    # max_depth       Maximum depth of the learned tree (stopping criterion)
    # min_leaf        Minimum number of samples in leaf nodes (stopping criterion)
    # min_split       Minimum number of samples in leaf for attempting a split
    # max_features    Parameter controlling the number of features used as split
    #                 candidates at tree nodes: m = ceil(num_features^max_features)
    # max_values      Parameter controlling the number of values per feature used
    #                 as split candidates: nb = ceil(num_values^max_values)
    # impurity        Impurity measure: entropy, gini (default), rss (regression)
    # seed            Fixed seed for randomization of samples and split candidates
    # verbose         Flag indicating verbose debug output
    # ------------------------------------------------------------------------------
    sf = 1.0/(dt - 1); # sample fraction
    M = randomForest_.randomForest(
        /*X=*/X, /*y=*/y, /*ctypes=*/R,
        /*num_trees=*/dt - 1, /*sample_frac=*/sf, 
        /*feature_frac=*/1.0, /*max_depth=*/10, /*min_leaf=*/4, /*min_split=*/10,
        /*max_features=*/0.5, /*max_values=*/maxV,
        /*impurity=*/"gini", /*seed=*/7, /*verbose=*/true
    );
    yhat = randomForestPredict_.randomForestPredict(
        /*X=*/Xv, /*y=*/yv, /*ctypes=*/R, /*M=*/M, /*verbose=*/true
    );
    }

    # hacky way of decoding
    yhat = yhat - 1;
    yv = yv - 1;

    display(yhat, 0, 5, "yhat");
    display(yv, 0, 5, "yv");

    print("rows in y and yhat:");
    print(nrow(yv));
    print(nrow(yhat));

    print("1 entries (as opposed to 0s) within y and yhat:");
    print(sum(yv));
    print(sum(yhat));

    acc = mean(yhat == yv);
    print("accuracy: " + acc);
    print("high accuracy expected, due to imbalanced nature of dataset");

    # display timing info
    msec_factor = as.f32(0.000001);
    t_end = now();
    print("Time elapsed to script completion:  " + (as.f32((t_end - t_start)) * msec_factor) + " ms");
}

main(true);
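To confirm that the model really only ever predicts the majority class, the evaluation at the end of the script could be extended with per-class counts instead of only the overall accuracy. A minimal sketch (assuming the decoded yv and yhat contain 0/1 labels, and that comparisons against a scalar broadcast elementwise just like the "+ 1" shifts above):

    # per-class counts: elementwise comparisons yield 0/1 matrices,
    # and elementwise multiplication acts as a logical AND
    tp = sum((yhat == 1) * (yv == 1)); # predicted positive, actually positive
    fp = sum((yhat == 1) * (yv == 0)); # predicted positive, actually negative
    fn = sum((yhat == 0) * (yv == 1)); # predicted negative, actually positive
    print("true positives:  " + tp);
    print("false positives: " + fp);
    print("false negatives: " + fn);
    # if tp and fp are both zero, the model never predicts the positive class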
pdamme commented 7 months ago

Hi @bl1zzardx, as discussed in the meeting last Friday, I recommend the following:

  1. Try a balanced training data set (in case you're not doing that already), where both classes are equally frequent; a minimal sketch of one way to do this follows at the end of this comment.
  2. Experiment with the parameters of the random forest; e.g., more decision trees or a higher maximum depth could make it more accurate.
  3. As a last resort (or for testing purposes), you could inspect and modify the model manually, but that is not the recommended way, since it is hardly reproducible and depends on internals of the model. In our implementation, the random forest model is a matrix where each row represents a single decision tree. If you extract a (n x 1) row and reshape it to (2 x n/2), then in those rows where the left column is zero, the right column is the predicted class label. You could first try to find out whether the model really never produces your desired class label, or whether your data just doesn't trigger that case. In theory, you could also modify the threshold values (right column) for individual features (left column, where not zero) manually, but this could have unexpected side effects on the model's behavior.

Please let us know if the problem still exists when you try 1. and 2.
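Regarding point 1, here is a minimal, untested sketch of one way to balance the training data in DaphneDSL, by simply duplicating positive-class rows before the binning/recoding in your script. It assumes the raw labels in y_train are 0/1 and uses rbind/as.scalar, so treat it as an illustration rather than a recommended implementation:

    # hypothetical rebalancing step, to be placed right after y_train = y_train[:, 0];
    n_extra = 4;            # how many extra copies of each positive example to append
    x_bal = x_train;
    y_bal = y_train;
    last = nrow(y_train) - 1;
    for (i in 0:last) {
        if (as.scalar(y_train[i, 0]) == 1) {   # positive class assumed to be labeled 1
            for (k in 1:n_extra) {
                x_bal = rbind(x_bal, x_train[i, ]);
                y_bal = rbind(y_bal, y_train[i, ]);
            }
        }
    }
    # then use x_bal / y_bal instead of x_train / y_train for binning, recoding, and training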

bl1zzardx commented 5 months ago

So I tried 1. and 2., and it works for me now. Thank you!