I can now run the randomForest / decisionTree algorithms for my project. I am still using the DAPHNE container available on Docker Hub.
The issue I face now is that the decision tree never predicts the positive class for my imbalanced data problem. Is there a way to weight the predictions so that the model is encouraged to predict the positive class more often, instead of always going with the majority class? To be fair, just predicting the majority class is a valid strategy in my case, but it is not what I want.
To recreate the problem, you would have to download x_valid.csv and y_valid.csv from here again and load them based on your setup. I include the respective csv.meta files here; you'd have to remove the .json ending to use them:
import "/mnt/daphne_work/daphne/scripts/algorithms/decisionTree_.daph";
import "/mnt/daphne_work/daphne/scripts/algorithms/decisionTreePredict_.daph";
import "/mnt/daphne_work/daphne/scripts/algorithms/randomForest_.daph";
import "/mnt/daphne_work/daphne/scripts/algorithms/randomForestPredict_.daph";
# helper functions
def shape(df) {
print("shape (" + nrow(df) + ", " + ncol(df) + ")");
return nrow(df), ncol(df);
}
def display(df:matrix, start_at:si64, display_rows:si64, text:str) {
print(concat(text, ": "), 0); shape(df);
print(df[start_at:start_at+display_rows,], 1);
print("---");
}
# execution of pipeline
def main(verbose) {
t_start = now();
print("Running machine learning pipeline for ion beam tuning prediction.");
print("Reading in csv files.");
###########
# load data
###########
# read input data
# x_train = readMatrix("/mnt/daphne_work/ifat_semiconductors/data/x_train.csv")[1:,1:];
# y_train = readMatrix("/mnt/daphne_work/ifat_semiconductors/data/y_train.csv")[1:,1:];
x_valid = readMatrix("/mnt/daphne_work/ifat_semiconductors/data/x_valid.csv")[1:,1:];
y_valid = readMatrix("/mnt/daphne_work/ifat_semiconductors/data/y_valid.csv")[1:,1:];
# subset the data due to out-of-memory issues (train and valid are both taken from the validation data here)
x_train = x_valid[0:1000, :];
y_train = y_valid[0:1000, :];
x_valid = x_valid[1001:2000, :];
y_valid = y_valid[1001:2000, :];
# only take setup_result as a label
y_train = y_train[:, 0];
y_valid = y_valid[:, 0];
# display data
if (verbose) {
display(x_train, 0, 2, "x_train");
display(x_valid, 0, 2, "x_valid");
display(y_train, 0, 5, "y_train");
display(y_valid, 0, 5, "y_valid");
}
########################
# perform classification
########################
###########
# bin data
###########
# inits new x_train_binned and x_valid_binned variables
x_train_binned = bin(as.matrix(x_train[, 0]), 5);
for (i in 1:104) {
# starts from 1 on purpose, as 0 was used for init
x_train_binned = cbind(x_train_binned, bin(as.matrix(x_train[, i]), 50));
}
x_valid_binned = bin(as.matrix(x_valid[, 0]), 5);
for (i in 1:104) {
# starts from 1 on purpose, as 0 was used for init
x_valid_binned = cbind(x_valid_binned, bin(as.matrix(x_valid[, i]), 50));
}
# recodes y to y_binned
y_train_recoded, y_dict = recode(as.matrix(y_train), true);
y_valid_recoded, _ = recode(as.matrix(y_valid), true);
# switch from DAPHNE's 0-based indexing to SystemDS's 1-based indexing.
x_train_binned = x_train_binned + 1;
x_valid_binned = x_valid_binned + 1;
y_train_recoded = y_train_recoded + 1;
y_valid_recoded = y_valid_recoded + 1;
# define data types
R = fill(1, 1, ncol(x_train_binned)+1); # needs to include y ctype at last position
R[, ncol(R) - 1] = as.matrix(2); # overwrites y ctype to be categorical (=2) instead of ordinal (=1)
if (verbose) {
print("---");
display(x_train_binned, 0, 2, "x_train_binned");
display(x_valid_binned, 0, 2, "x_valid_binned");
display(y_train_recoded, 0, 5, "y_train_recoded");
display(y_valid_recoded, 0, 5, "y_valid_recoded");
display(y_dict, 0, 2, "y_dict");
# decoded = y_dict[y_train_recoded, ];
# print(decoded);
display(R, 0, 1, "R");
}
###########
# train model
###########
# INPUT:
# ------------------------------------------------------------------------------
# X             Feature matrix in recoded/binned representation
# y             Label matrix in recoded/binned representation
# ctypes        Row-Vector of column types [1 scale/ordinal, 2 categorical]
#               of shape 1-by-(ncol(X)+1), where the last entry is the y type
# max_depth     Maximum depth of the learned tree (stopping criterion)
# min_leaf      Minimum number of samples in leaf nodes (stopping criterion),
#               odd number recommended to avoid 50/50 leaf label decisions
# min_split     Minimum number of samples in leaf for attempting a split
# max_features  Parameter controlling the number of features used as split
#               candidates at tree nodes: m = ceil(num_features^max_features)
# max_values    Parameter controlling the number of values per feature used
#               as split candidates: nb = ceil(num_values^max_values)
# impurity      Impurity measure: entropy, gini (default), rss (regression)
# seed          Fixed seed for randomization of samples and split candidates
# verbose       Flag indicating verbose debug output
# ------------------------------------------------------------------------------
# cast inputs to the data types expected by the algorithm scripts
X = as.matrix<f64>(x_train_binned);
y = as.matrix<f64>(y_train_recoded);
Xv = as.matrix<f64>(x_valid_binned);
yv = as.matrix<f64>(y_valid_recoded);
R = as.matrix<f64>(R);
yhat = fill(0.0, 0, 0);
maxV = 1.0; # number of values per feature used as split candidates
dt = 1; # num decision trees in random forest
if( dt==1 ) {
M = decisionTree_.decisionTree(
/*X=*/X, /*y=*/y, /*ctypes=*/R,
/*max_depth=*/10, /*min_leaf=*/4, /*min_split=*/10,
/*max_features=*/1.0, /*max_values=*/maxV,
/*impurity=*/"gini", /*seed=*/7, /*verbose=*/true
);
# INPUT:
# ------------------------------------------------------------------------------
# X          Feature matrix in recoded/binned representation
# y          Label matrix in recoded/binned representation,
#            optional for accuracy evaluation
# ctypes     Row-Vector of column types [1 scale/ordinal, 2 categorical]
# M          Matrix M holding the learned tree in linearized form
#            see decisionTree() for the detailed tree representation.
# strategy   Prediction strategy, can be one of ["GEMM", "TT", "PTT"],
#            referring to "Generic matrix multiplication",
#            "Tree traversal", and "Perfect tree traversal", respectively
# verbose    Flag indicating verbose debug output
# ------------------------------------------------------------------------------
#
# OUTPUT:
# ------------------------------------------------------------------------------
# yhat       Label vector of predictions
# ------------------------------------------------------------------------------
yhat = decisionTreePredict_.decisionTreePredict(
/*X=*/Xv, /*ctypes=*/R, /*M=*/M,
/*strategy=*/"TT", /*verbose=*/true
);
}
else {
# INPUT:
# ------------------------------------------------------------------------------
# X             Feature matrix in recoded/binned representation
# y             Label matrix in recoded/binned representation
# ctypes        Row-Vector of column types [1 scale/ordinal, 2 categorical]
#               of shape 1-by-(ncol(X)+1), where the last entry is the y type
# num_trees     Number of trees to be learned in the random forest model
# sample_frac   Sample fraction of examples for each tree in the forest
# feature_frac  Sample fraction of features for each tree in the forest
# max_depth     Maximum depth of the learned tree (stopping criterion)
# min_leaf      Minimum number of samples in leaf nodes (stopping criterion)
# min_split     Minimum number of samples in leaf for attempting a split
# max_features  Parameter controlling the number of features used as split
#               candidates at tree nodes: m = ceil(num_features^max_features)
# max_values    Parameter controlling the number of values per feature used
#               as split candidates: nb = ceil(num_values^max_values)
# impurity      Impurity measure: entropy, gini (default), rss (regression)
# seed          Fixed seed for randomization of samples and split candidates
# verbose       Flag indicating verbose debug output
# ------------------------------------------------------------------------------
sf = 1.0/(dt - 1); # sample fraction
M = randomForest_.randomForest(
/*X=*/X, /*y=*/y, /*ctypes=*/R,
/*num_trees=*/dt - 1, /*sample_frac=*/sf,
/*feature_frac=*/1.0, /*max_depth=*/10, /*min_leaf=*/4, /*min_split=*/10,
/*max_features=*/0.5, /*max_values=*/maxV,
/*impurity=*/"gini", /*seed=*/7, /*verbose=*/true
);
yhat = randomForestPredict_.randomForestPredict(
/*X=*/Xv, /*y=*/yv, /*ctypes=*/R, /*M=*/M, /*verbose=*/true
);
}
# hacky way of decoding
yhat = yhat - 1;
yv = yv - 1;
display(yhat, 0, 5, "yhat");
display(yv, 0, 5, "yv");
print("rows in y and yhat:");
print(nrow(yv));
print(nrow(yhat));
print("1 entries (as opposed to 0s) within y and yhat:");
print(sum(yv));
print(sum(yhat));
acc = mean(yhat == yv);
print("accuracy: " + acc);
print("high accuracy expected, due to imbalanced nature of dataset");
# display timing info
msec_factor = as.f32(0.000001);
t_end = now();
print("Time elapsed to script completion: " + (as.f32((t_end - t_start)) * msec_factor) + " ms");
}
main(true);
Hi @bl1zzardx, as discussed in the meeting last Friday, I recommend the following:
1. Try a balanced training data set (in case you're not doing that already), where both classes are equally frequent (a rough sketch of one way to do this is included at the end of this reply).
2. Experiment with the parameters of the random forest, e.g., with more decision trees or a higher maximum depth it could become more accurate.
3. As a last resort (or for testing purposes), you could inspect and modify the model manually, but that would not be the recommended way, since it is hardly reproducible and depends on internals of the model. In our implementation, the random forest model is a matrix where each row represents a single decision tree. If you extract a single row (of length n) and reshape it to (n/2 x 2), then in those rows where the left column is zero, the right column is the predicted class label. You could first try to find out if the model really never produces your desired class label, or if your data just doesn't trigger the case. In theory, you could also modify the threshold values (right column) for individual features (left column, where not zero) manually, but this could have unexpected side effects on the model's behavior. A small inspection sketch follows right after this list.
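To illustrate point 3, here is a rough, untested sketch of such an inspection in DaphneDSL. It assumes M is the model returned by randomForest_.randomForest, that each row holds one linearized tree whose entries can be viewed as (feature, value) pairs as described above, and that the positive class ends up as label 2 after the recode(...) + 1 steps in your script; the exact reshape/indexing details may differ:
# hypothetical sketch: check which class labels the learned model can predict at all
tree = M[0, :];                           # first tree of the forest in linearized form
nodes = reshape(tree, ncol(tree) / 2, 2); # one node per row: [feature id, threshold or label]
isLeaf = nodes[:, 0] == 0;                # leaf nodes have a zero in the left column
leafLabels = nodes[:, 1] * isLeaf;        # predicted labels at leaves (inner nodes become 0)
print("number of leaves predicting the (recoded) positive class 2:");
print(sum(leafLabels == 2));
If this count is zero for every tree, the model cannot predict the positive class at all; if it is non-zero, your validation data simply never reaches those leaves.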
Please let us know if the problem still exists when you try 1. and 2.
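And for point 1, a minimal, untested sketch of how a class-balanced training subset could be built by undersampling the majority class. It assumes y_train is a 0/1 column vector aligned with x_train, that order() can return a row permutation, and that a column matrix of positions can be used for row indexing; the exact signatures may differ in your DAPHNE version:
# hypothetical sketch: undersample the majority class (label 0) so both classes are equally frequent
numPos = as.si64(sum(y_train));          # number of positive examples
idx = order(y_train, 0, false, true);    # row positions sorted by label descending (positives first)
balIdx = idx[0:(2 * numPos), :];         # all positives plus an equal number of negatives
x_bal = x_train[balIdx, :];              # row selection via a matrix of positions
y_bal = y_train[balIdx, :];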