daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0

Issues with randomForest/decisionTree invocation #681

Closed bl1zzardx closed 7 months ago

bl1zzardx commented 8 months ago

Hi,

I just tried to run the randomForest / decisionTree algorithms for my project and encountered several different errors, depending on the parameterization used. I am using the latest DAPHNE container available on Docker Hub.

The errors I encountered are listed as comments in the script below (see the "encountered issues" block).

To recreate the problem, you would have to download x_valid.csv and y_valid.csv from here and load them based on your setup. I include the respective csv.meta files here; you'd have to remove the .json ending to use them:

def read_csv(path: str) -> matrix {
    df = readMatrix(path);  
    print("Read csv at " + path + ": ", 0);
    # first row with column headers is turned to nan, thus ignored
    # first column has no header and only displays an unnecessary index, thus ignored
    return df[1:,1:]; 
}

import "/mnt/daphne_work/daphne/scripts/algorithms/decisionTree_.daph";
import "/mnt/daphne_work/daphne/scripts/algorithms/decisionTreePredict_.daph";
import "/mnt/daphne_work/daphne/scripts/algorithms/randomForest_.daph";
import "/mnt/daphne_work/daphne/scripts/algorithms/randomForestPredict_.daph";

###########
# load data
###########

x_valid = read_csv("/mnt/daphne_work/ifat_semiconductors/data/x_valid.csv");
y_valid = read_csv("/mnt/daphne_work/ifat_semiconductors/data/y_valid.csv");

X = as.matrix<f64>(x_valid[0:1000, ]);
y = as.matrix<f64>(y_valid[0:1000, 0]);

###########
# bin data
###########

# inits new x_binned variable
x_binned = bin(as.matrix(X[, 0]), 5);
for (i in 1:104) {
    # starts from 1 on purpose, as 0 was used for init
    x_binned = cbind(x_binned, bin(as.matrix(X[, i]), 50));
}

# recodes y to y_binned
y_recoded, y_dict = recode(as.matrix(y), false);

# Switch from DAPHNE's 0-based indexing to SystemDS's 1-based indexing.
x_binned = x_binned + 1;
y_recoded = y_recoded + 1;

###########
# define data types
###########

R = fill(1, 1, ncol(x_binned)+1); # needs to include y ctype at last position
R[, ncol(R) - 1] = as.matrix(2); # overwrites y ctype to be categorical (=2) instead of ordinal (=1)

###########
# train model
###########

# INPUT:
# ------------------------------------------------------------------------------
# X               Feature matrix in recoded/binned representation
# y               Label matrix in recoded/binned representation
# ctypes          Row-Vector of column types [1 scale/ordinal, 2 categorical]
#                 of shape 1-by-(ncol(X)+1), where the last entry is the y type
# max_depth       Maximum depth of the learned tree (stopping criterion)
# min_leaf        Minimum number of samples in leaf nodes (stopping criterion),
#                 odd number recommended to avoid 50/50 leaf label decisions
# min_split       Minimum number of samples in leaf for attempting a split
# max_features    Parameter controlling the number of features used as split
#                 candidates at tree nodes: m = ceil(num_features^max_features)
# max_values      Parameter controlling the number of values per feature used
#                 as split candidates: nb = ceil(num_values^max_values)
# impurity        Impurity measure: entropy, gini (default), rss (regression)
# seed            Fixed seed for randomization of samples and split candidates
# verbose         Flag indicating verbose debug output

# cast to conform to the expected data types
X = as.matrix<f64>(x_binned);
y = as.matrix<f64>(y_recoded);
R = as.matrix<f64>(R);

yhat = fill(0.0, 0, 0);
# num decision trees in random forest
dt = 2; 

# encountered issues
# - dt=1: [error]: Execution error: reshape must retain the number of cells
# - dt=2: [error]: Execution error: CondMatMatMat: condition/then/else matrices must have the same shape
#                  when commenting out the predict call: [error]: Execution error: EwBinaryMat(Dense) - lhs and rhs must either have the same dimensions, or one of them must be a row/column vector with the width/height of the other
# - dt=3: [error]: Execution error: sel must have exactly one entry (row) for each row in arg

maxV = 1.0; # max values

if( dt==1 ) {
  M = decisionTree_.decisionTree(
    /*X=*/X, /*y=*/y, /*ctypes=*/R,
    /*max_depth=*/10, /*min_leaf=*/4, /*min_split=*/10,
    /*max_features=*/1.0, /*max_values=*/maxV,
    /*impurity=*/"gini", /*seed=*/7, /*verbose=*/true
  );
  yhat = decisionTreePredict_.decisionTreePredict(
    /*X=*/X, /*ctypes=*/R, /*M=*/M,
    /*strategy=*/"TT", /*verbose=*/true
  );
}
else {
  sf = 1.0/(dt - 1); # sample fraction
  M = randomForest_.randomForest(
    /*X=*/X, /*y=*/y, /*ctypes=*/R,
    /*num_trees=*/dt - 1, /*sample_frac=*/sf, 
    /*feature_frac=*/1.0, /*max_depth=*/10, /*min_leaf=*/4, /*min_split=*/10,
    /*max_features=*/1.0, /*max_values=*/maxV,
    /*impurity=*/"gini", /*seed=*/7, /*verbose=*/true
  );
  yhat = randomForestPredict_.randomForestPredict(
    /*X=*/X, /*y=*/y, /*ctypes=*/R, /*M=*/M, /*verbose=*/true
  );
}

acc = mean(yhat == y);
print(acc);

pdamme commented 7 months ago

Dear @bl1zzardx, thanks for reporting this issue and for the detailed description; and sorry for the delay. I'm able to reproduce these errors on my system (after adjusting the paths in the DaphneDSL script above).

Your DaphneDSL script looks good to me and should work. However, there seems to be a bug unrelated to decision trees and random forests.

As a quick fix, please don't use your function read_csv, but inline it manually by replacing the lines

x_valid = read_csv("/mnt/daphne_work/ifat_semiconductors/data/x_valid.csv");
y_valid = read_csv("/mnt/daphne_work/ifat_semiconductors/data/y_valid.csv");

by

x_valid = readMatrix("/mnt/daphne_work/ifat_semiconductors/data/x_valid.csv")[1:, 1:];
y_valid = readMatrix("/mnt/daphne_work/ifat_semiconductors/data/y_valid.csv")[1:, 1:];

which is essentially what read_csv does (apart from the print, but that does not cause the problem, as far as I can tell).

With this change, the script runs smoothly for dt = 1, 2, 3 on my system and reports an accuracy of approximately 97% in all three cases.
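
To illustrate what the `[1:, 1:]` slice in the replacement lines above does, here is a minimal sketch on a hypothetical 3x3 matrix (the matrix and the variable names `m` and `inner` are made up for illustration and are not the actual data). It uses only `fill`, the slicing syntax already used in the script, and the `nrow`/`ncol` built-ins:

```
# Hypothetical 3x3 matrix standing in for the CSV contents (not the real data).
m = fill(1.0, 3, 3);

# Same slice as in the workaround: drop row 0 and column 0.
inner = m[1:, 1:];

# Print the resulting shape; a 2x2 result is expected if the slice
# drops exactly one row and one column.
print(nrow(inner));
print(ncol(inner));
```

Because the slice is applied at the top level here rather than inside a user-defined function, this sketch follows the same pattern as the workaround.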


For some background: After a quick investigation, it looks to me like there is a bug related to shape inference and the specialization of user-defined functions. As I have now found out, the problem can be triggered with much simpler scripts. I will create a separate issue for it and we will fix this bug.

bl1zzardx commented 7 months ago

hi, that solved the problem, thank you!