All, I've fixed the data loading into netsDB so that:
(1) It can now handle input files whose row count is not evenly divisible by the block size (blockX);
(2) It can now handle input files whose label column is not the last column. This fix requires passing a parameter that specifies the label column index.
The new commands for loading data are:
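To illustrate point (1), here is a minimal Python sketch of the block arithmetic only. It is not netsDB's loader code, and the function name block_row_ranges is hypothetical. It assumes rows are split into consecutive blocks and the final block simply holds whatever rows are left over.

```python
def block_row_ranges(total_rows: int, block_size: int) -> list[tuple[int, int]]:
    """Return half-open [start, end) row ranges; the last block may be shorter."""
    if total_rows < 0 or block_size <= 0:
        raise ValueError("total_rows must be >= 0 and block_size must be > 0")
    return [
        (start, min(start + block_size, total_rows))
        for start in range(0, total_rows, block_size)
    ]


# Row counts and block sizes taken from the commands in this post.
higgs = block_row_ranges(2_200_000, 275_000)
tpcxai = block_row_ranges(7_353_840, 500_000)

# Higgs divides evenly: 8 full blocks.
print(len(higgs), higgs[-1][1] - higgs[-1][0])    # 8 275000
# TPCx-AI does not: 14 full blocks plus a final block of 353840 rows.
print(len(tpcxai), tpcxai[-1][1] - tpcxai[-1][0])  # 15 353840
```

With these numbers, the TPCx-AI file is the case that exercises the partial last block.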
bin/testDecisionForest Y 2200000 28 275000 0 F A 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv
bin/testDecisionForestWithCrossProduct Y 2200000 28 275000 0 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv model-inference/decisionTree/experiments/models/higgs_randomforest_10_8_netsdb RandomForest
The parameter 0 (the argument following 275000) specifies that the label column for this file is column 0.
The complete commands and outputs for the Higgs dataset are:
Please recompile with the flag: #define MAX_BLOCK_SIZE 275000
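As a rough illustration of what a label column index means, the sketch below separates one parsed row into features and a label. It is not the netsDB implementation. The helper name split_label and the sample values are made up for illustration, and zero-based column indexing is assumed.

```python
def split_label(row: list[float], label_index: int) -> tuple[list[float], float]:
    """Split one parsed CSV row into (features, label) at a zero-based index."""
    if not 0 <= label_index < len(row):
        raise IndexError(f"label index {label_index} out of range for {len(row)} columns")
    label = row[label_index]
    features = row[:label_index] + row[label_index + 1:]
    return features, label


# Made-up rows: label in the first column vs. label in the last column.
label_first = [1.0, 0.5, 0.7]
label_last = [0.5, 0.7, 1.0]

print(split_label(label_first, 0))  # ([0.5, 0.7], 1.0)
print(split_label(label_last, 2))   # ([0.5, 0.7], 1.0)
```

Both calls yield the same features and label, which is the behavior the new parameter is meant to allow regardless of where the label sits in the file.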
Without CrossProduct
scripts/cleanupNode.sh
./scripts/startPseudoCluster.py 8 20000
bin/testDecisionForest Y 2200000 28 275000 0 F A 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv
bin/testDecisionForest N 2200000 28 275000 0 F A 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv model-inference/decisionTree/experiments/models/higgs_randomforest_10_8_netsdb RandomForest
Output:
output count:2200000
positive count:1033098
With CrossProduct
scripts/cleanupNode.sh
./scripts/startPseudoCluster.py 8 20000
bin/testDecisionForestWithCrossProduct Y 2200000 28 275000 0 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv model-inference/decisionTree/experiments/models/higgs_randomforest_10_8_netsdb RandomForest
bin/testDecisionForestWithCrossProduct N 2200000 28 275000 0 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv model-inference/decisionTree/experiments/models/higgs_randomforest_10_8_netsdb RandomForest 10
Output:
total count:2200000
positive count:1033098
The complete commands and outputs for the TPCx-AI dataset are:
Please recompile with the flag: #define MAX_BLOCK_SIZE 500000
Without CrossProduct
scripts/cleanupNode.sh
./scripts/startPseudoCluster.py 8 20000
bin/testDecisionForest Y 7353840 7 500000 7 F A 32 model-inference/decisionTree/experiments/dataset/tpcxai_fraud_test.csv
bin/testDecisionForest N 7353840 7 500000 7 F A 32 model-inference/decisionTree/experiments/dataset/tpcxai_fraud_test.csv model-inference/decisionTree/experiments/models/tpcxai_fraud_randomforest_10_8_netsdb RandomForest
Output:
output count:7353840
positive count:99593
With CrossProduct
scripts/cleanupNode.sh
./scripts/startPseudoCluster.py 8 20000
bin/testDecisionForestWithCrossProduct Y 7353840 7 500000 7 32 model-inference/decisionTree/experiments/dataset/tpcxai_fraud_test.csv model-inference/decisionTree/experiments/models/tpcxai_fraud_randomforest_10_8_netsdb RandomForest
bin/testDecisionForestWithCrossProduct N 7353840 7 500000 7 32 model-inference/decisionTree/experiments/dataset/tpcxai_fraud_test.csv model-inference/decisionTree/experiments/models/tpcxai_fraud_randomforest_10_8_netsdb RandomForest 10
Output:
total count:7353840
positive count:99593
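One quick sanity check on the outputs above is that the runs with and without CrossProduct report the same row count and the same positive count. The sketch below parses the two output formats and compares them. The parser is a hypothetical helper, not part of netsDB, and it assumes the output contains only the "output count:", "total count:" and "positive count:" fields shown in this post.

```python
import re

# Matches "output count:N", "total count:N" and "positive count:N".
COUNT_FIELD = re.compile(r"(output|total|positive) count:\s*(\d+)")


def parse_counts(text: str) -> dict[str, int]:
    """Extract row and positive counts, treating 'output' and 'total' as the row count."""
    counts: dict[str, int] = {}
    for name, value in COUNT_FIELD.findall(text):
        key = "positive" if name == "positive" else "rows"
        counts[key] = int(value)
    return counts


# Outputs copied from the TPCx-AI runs reported in this post.
without_cross_product = parse_counts("output count:7353840\npositive count:99593")
with_cross_product = parse_counts("total count:7353840\npositive count:99593")

print(without_cross_product == with_cross_product)  # True
```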