Data Loading into netsDB

jiazou-bigdata commented 2 years ago

All, I've fixed the data loading into netsDB so that:

(1) It can now handle input files whose row count is not evenly divisible by the block size (blockX);

(2) It can now handle input files whose label column is not the last column. This fix requires passing a parameter that specifies the label column index.
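To illustrate point (1), here is a minimal Python sketch of splitting rows into blocks when the row count is not a multiple of the block size. It is not the netsDB loader code; the function name block_ranges is hypothetical, and the row counts and block sizes are the ones used in the commands in this issue.

```python
def block_ranges(num_rows, block_size):
    """Yield (start, end) row ranges; the final block may be shorter."""
    if num_rows < 0 or block_size <= 0:
        raise ValueError("num_rows must be >= 0 and block_size must be > 0")
    for start in range(0, num_rows, block_size):
        # Clamp the end so the last block stops at the final row.
        yield start, min(start + block_size, num_rows)


# Higgs: 2200000 rows divide evenly into blocks of 275000.
higgs = list(block_ranges(2200000, 275000))
print(len(higgs), higgs[-1])    # 8 (1925000, 2200000)

# TPCx-AI: 7353840 rows do not divide evenly by 500000.
tpcxai = list(block_ranges(7353840, 500000))
print(len(tpcxai), tpcxai[-1])  # 15 (7000000, 7353840)
```

With the TPCx-AI numbers, the last block holds 353840 rows rather than 500000, which is the case the fix addresses.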

The new commands for loading data are:

bin/testDecisionForest Y 2200000 28 275000 0 F A 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv

bin/testDecisionForestWithCrossProduct Y 2200000 28 275000 0 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv model-inference/decisionTree/experiments/models/higgs_randomforest_10_8_netsdb RandomForest

The parameter 0 specifies that the label column for this file is column 0.
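As a rough illustration of what a label column index means, the sketch below separates the label from the features for one parsed CSV row. It is only a sketch, not the netsDB implementation: the helper split_label is hypothetical, and the index is assumed to be 0-based, as the column 0 example above suggests.

```python
def split_label(row, label_index):
    """Return (features, label) for one row, given a 0-based label index."""
    if not 0 <= label_index < len(row):
        raise IndexError("label_index is outside the row")
    # Features are every column except the label column.
    features = row[:label_index] + row[label_index + 1:]
    return features, row[label_index]


# Label in the first column (as in the Higgs command above).
print(split_label(["1", "0.5", "0.7"], 0))  # (['0.5', '0.7'], '1')

# Label in the last column.
print(split_label(["0.5", "0.7", "1"], 2))  # (['0.5', '0.7'], '1')
```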

The complete commands and outputs for the Higgs dataset are:

Please recompile with the following definition: #define MAX_BLOCK_SIZE 275000

Without CrossProduct

scripts/cleanupNode.sh

./scripts/startPseudoCluster.py 8 20000

bin/testDecisionForest Y 2200000 28 275000 0 F A 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv

bin/testDecisionForest N 2200000 28 275000 0 F A 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv model-inference/decisionTree/experiments/models/higgs_randomforest_10_8_netsdb RandomForest

Output:

output count:2200000 positive count:1033098

With CrossProduct

scripts/cleanupNode.sh

./scripts/startPseudoCluster.py 8 20000

bin/testDecisionForestWithCrossProduct Y 2200000 28 275000 0 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv model-inference/decisionTree/experiments/models/higgs_randomforest_10_8_netsdb RandomForest

bin/testDecisionForestWithCrossProduct N 2200000 28 275000 0 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv model-inference/decisionTree/experiments/models/higgs_randomforest_10_8_netsdb RandomForest 10

Output:

total count:2200000 positive count:1033098

The complete commands and outputs for the TPCx-AI dataset are:

Please recompile with the following definition: #define MAX_BLOCK_SIZE 500000

Without CrossProduct

scripts/cleanupNode.sh

./scripts/startPseudoCluster.py 8 20000

bin/testDecisionForest Y 7353840 7 500000 7 F A 32 model-inference/decisionTree/experiments/dataset/tpcxai_fraud_test.csv

bin/testDecisionForest N 7353840 7 500000 7 F A 32 model-inference/decisionTree/experiments/dataset/tpcxai_fraud_test.csv model-inference/decisionTree/experiments/models/tpcxai_fraud_randomforest_10_8_netsdb RandomForest

Output:

output count:7353840 positive count:99593

With CrossProduct

scripts/cleanupNode.sh

./scripts/startPseudoCluster.py 8 20000

bin/testDecisionForestWithCrossProduct Y 7353840 7 500000 7 32 model-inference/decisionTree/experiments/dataset/tpcxai_fraud_test.csv model-inference/decisionTree/experiments/models/tpcxai_fraud_randomforest_10_8_netsdb RandomForest

bin/testDecisionForestWithCrossProduct N 7353840 7 500000 7 32 model-inference/decisionTree/experiments/dataset/tpcxai_fraud_test.csv model-inference/decisionTree/experiments/models/tpcxai_fraud_randomforest_10_8_netsdb RandomForest 10

Output:

total count:7353840 positive count:99593
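When reproducing these runs, it may help to check mechanically that the two variants agree. The Python sketch below parses the count lines shown above and compares them. The helper parse_counts is hypothetical, and it assumes the output lines have exactly the format pasted in this issue ("output count:" without CrossProduct, "total count:" with CrossProduct).

```python
import re

# Matches both "output count:N positive count:M" and "total count:N positive count:M".
_COUNTS = re.compile(r"(?:output|total) count:(\d+)\s+positive count:(\d+)")


def parse_counts(line):
    """Return (count, positive_count) parsed from one output line."""
    match = _COUNTS.search(line)
    if match is None:
        raise ValueError("unrecognized output line: " + repr(line))
    return int(match.group(1)), int(match.group(2))


without_cp = parse_counts("output count:7353840 positive count:99593")
with_cp = parse_counts("total count:7353840 positive count:99593")
print(without_cp == with_cp)  # True
```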

venkate5hgunda commented 2 years ago

Checking this 🤚🏻

venkate5hgunda commented 2 years ago

I can reproduce these results, professor. Looks good to me. 👍🏻

asu-cactus / netsdb

Data Loading into netsDB #72