asu-cactus / netsdb

A system that seamlessly integrates Big Data processing and machine learning model serving in a distributed relational database
Apache License 2.0

support csv input with missing values for netsdb #76

Closed hguan6 closed 1 year ago

hguan6 commented 1 year ago

The C++ code looked very messy in my IDE (VSCode), so I ran autoformat to clean it up, which produced a lot of line changes in the original files that are unrelated to the actual fix. I tested with the following commands:

./scripts/cleanupNode.sh
./scripts/startPseudoCluster.py 8 15000

bin/testDecisionForestWithCrossProduct Y 236750 968 5500 968 32 1 model-inference/decisionTree/experiments/dataset/bosch.csv_test.csv model-inference/decisionTree/experiments/models/bosch_xgboost_10_8_netsdb XGBoost 10

bin/testDecisionForestWithCrossProduct N 236750 968 5500 968 32 1 model-inference/decisionTree/experiments/dataset/bosch.csv_test.csv model-inference/decisionTree/experiments/models/bosch_xgboost_10_8_netsdb XGBoost 10

I know this is one of the worst settings in the world, but it worked.

The actual changes in this commit:

- DataTypes.h: add isMissingTrackLeft to struct Node.
- Tree.h: set tree[nodeID].isMissTrackLeft = true in processInnerNodes() and processLeafNodes(); change isMissingTrackLeft in processRelations(); combine the two prediction loops in predict(); add logic for missing values in predict().
- FFMatrixUtils.cc: modify the loop that processes each line (starting from line 373): first, combine the remainder with the main loop body; then add logic to set the value to std::nan("") for missing values.
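For reviewers skimming past the formatting noise, the missing-value handling described above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in this commit: the Node layout, parseCsvLine(), and this predict() are hypothetical, and only the isMissingTrackLeft flag and the use of std::nan("") come from the description. It also assumes the XGBoost-style convention that a value below the threshold goes left.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified node layout for illustration only.
struct Node {
    int featureIndex;         // feature tested at an inner node
    double value;             // split threshold (inner node) or prediction (leaf)
    int leftChild;            // index of the left child in the tree vector
    int rightChild;           // index of the right child in the tree vector
    bool isLeaf;              // true if this node is a leaf
    bool isMissingTrackLeft;  // branch to take when the feature is missing
};

// Split one CSV line on commas; an empty field is stored as NaN.
// Assumes every non-empty field is a valid number (std::stod throws otherwise).
std::vector<double> parseCsvLine(const std::string& line) {
    std::vector<double> row;
    std::size_t start = 0;
    while (true) {
        std::size_t end = line.find(',', start);
        std::size_t len = (end == std::string::npos) ? std::string::npos : end - start;
        std::string field = line.substr(start, len);
        row.push_back(field.empty() ? std::nan("") : std::stod(field));
        if (end == std::string::npos) break;
        start = end + 1;
    }
    return row;
}

// Walk a single tree from the root (index 0) to a leaf.
// A NaN feature follows the node's default direction instead of the comparison.
double predict(const std::vector<Node>& tree, const std::vector<double>& row) {
    int cur = 0;
    while (!tree[cur].isLeaf) {
        const Node& n = tree[cur];
        double v = row[n.featureIndex];
        bool goLeft = std::isnan(v) ? n.isMissingTrackLeft : (v < n.value);
        cur = goLeft ? n.leftChild : n.rightChild;
    }
    return tree[cur].value;
}
```

With this sketch, a line such as 1.5,,3 parses to three values with NaN in the middle, and predict() sends a row whose tested feature is NaN down whichever branch isMissingTrackLeft selects. The NaN check has to come before the comparison, because any comparison against NaN is false and would otherwise always route missing values to the right.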

jiazou-bigdata commented 1 year ago

Let me check.