TheDigitalFrontier / parallel-decision-trees

Semester project in CS205 Computing Foundations for Computational Science at Harvard School of Engineering and Applied Sciences, spring 2020.
MIT License

AssertionError when running random_forest script on hmeq data #114

Closed by hgupta18 4 years ago

hgupta18 commented 4 years ago

Running either speedup/rf_serial_hmeq.cpp or speedup/rf_openmp_hmeq.cpp fails with:

Assertion failed: (i<this->size()), function value, file ../speedup/../src-openmp/datasets.cpp, line 46.

johannes-kk commented 4 years ago

Does the dataset have any missing values? If the dataset is successfully imported into a DataFrame then missingness likely isn’t the cause.
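One way to check the missing-values question directly is to scan the raw CSV before it reaches the project's loader. The following is only a rough sketch, not code from this repository: `CsvReport`, `splitLine`, and `scanCsv` are hypothetical names, the "NA"/"NaN" markers are an assumption about how missing cells might be encoded, and the splitter does not handle quoted fields.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical summary of a CSV scan; not part of the project's API.
struct CsvReport {
    std::size_t rows = 0;           // data rows scanned (header excluded)
    std::size_t missingFields = 0;  // cells that are empty, "NA", "NaN", or "nan"
    std::size_t raggedRows = 0;     // rows whose field count differs from the header
};

// Split one line on commas. A trailing comma yields a final empty field.
// Quoted fields containing commas are NOT handled (assumption: plain numeric CSV).
std::vector<std::string> splitLine(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    for (char c : line) {
        if (c == ',') {
            fields.push_back(field);
            field.clear();
        } else if (c != '\r') {  // tolerate Windows line endings
            field += c;
        }
    }
    fields.push_back(field);
    return fields;
}

// Treat the first line as the header and compare every later row against it.
CsvReport scanCsv(std::istream& in) {
    CsvReport report;
    std::string line;
    if (!std::getline(in, line)) return report;  // empty input
    const std::size_t width = splitLine(line).size();
    while (std::getline(in, line)) {
        if (line.empty()) continue;  // skip blank lines
        const std::vector<std::string> fields = splitLine(line);
        ++report.rows;
        if (fields.size() != width) ++report.raggedRows;
        for (const std::string& f : fields) {
            if (f.empty() || f == "NA" || f == "NaN" || f == "nan") {
                ++report.missingFields;
            }
        }
    }
    return report;
}
```

Passing a `std::ifstream` opened on the data file to `scanCsv` and printing the three counters would show whether the file contains empty cells or short rows. A short row is one plausible way for an index check like `i < size()` to fail, though that is a guess and not something confirmed in this thread.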

johannes-kk commented 4 years ago

It seems to be a data issue. It's not tied to the number of rows: simply copying Sonar several times to make about 2,000 rows still works with any number of trees.

johannes-kk commented 4 years ago

Never mind, it seems to be an issue with DecisionTree.findBestSplit(). Running the Sonar dataset, artificially expanded to 2,000 rows by copying it several times, with 5,000 trees causes /speedup/rf_openmp.cpp to fail after about 40 minutes with:

Assertion failed: (this->isFitted()), function predict, file ../speedup/../src-openmp/decision_tree.cpp, line 396.
Accuracy train: Abort trap: 6

It is, in other words, a problem caused by the combination of ntrees and dataset length being "too big" in one way or another.
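To pin down which combination is "too big", a small grid sweep over tree counts and row counts can help. This is a minimal sketch under stated assumptions: `Combo` and `firstFailingCombo` are hypothetical names, and `runOk` is a caller-supplied callable that reports whether one run succeeded. Because a failed assertion aborts the whole process, `runOk` would in practice have to launch the benchmark as a separate process and inspect its exit status; that part is not shown.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical result type: the first failing setting, if any.
struct Combo {
    bool found = false;  // true if some combination failed
    int ntrees = 0;
    int nrows = 0;
};

// Try every (ntrees, nrows) pair, smallest row count first and, within each
// row count, smallest tree count first. Return the first pair for which
// runOk(ntrees, nrows) returns false.
Combo firstFailingCombo(std::vector<int> treeCounts,
                        std::vector<int> rowCounts,
                        const std::function<bool(int, int)>& runOk) {
    std::sort(treeCounts.begin(), treeCounts.end());
    std::sort(rowCounts.begin(), rowCounts.end());
    Combo result;
    for (int rows : rowCounts) {
        for (int trees : treeCounts) {
            if (!runOk(trees, rows)) {
                result.found = true;
                result.ntrees = trees;
                result.nrows = rows;
                return result;
            }
        }
    }
    return result;  // found == false: every combination passed
}
```

Starting from cheap settings keeps the sweep short when the failure threshold is low, which matters given that the failing run above took about 40 minutes.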