TheDigitalFrontier / parallel-decision-trees

Semester project in CS205 Computing Foundations for Computational Science at Harvard School of Engineering and Applied Sciences, spring 2020.
MIT License
3 stars 1 forks source link

Fix behavior for values that are exactly equal to threshold. #43

Closed gpestre closed 4 years ago

gpestre commented 4 years ago

https://github.com/johannes-kk/cs205_final_project/blob/fd4279679b29c17fd3e6f5f42f934f30e979c1ad/src/datasets.cpp#L406:L414

@johannes-kk I saw you had a comment about this. I tried to implement this as a boolean option (defaults to equal values going left) but it seems to fail on this test: https://github.com/johannes-kk/cs205_final_project/blob/fd4279679b29c17fd3e6f5f42f934f30e979c1ad/src/test_datasets.cpp#L49:L53

gpestre commented 4 years ago

We could use some hard-coded tolerance for comparing doubles.

johannes-kk commented 4 years ago

We're talking about which observations are piped to left vs right group, and the case where the split feature observation value is exactly equal to the threshold value, right? If so, I suggest we just do if value <= threshold: go left; else go right; to make it simple.

Keeps it simple, is consistent with many other implementations I found (e.g. Machine Learning Algorithms from Scratch in Python by Jason Brownlee), and would remove the need for a tolerance when comparing doubles?

gpestre commented 4 years ago

Thanks for looking into it. Agreed about using <=. But strangely it seemed to still send the threshold value to the wrong side when I tried >= so I wonder if someone unexpected is happening with float comparison?

johannes-kk commented 4 years ago

That's odd. Did it happen unexpectedly for a specific set of values (presumably from the contrived dataset)?

gpestre commented 4 years ago

Here's an example:

bug_43.cpp :

#include <iostream>
#include "../src/datasets.cpp"

int main(){

    DataFrame df = DataFrame({
        {2.7810836,2.550537003,0}, 
        {1.465489372,2.362125076,0},
        {7.627531214,2.759262235,1}, 
        {3.396561688,4.400293529,0},
        {8.675418651,-0.242068655,1}, 
        {1.38807019,1.850220317,0}, 
        {3.06407232,3.005305973,0},  
        {5.332441248,2.088626775,1}, 
        {6.922596716,1.77106367,1},  
        {7.673756466,3.508563011,1},
        {7.673756466,3.301233593,1}
    });

    std::cout << "Left split when equal values go left (should include 3.396562):" << std::endl;
    std::cout << *df.split(0,3.396562,true)[0] << std::endl;

    std::cout << "Left split when equal values do not go left (should NOT include 3.396562):" << std::endl;
    std::cout << *df.split(0,3.396562,false)[0] << std::endl;

};

output:

$ g++ -std=c++14 -g3 ../tests/bug_43.cpp -o bug_43
$ ./bug_43
Left split when equal values go left (should include 3.396562):
|  2.781084 |  2.550537 |  0.000000 |
|  1.465489 |  2.362125 |  0.000000 |
|  3.396562 |  4.400294 |  0.000000 |
|  1.388070 |  1.850220 |  0.000000 |
|  3.064072 |  3.005306 |  0.000000 |

Left split when equal values do not go left (should NOT include 3.396562):
|  2.781084 |  2.550537 |  0.000000 |
|  1.465489 |  2.362125 |  0.000000 |
|  3.396562 |  4.400294 |  0.000000 |
|  1.388070 |  1.850220 |  0.000000 |
|  3.064072 |  3.005306 |  0.000000 |
gpestre commented 4 years ago

False alarm -- it works as expected when using the exact threshold of 3.396561688 (my test was unintentionally using a rounded value).

johannes-kk commented 4 years ago

Good that you checked so we resolved it on the right grounds 👍