Closed gpestre closed 4 years ago
We could use some hard-coded tolerance for comparing doubles.
We're talking about which observations are piped to left vs right group, and the case where the split feature observation value is exactly equal to the threshold value, right? If so, I suggest we just do if value <= threshold: go left; else go right;
to make it simple.
Keeps it simple, is consistent with many other implementations I found (e.g. Machine Learning Algorithms from Scratch in Python by Jason Brownlee), and would remove the need for a tolerance when comparing doubles?
Thanks for looking into it. Agreed about using <=
. But strangely it seemed to still send the threshold value to the wrong side when I tried >=
so I wonder if someone unexpected is happening with float comparison?
That's odd. Did it happen unexpectedly for a specific set of values (presumably from the contrived dataset)?
Here's an example:
bug_43.cpp
:
#include <iostream>
#include "../src/datasets.cpp"
int main(){
DataFrame df = DataFrame({
{2.7810836,2.550537003,0},
{1.465489372,2.362125076,0},
{7.627531214,2.759262235,1},
{3.396561688,4.400293529,0},
{8.675418651,-0.242068655,1},
{1.38807019,1.850220317,0},
{3.06407232,3.005305973,0},
{5.332441248,2.088626775,1},
{6.922596716,1.77106367,1},
{7.673756466,3.508563011,1},
{7.673756466,3.301233593,1}
});
std::cout << "Left split when equal values go left (should include 3.396562):" << std::endl;
std::cout << *df.split(0,3.396562,true)[0] << std::endl;
std::cout << "Left split when equal values do not go left (should NOT include 3.396562):" << std::endl;
std::cout << *df.split(0,3.396562,false)[0] << std::endl;
};
output:
$ g++ -std=c++14 -g3 ../tests/bug_43.cpp -o bug_43
$ ./bug_43
Left split when equal values go left (should include 3.396562):
| 2.781084 | 2.550537 | 0.000000 |
| 1.465489 | 2.362125 | 0.000000 |
| 3.396562 | 4.400294 | 0.000000 |
| 1.388070 | 1.850220 | 0.000000 |
| 3.064072 | 3.005306 | 0.000000 |
Left split when equal values do not go left (should NOT include 3.396562):
| 2.781084 | 2.550537 | 0.000000 |
| 1.465489 | 2.362125 | 0.000000 |
| 3.396562 | 4.400294 | 0.000000 |
| 1.388070 | 1.850220 | 0.000000 |
| 3.064072 | 3.005306 | 0.000000 |
False alarm -- it works as expected when using the exact threshold of 3.396561688
(my test was unintentionally using a rounded value).
Good that you checked so we resolved it on the right grounds 👍
https://github.com/johannes-kk/cs205_final_project/blob/fd4279679b29c17fd3e6f5f42f934f30e979c1ad/src/datasets.cpp#L406:L414
@johannes-kk I saw you had a comment about this. I tried to implement this as a boolean option (defaults to equal values going left) but it seems to fail on this test: https://github.com/johannes-kk/cs205_final_project/blob/fd4279679b29c17fd3e6f5f42f934f30e979c1ad/src/test_datasets.cpp#L49:L53