Test:Validation:Train split

SSLPP commented 3 years ago

Shouldn't the new test-train split in def _HPOpt(self): be test_size=self.test_size/(1-self.val_size)? We already updated the shape of X in _set_validation_set(self, X, y).

I'm assuming that the test, train, and validation set ratios are defined on the original data.

erdogant commented 3 years ago

Thanks for looking carefully at the code! I'm not sure whether I understand your question. Let me first explain how the splits are made at the moment.

1: We have input data X.
2: X is split into 80% train (self.X) and 20% validation (self.X_val).
3: From this point on, the code only refers to self.X.
4: In def _HPOpt, self.X is split into self.train (80%) and self.test (20%).
5: Parameter optimization is performed on the sets from point 4 on.
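
A minimal sketch of the two-stage split described above, assuming a plain shuffled index split rather than hgboost's actual implementation. The function name two_stage_split, its parameters, and the sample count of 1000 are hypothetical and chosen only for illustration.

```python
import random


def two_stage_split(n_samples, val_size=0.2, test_size=0.2, seed=0):
    """Hold out validation indices first, then split the remainder into test/train.

    Hypothetical helper: it only mirrors the order of operations described
    in the comment above, not the library's code.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Stage 1: the validation fraction is taken from the full data.
    n_val = round(val_size * n_samples)
    val_idx, rest_idx = indices[:n_val], indices[n_val:]

    # Stage 2: the test fraction is taken from the remainder, not from the full data.
    n_test = round(test_size * len(rest_idx))
    test_idx, train_idx = rest_idx[:n_test], rest_idx[n_test:]
    return train_idx, test_idx, val_idx


train_idx, test_idx, val_idx = two_stage_split(1000)
print(len(train_idx), len(test_idx), len(val_idx))  # 640 160 200
```

Because the second fraction is applied to the reduced set, the test and train shares are relative to self.X and not to the original input.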

SSLPP commented 3 years ago

Hi, thanks for looking into this.

Assume the input shape of X is 1000 x *

Our Test:Validation:Train split is 0.2:0.2:0.6

Splitting the validation set

Validation set @20% = 200
Test+Train set @80% = 800

Splitting the test+train set

New X.shape = 800

What we have now:
X_Test = 0.2 x 800 = 160 (i.e., 16% of the data instead of 20% from the 0.2:0.2:0.6 split)
X_Train = 0.8 x 800 = 640 (i.e., 64% of the data instead of 60% from the 0.2:0.2:0.6 split)

What it ideally should be:
X_Test = 0.2 x (1/0.8) x 800 = 200 (i.e., 20% of the data)
X_Train = 0.6 x (1/0.8) x 800 = 600 (i.e., 60% of the data)
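
The rescaling above can be sketched as follows. This is only an illustration of the arithmetic, assuming that test_size and val_size are both fractions of the original data; rescaled_test_size is a hypothetical helper, not a function from hgboost.

```python
def rescaled_test_size(test_size, val_size):
    """Convert a test fraction of the original data into a fraction of the remainder.

    Hypothetical helper: after the validation set is removed, only
    (1 - val_size) of the data is left, so the test fraction is divided by that.
    """
    if not (0 <= val_size < 1 and 0 <= test_size and test_size + val_size < 1):
        raise ValueError("test_size and val_size must be fractions that sum to less than 1")
    return test_size / (1.0 - val_size)


n_samples = 1000
val_size, test_size = 0.2, 0.2

n_val = round(val_size * n_samples)    # validation set, taken from the full data
n_rest = n_samples - n_val             # what remains for test + train
n_test = round(rescaled_test_size(test_size, val_size) * n_rest)
n_train = n_rest - n_test

print(n_train, n_test, n_val)  # 600 200 200
```

With the rescaled fraction (0.25 here), the second split yields 20% test and 60% train measured against the original 1000 samples.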

Also, ideally, the validation set should be used for hyperparameter selection and the test set for testing. Your notation is clear from the GitHub page, but it could lead to confusion in the future.

Thanks for your work.

erdogant commented 3 years ago

Agree.

Update with:

pip install -U hgboost

Check the version, should be >= 0.1.6

import hgboost
print(hgboost.__version__)

erdogant / hgboost

Test:Validation:Train split #2