SSLPP closed 3 years ago
Thanks for looking carefully at the code! I'm not sure whether I understand your question. Let me first explain how the splits are made at the moment.
1: We have input data X
2: X is split into 80% train (self.X) and 20% validation (self.X_val)
3: From this point on, the code only refers to self.X
Hi, thanks for looking into this.
Assume the input shape of X is 1000 x *
Our Test:Validation:Train split is 0.2:0.2:0.6
Splitting the validation set
Splitting the test+train set. New X.shape = 800
What we have now:
X_Test = 0.2 × 800 = 160 (i.e., 16% of the data instead of the 20% from the 0.2:0.2:0.6 split)
X_Train = 800 - 160 = 640 (i.e., 64% of the data instead of the 60% from the 0.2:0.2:0.6 split)
What it ideally should be:
X_Test = 0.2 × (1/0.8) × 800 = 200 (i.e., 20% of the data)
X_Train = 0.6 × (1/0.8) × 800 = 600 (i.e., 60% of the data)
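The numbers above can be checked with a short sketch. It assumes the validation set is held out first and the test fraction is then applied to whatever remains. The helper `split_sizes` and its `rescale` flag are hypothetical names used only for illustration; they are not part of hgboost.

```python
def split_sizes(n, test_size=0.2, val_size=0.2, rescale=False):
    """Return (n_train, n_test, n_val) for a two-stage split of n samples.

    Hypothetical helper for illustration only, not hgboost code.
    """
    # Stage 1: hold out the validation set from the full data.
    n_val = round(n * val_size)
    n_rest = n - n_val

    # Stage 2: take the test set from the remainder. Without rescaling,
    # test_size is applied to the remainder rather than the original data.
    frac = test_size / (1 - val_size) if rescale else test_size
    n_test = round(n_rest * frac)
    n_train = n_rest - n_test
    return n_train, n_test, n_val


print(split_sizes(1000))                # (640, 160, 200)
print(split_sizes(1000, rescale=True))  # (600, 200, 200)
```

With `rescale=True` the sizes match the intended 0.6:0.2:0.2 ratio on the original 1000 samples.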
Also, ideally, the validation set should be used for hyperparameter selection and the test set for testing. Your notation is clear from the GitHub page but could lead to future confusion.
Thanks for your work.
Agree.
Update with:
pip install -U hgboost
Check the version; it should be >= 0.1.6
import hgboost
print(hgboost.__version__)
Shouldn't the new test-train split be
test_size=self.test_size/(1-self.val_size)
in def _HPOpt(self)? We updated the shape of X in _set_validation_set(self, X, y).
I'm assuming that the test, train, and validation set ratios are defined on the original data.
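Under that assumption, a three-way split could look roughly like the sketch below. It uses only the standard library rather than hgboost's own splitting code, and `three_way_split` is a hypothetical name introduced for illustration.

```python
import random


def three_way_split(n, test_size=0.2, val_size=0.2, seed=0):
    """Split indices 0..n-1 into (train, test, val) lists.

    Hypothetical helper, not hgboost's implementation. Both ratios are
    interpreted as fractions of the ORIGINAL n samples.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)

    # Stage 1: take the validation set from the full data.
    n_val = round(n * val_size)
    val, rest = idx[:n_val], idx[n_val:]

    # Stage 2: rescale so that test_size still refers to the original data,
    # even though it is now applied to the smaller remainder.
    test_frac = test_size / (1 - val_size)
    n_test = round(len(rest) * test_frac)
    test, train = rest[:n_test], rest[n_test:]
    return train, test, val


train, test, val = three_way_split(1000)
print(len(train), len(test), len(val))  # 600 200 200
```

The three index lists are disjoint and together cover all samples, so each ratio keeps its meaning relative to the original data.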