Laurae2 / Laurae

Advanced High Performance Data Science Toolbox for R by Laurae

Training & testing #12

Open ssobel opened 6 years ago

ssobel commented 6 years ago

Hello Laurae, thanks for your earlier response to my question about emulating daForest.

This time I have a question somewhat related to validation_data=NULL #6, in that I want to make sure I understand how to properly do training and testing to avoid overfitting. I tried running CascadeForest and got excellent results on the training data and on held-out validation data (where I knew the labels), but when I applied the model to test data (exclusive of my train and validation data, where I did not know the labels but the contest website gave me my score), it did not perform very well. So I believe I am overfitting.

Basically, I trained CascadeForest using d_train & d_valid like this:

CascadeForest(training_data = d_train, validation_data = d_valid, training_labels = labels_train, validation_labels = labels_valid, ...)

Where:

- d_train & labels_train = predictor columns & known labels (65% of my total training data)
- d_valid & labels_valid = predictor columns & known labels, exclusive of d_train (the other 35% of my total training data)
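For reference, a minimal sketch of how such a 65/35 split could be made in base R; d_alltrain and labels_alltrain denote the full training data and labels (the names used later in this thread), and the random, unstratified split is an assumption:

```r
# Hypothetical 65/35 random split of the full training set (no stratification).
set.seed(42)
train_idx <- sample(nrow(d_alltrain), size = floor(0.65 * nrow(d_alltrain)))

d_train      <- d_alltrain[train_idx, ]
labels_train <- labels_alltrain[train_idx]
d_valid      <- d_alltrain[-train_idx, ]
labels_valid <- labels_alltrain[-train_idx]
```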

My AUC was about 0.96 when I predicted on d_train, and also when I predicted on d_valid. That made me happy, so I then applied the predict function to d_test, which is exclusive of d_train and d_valid and whose true labels I don't know. But when I submitted my predictions to the contest website I got an AUC of 0.75, nowhere near 0.96.
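Roughly what that evaluation looks like in code, assuming the CascadeForest call above was assigned to a variable named model (hypothetical) and that AUC is computed with the external pROC package:

```r
# Sketch of the evaluation described above; `model` is the CascadeForest fit
# from the call shown earlier, and pROC (an external package) supplies the AUC.
library(pROC)

pred_train <- predict(model, d_train)
pred_valid <- predict(model, d_valid)
auc(labels_train, pred_train)   # roughly 0.96
auc(labels_valid, pred_valid)   # roughly 0.96

pred_test <- predict(model, d_test)   # labels unknown; submitted to the contest site (0.75 AUC)
```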

So that made me think I should use cross-validation in CascadeForest, like this:

CascadeForest(training_data = d_alltrain, validation_data = NULL, training_labels = labels_alltrain, validation_labels = NULL, ...)

Where:

- d_alltrain = all my training data (65% + 35% = 100%)
- labels_alltrain = all my known labels for all my training data

But I got the error noted in validation_data=NULL #6. I have not yet tried the fix you suggested for those lines of code, but is this the proper way to do cross-validation? And if the model then reports a good AUC (from cross-validation on d_alltrain) and I apply it to d_test, is that the proper way to avoid overfitting, so that I should hope for a better score?

Thank you very much.

Laurae2 commented 6 years ago

@ssobel The issue with putting an optimizer (here, boosting) on top of another optimizer is that the validation set gets overfitted: the model overfits the training set heavily and the validation set substantially.

This is also why you are seeing great performance on the validation set but poor performance on the test set. Unfortunately, there is no known theoretical workaround: the model is bound to overfit the validation set. That is the theory, at least; practice says otherwise.

In practice, we use nested cross-validation for this scenario. It gets computationally expensive (for instance, a 5x 5-fold cross-validation means 25 validation sets and 25 test sets), but it parallelizes roughly linearly with the number of cores available, provided there is enough RAM.

Try the following, a 5x 5-fold cross-validation:
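A minimal sketch of one possible layout: it assumes kfold() returns a list of row-index vectors (one per fold), that CascadeForest accepts validation_data = NULL together with a folds argument once the fix from #6 is in place, that predict() works on the fitted model as used earlier in this thread, and that AUC comes from the external pROC package.

```r
# Repeated (5x) 5-fold nested cross-validation, sketched under the
# assumptions stated above.
library(Laurae)
library(pROC)

nested_cv_auc <- function(data, labels, repeats = 5, k = 5, ...) {
  scores <- matrix(NA_real_, nrow = repeats, ncol = k)
  for (r in seq_len(repeats)) {
    set.seed(r)  # a different outer split for every repeat
    outer_fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
    for (i in seq_len(k)) {
      test_idx  <- which(outer_fold_id == i)   # outer fold: never used for training
      train_idx <- which(outer_fold_id != i)
      inner_folds <- kfold(labels[train_idx], k = k)  # inner CV on the outer-train part only
      model <- CascadeForest(training_data     = data[train_idx, ],
                             validation_data   = NULL,
                             training_labels   = labels[train_idx],
                             validation_labels = NULL,
                             folds             = inner_folds,
                             ...)
      preds <- predict(model, data[test_idx, ])
      scores[r, i] <- as.numeric(auc(labels[test_idx], preds))  # outer, unbiased AUC
    }
  }
  scores  # repeats x k matrix of outer-fold AUCs; its mean estimates test performance
}

# Usage sketch:
# cv_auc <- nested_cv_auc(d_alltrain, labels_alltrain, repeats = 5, k = 5)
# mean(cv_auc)
```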

ssobel commented 6 years ago

Hi Laurae, thank you for responding. I am not sure I completely understand; would you mind clarifying? Very much appreciated in advance.

My data is:

If I am interpreting your message above correctly, then I should do this:

- train_data = a random 64% of my d_alltrain, with its labels defined as "train_labels"
- validation_data = a random 16% of my d_alltrain (exclusive of train_data), with its labels defined as "validation_labels"
- test_data = the remaining 20% of my d_alltrain (exclusive of train_data & validation_data), with its labels defined as "test_labels"
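For concreteness, one way this 64/16/20 split could be carved out in base R (random, unstratified split; the variable names follow the description above):

```r
# Hypothetical 64% / 16% / 20% split of d_alltrain (base R, no stratification).
set.seed(123)
n   <- nrow(d_alltrain)
idx <- sample(n)            # shuffled row indices

cut1 <- floor(0.64 * n)
cut2 <- floor(0.80 * n)

train_idx <- idx[1:cut1]
valid_idx <- idx[(cut1 + 1):cut2]
test_idx  <- idx[(cut2 + 1):n]

train_data        <- d_alltrain[train_idx, ]
train_labels      <- labels_alltrain[train_idx]
validation_data   <- d_alltrain[valid_idx, ]
validation_labels <- labels_alltrain[valid_idx]
test_data         <- d_alltrain[test_idx, ]
test_labels       <- labels_alltrain[test_idx]
```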

Then:

folds <- kfold(train_labels, k = 5)

But can you help me understand how to set up the model(s)?

model <- CascadeForest(training_data = ?, validation_data = ?, training_labels = ?, validation_labels = ?, folds = folds, ...)

Thanks again.