casper-hansen / Nested-Cross-Validation

Nested cross-validation for unbiased predictions. Can be used with Scikit-Learn, XGBoost, Keras and LightGBM, or any other estimator that implements the scikit-learn interface.
MIT License

Is it possible to use a separate testing set at the end, on the most successful model? #12

Closed ghost closed 3 years ago

ghost commented 4 years ago

Hello,

First of all, thanks very much for this implementation. If you use GridSearchCV() with the cross_validate() function, that also implements nested cross-validation. But it lacks outputs: it is impossible to find which hyperparameters or which model were successful. You only get your scores.

Your implementation has better outputs; thanks for that. But I could find neither a predict() function for the best model nor any detailed metrics output. Can I at least output multiple metrics? For example, I need accuracy, precision, and recall together.
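To be concrete, here is a minimal, self-contained sketch of the kind of multi-metric output I mean, using plain scikit-learn; the estimator, grid, and generated data are placeholder choices. Note that cross_validate's scoring parameter accepts a list of metrics, and return_estimator=True at least exposes each outer fold's fitted search so its best_params_ can be read back:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_validate

X, y = make_classification(random_state=0)  # placeholder data
param_grid = {"n_estimators": [100, 300]}   # placeholder grid

inner_search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)

# scoring takes a list of metrics; return_estimator=True keeps each outer
# fold's fitted GridSearchCV so the chosen hyperparameters can be inspected.
results = cross_validate(
    inner_search, X, y, cv=5,
    scoring=["accuracy", "precision", "recall"],
    return_estimator=True,
)

print(results["test_accuracy"], results["test_precision"], results["test_recall"])
for estimator in results["estimator"]:
    print(estimator.best_params_)  # hyperparameters picked in that outer fold
```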

Regards, Hayriye

casper-hansen commented 4 years ago

Note that GridSearchCV + cross_validate will NOT result in a correct Nested CV algorithm.

I'm currently the only maintainer of this repository, and I don't have much time to maintain it right now. I will try to take a look soon and release another version.

You are welcome to contribute with pull requests.

ghost commented 4 years ago

Thank you for your answer and effort.

Also, allowing a pipeline for undersampling/oversampling inside the folds would be useful.
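To be concrete, here is a minimal sketch of what I mean, assuming imbalanced-learn (imblearn) is available; its Pipeline applies the sampler only during fit, so only the training part of each fold is resampled. The sampler and classifier below are placeholder choices:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Imbalanced placeholder data: roughly a 90% / 10% class split.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# imblearn's Pipeline resamples only inside fit, so the validation split of
# each fold is never oversampled.
pipeline = Pipeline([
    ("sampler", SMOTE(random_state=0)),
    ("model", RandomForestClassifier(random_state=0)),
])
param_grid = {"model__n_estimators": [100, 300]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
```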

visimo-dino commented 3 years ago

@casperbh96 You mentioned that "GridSearchCV + cross_validate will NOT result in a correct Nested CV algorithm." Any chance you could elaborate on that? Based on my understanding of nested cross-validation, and of how the GridSearchCV and cross_validate functions work in scikit-learn, it seems as though the combination would implement the process correctly, but I am not an expert on nested CV and would be happy to be proven wrong. Thank you!

ghost commented 3 years ago

Because I needed the metrics and sampling, I had to implement it without using the NestedCV() method proposed here.

Here is the structure of my nested cross-validation implementation: https://drive.google.com/file/d/1lnU8ERqcpFDOY9w3ke-taKWydOiAkMvk/view In each step, the metrics you defined beforehand and the parameters that were applied are stored for later inspection. I hope it helps you understand the logic. Unfortunately, I have two different projects going on right now and no time to contribute. Ask me again this summer :)
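The linked diagram has the full structure, but the general logic is roughly the following sketch: an explicit outer loop, an inner hyperparameter search on the training portion only, and the chosen parameters plus several metrics stored per fold. The estimator, grid, and data below are placeholders, not the linked implementation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(random_state=0)  # placeholder data
param_grid = {"n_estimators": [100, 300]}   # placeholder grid

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_records = []  # parameters and metrics per outer fold, kept for later review

for train_idx, test_idx in outer_cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Inner loop: the hyperparameter search sees only the outer training portion.
    search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
    search.fit(X_train, y_train)

    # The outer test fold is used only to score the refit best model.
    y_pred = search.predict(X_test)
    fold_records.append({
        "params": search.best_params_,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    })
```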

casper-hansen commented 3 years ago

> @casperbh96 You mentioned that "GridSearchCV + cross_validate will NOT result in a correct Nested CV algorithm." Any chance you could elaborate on that? Based on my understanding of nested cross-validation, and of how the GridSearchCV and cross_validate functions work in scikit-learn, it seems as though the combination would implement the process correctly, but I am not an expert on nested CV and would be happy to be proven wrong. Thank you!

Hi @visimo-dino.

I'm not aware of any scikit-learn updates since I implemented this nested-cv package, so I'm not accounting for changes that might have occurred since then.

The scikit-learn combination uses all of the data (incorrect) instead of a subset of it (correct). The inner loop must train only on your training data, never on the testing data. If you use your testing data to select your hyperparameters, your model becomes "biased".

Please refer to the image below, sourced from ML From Scratch, where you can also read more about nested cross-validation and find the original paper.

[Image: diagram of the nested cross-validation procedure, from ML From Scratch]

Please also read the answers in this Stack Overflow question: https://stackoverflow.com/questions/41127976/confusing-example-of-nested-cross-validation-in-scikit-learn

visimo-dino commented 3 years ago

> The scikit-learn combination uses all of the data (incorrect) instead of a subset of it (correct). The inner loop must train only on your training data, never on the testing data.

Right, that was my understanding of how nested CV works. What I meant was that I believe the combination of cross_validate and GridSearchCV does actually accomplish this. The call to cross_validate splits the data into, let's say, 3 different folds; then only the data from 2 of those folds get sent to GridSearchCV, which then partitions those 2 folds into, let's say, 5 different folds. After GridSearchCV has finished optimizing the hyperparameters on the 2/3 of the data it was given, the model (with optimal hyperparameters) is then tested on the remaining 1/3 of the data, which was in no way ever involved in the hyperparameter tuning process that happened in the inner CV. It's possible I'm still missing something, but based on the sklearn source code, it seems to me that it does implement the algorithm spec you posted above.
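To make the combination concrete, here is a minimal sketch of what I am describing (3 outer folds, 5 inner folds; the estimator, grid, and data are placeholder choices), along the lines of the nested cross-validation example in the scikit-learn documentation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(random_state=0)          # placeholder data
param_grid = {"C": [1, 10], "gamma": [0.01, 0.1]}   # placeholder grid

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)

# cross_val_score clones the GridSearchCV for each outer split and fits it
# on that split's 2/3 training portion only; the held-out 1/3 is used purely
# to score the refit best model.
clf = GridSearchCV(SVC(), param_grid, cv=inner_cv)
nested_scores = cross_val_score(clf, X, y, cv=outer_cv)
print(nested_scores)
```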

casper-hansen commented 3 years ago

Please read the Stack Overflow question I linked. Closing this thread for now.