ghost closed this issue 3 years ago.
Note that GridSearchCV + cross_validate will NOT result in a correct Nested CV algorithm.
I'm currently the only maintainer of this repository, and I don't have much time for maintaining this repository as of now. I will try to get a look at it soon and release another version.
You are welcome to contribute with pull requests.
Thank you for your answer and effort.
Also, allowing a pipeline with undersampling / oversampling inside the folds would be useful.
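To illustrate what "sampling inside the folds" means, here is a minimal hand-rolled sketch: the training split of each fold is undersampled, while the held-out fold is scored untouched. This uses only scikit-learn and NumPy; in practice imbalanced-learn's `Pipeline` with a `RandomUnderSampler` step would be the more convenient way to get the same behavior.

```python
# Sketch: undersampling applied inside each fold, never to the held-out
# test split. Pure scikit-learn/NumPy; the `undersample` helper is a
# hypothetical stand-in for e.g. imbalanced-learn's RandomUnderSampler.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

def undersample(X, y, rng):
    """Randomly drop majority-class rows until all classes are balanced."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

rng = np.random.default_rng(0)
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Resample the training split only.
    X_tr, y_tr = undersample(X[train_idx], y[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # Evaluate on the untouched test fold.
    scores.append(model.score(X[test_idx], y[test_idx]))

print(len(scores))
```

The key point is that the resampling happens after the split, so the test fold keeps its original class distribution.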
@casperbh96 You mentioned that "GridSearchCV + cross_validate will NOT result in a correct Nested CV algorithm." Any chance you could elaborate on that? Based on my understanding of nested cross-validation, and my understanding of how the GridSearchCV and cross_validate functions work in scikit-learn, it seems as though it would implement the process correctly, but I am not an expert on nested CV and would be happy to be proven wrong. Thank you!
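For reference, the combination being debated looks roughly like this: `GridSearchCV` as the inner loop, wrapped by `cross_validate` as the outer loop. This is only a sketch of the pattern under discussion, not the nested-cv package's own API; whether it is a "correct" nested CV is exactly the question in this thread.

```python
# Sketch of the GridSearchCV-inside-cross_validate pattern: the outer
# cross_validate call clones and refits the inner search on each outer
# training split.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: 3-fold hyperparameter search.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer loop: 5-fold evaluation of the whole search procedure.
result = cross_validate(inner, X, y, cv=5)

print(len(result["test_score"]))
```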
Because I needed the metrics and sampling, I had to implement it without using the NestedCV() method proposed here.
Here is my nested cross-validation implementation structure: https://drive.google.com/file/d/1lnU8ERqcpFDOY9w3ke-taKWydOiAkMvk/view In each step, the metrics you defined beforehand and the parameters that were applied are stored for later analysis. I hope it helps you understand the logic. Unfortunately, right now I have 2 different projects going on and no time to contribute. Ask me again this summer :)
@casperbh96 You mentioned that "GridSearchCV + cross_validate will NOT result in a correct Nested CV algorithm." Any chance you could elaborate on that? Based on my understanding of nested cross-validation, and my understanding of how the GridSearchCV and cross_validate functions work in scikit-learn, it seems as though it would implement the process correctly, but I am not an expert on nested CV and would be happy to be proven wrong. Thank you!
Hi @visimo-dino.
I'm not aware of any scikit-learn updates since I implemented this nested-cv package, so I'm not accounting for changes that might have occurred since then.
The scikit-learn combination uses all data (incorrect) instead of a subset of all data (correct). The inner loop must only train on your training data, and not the testing data. If you use your testing data to select your hyperparameters, then your model becomes "biased".
Please refer to the image below, sourced from ML From Scratch, where you can also read more about nested cross-validation and find the original paper.
Please also read the answers in this Stack Overflow thread: https://stackoverflow.com/questions/41127976/confusing-example-of-nested-cross-validation-in-scikit-learn
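To make the requirement concrete, here is a minimal nested CV written out by hand, so it is visible that the inner hyperparameter search only ever sees the outer training folds and the outer test fold is used purely for scoring. This is a sketch of the general algorithm, not the nested-cv package's implementation.

```python
# Explicit nested CV: the inner GridSearchCV is fitted on the outer
# TRAINING folds only; the outer test fold is never used for tuning.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
outer = KFold(n_splits=5, shuffle=True, random_state=0)

outer_scores = []
for train_idx, test_idx in outer.split(X):
    # Inner loop: tune hyperparameters on the outer training folds only.
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)
    search.fit(X[train_idx], y[train_idx])
    # Outer loop: score the tuned model on the untouched test fold.
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

print(len(outer_scores))
```

If the inner search were ever fitted on data that later appears in an outer test fold, the reported performance estimate would be optimistically biased, which is the failure mode described above.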
Right, that was my understanding of how nested CV works. What I meant was that I believe the combination of cross_validate and GridSearchCV does actually accomplish this. The call to cross_validate splits the data into, say, 3 folds; the data from only 2 of those folds is then passed to GridSearchCV, which partitions those 2 folds into, say, 5 inner folds. After GridSearchCV has finished optimizing the hyperparameters on the 2/3 of the data it was given, the model (with optimal hyperparameters) is tested on the remaining 1/3 of the data, which was never involved in the hyperparameter tuning that happened in the inner CV. It's possible I'm still missing something, but based on the sklearn source code, it seems to me that this does implement the algorithm spec you posted above.
Please read the Stack Overflow thread I linked. Closing this thread for now.
Hello,
First of all, thanks a lot for this implementation. If you use GridSearchCV() with the cross_validate() function, it also implements nested cross-validation, but its outputs are lacking: it is impossible to find out which hyperparameters or which model were successful; you only get your scores.
Your implementation provides better outputs, thanks for that. But I could find neither a predict() function for the best model nor any detailed metrics output. Can I at least output multiple metrics? For example, I need the Accuracy, Precision and Recall metrics together.
Regards, Hayriye
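On the multi-metric request above: scikit-learn's cross_validate does accept a dict of scorers, so accuracy, precision and recall can be collected in one run, even with a GridSearchCV inner loop. This is a sketch using plain scikit-learn, not the nested-cv package's own API, and the key names (`acc`, `prec`, `rec`) are arbitrary labels chosen here.

```python
# Collecting accuracy, precision and recall in one cross-validation run
# via cross_validate's `scoring` dict; keys become test_<name> entries.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate

X, y = make_classification(n_samples=200, random_state=0)

inner = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1]}, cv=3)
scoring = {"acc": "accuracy", "prec": "precision", "rec": "recall"}

res = cross_validate(inner, X, y, cv=5, scoring=scoring)
print(sorted(k for k in res if k.startswith("test_")))
```

Each `test_<name>` entry in the result dict holds one score per outer fold, which covers the "Accuracy, Precision and Recall together" use case, though it still does not expose the per-fold best hyperparameters without extra work (e.g. `return_estimator=True`).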