dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Enhanced cross-validation inputs for Dask-ML HyperbandSearchCV #940

Open rochelle-worsnop opened 1 year ago

rochelle-worsnop commented 1 year ago

Hello, is it possible to add the same "cv" inputs for Dask-ML HyperbandSearchCV that are currently available for Dask-ML RandomizedSearchCV and Dask-ML GridSearchCV? HyperbandSearchCV currently only accepts a test_size argument, but I would like to be able to define my own cross-validation iterable containing the indices of my training and validation splits for model tuning and evaluation. RandomizedSearchCV and GridSearchCV both allow this sort of cross-validation input, as seen below.

[screenshot: documentation of the cv parameter for RandomizedSearchCV and GridSearchCV]
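
For illustration, a minimal sketch of the kind of cv input I mean, as it already works with RandomizedSearchCV (assuming dask_ml.model_selection.RandomizedSearchCV accepts an iterable of (train, validation) index arrays like scikit-learn does; the data and parameter grid below are made up):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from dask_ml.model_selection import RandomizedSearchCV

X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# custom folds: each entry is a (train_indices, validation_indices) pair
custom_cv = [
    (np.arange(0, 80), np.arange(80, 100)),
    (np.arange(20, 100), np.arange(0, 20)),
]

search = RandomizedSearchCV(
    SGDClassifier(),
    {"alpha": [1e-4, 1e-3, 1e-2]},
    cv=custom_cv,   # explicit train/validation splits
    n_iter=3,
)
search.fit(X, y)
```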

Thank you!

TomAugspurger commented 1 year ago

I'm not sure offhand. cc @stsievert if you have thoughts.

stsievert commented 1 year ago

Thanks for the ping @rochelle-worsnop. Technically, accepting fancier CV splits is possible and the implementation isn't too complex (but is not simple).

Why do you want a CV split with Hyperband? Per the documentation, Hyperband is best suited to computationally constrained problems, which I'd presume rules out the extra computation of fancier CV splits. HyperbandSearchCV is basically RandomizedSearchCV with early stopping of poorly performing models. What's your use case, and why not use RandomizedSearchCV?
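
For reference, a rough sketch of how HyperbandSearchCV is used today (synthetic data; SGDClassifier and the search space are just placeholders). It needs an estimator that implements partial_fit, and test_size is the only knob for the train/validation split:

```python
import dask.array as da
from dask.distributed import Client
from scipy import stats
from sklearn.linear_model import SGDClassifier
from dask_ml.model_selection import HyperbandSearchCV

if __name__ == "__main__":
    client = Client()  # Hyperband uses the distributed scheduler

    X = da.random.random((10_000, 3), chunks=(1_000, 3))
    y = da.random.randint(0, 2, size=(10_000,), chunks=(1_000,))

    search = HyperbandSearchCV(
        SGDClassifier(tol=1e-3),
        {"alpha": stats.loguniform(1e-6, 1e-2)},
        max_iter=27,    # computational budget: max partial_fit calls per model
        test_size=0.2,  # single random train/validation split
    )
    search.fit(X, y, classes=[0, 1])
    print(search.best_params_)
```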

rochelle-worsnop commented 1 year ago

Thanks for the reply! I tried to give the implementation a go and unfortunately wasn't able to make it work given my limited object-oriented coding experience. I'm currently using Dask's RandomizedSearchCV, but thought that HyperbandSearchCV would let me test a larger sample of hyperparameter combinations from my search space in approximately the same amount of time as a smaller sample with RandomizedSearchCV, since it would stop tuning combinations that result in a poor model. Essentially, I saw HyperbandSearchCV as a time saver that would also let me spend some of the saved time "finding" better hyperparameters by testing more combinations overall. I searched through 200 possible hyperparameter combinations (maybe that's too many?) with randomized search, and that took about 5.5 hours with all of the computing resources I have access to. I have ~700 different models (e.g., a different model for each month, forecast lead time, and one for each of 20 years) that I need to tune in the end, so any time saved would be helpful. I guess my case isn't computationally constrained with regard to tuning one model, but it is given the total number of models I need to tune for my project. Do you think my case would still be a good fit for HyperbandSearchCV? Either way, I think I need to be able to specify a cross-validation iterable.

stsievert commented 1 year ago

Thanks for your use case @rochelle-worsnop. I think 700 models counts as computationally constrained.

I might be able to hack together a solution for your desired use case, but I'm not sure it's necessary.

rochelle-worsnop commented 1 year ago

@stsievert

> How much data do you have, and how many features does each example have?

I'm working with weather forecast data. To tune one model, I have 252 normalized examples in the time dimension and 3,371 in the spatial dimension. I pool the time and space dimensions together to get 849,492 total examples to train with (before any splitting into train/validation folds). The test set uses completely different examples. Each example has 3 features for now, but one of the goals of my project is to test the performance of adding more features (maybe around 10 total).

> What model are you using, and what hyperparameters are you tuning?

A fairly standard, shallow ANN with regularization, and also a CNN. I haven't tried tuning the CNN yet, but this is what I'm trying for the ANN:

```python
import itertools
from scipy import stats

max_nlayers = 3  # maximum number of hidden layers
# all candidate neuron counts for each potential layer
n_neuronsL = [x for x in itertools.product((10, 20, 30), repeat=max_nlayers)]

param_dist = {
    "num_layers": stats.randint(1, max_nlayers + 1),
    "neurons_per_layer": stats.randint(0, len(n_neuronsL)),  # index into n_neuronsL
    "regularization": stats.loguniform(10.0**-6, 10.0**-2),
    "activation": ["elu", "relu", "tanh"],
    "adam_initial_learning_rate": [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3],
    "num_epochs": stats.randint(10, 300),
    "batch_size": [32, 62, 128, 256, 512, 1024, 2048, 4096, 8192],
}
```

> Why not tune the hyperparameters for one model, and use that for all 700? Why not tune one model (or 12 models) and use that to narrow down the search space?

Hmm, this is an interesting thought; I'm not sure of the answer. It's possible that 12 tuned models (one for each month) may be suitable, but I'm not sure. I'll think about this some more. This is my first NN project, so I'm still not sure what's acceptable to do yet.

> Why not use one train/test/validation split? That's (very) common in deep learning, which has access to lots of data and is very computationally constrained.

I do have access to a lot of data, except in the time dimension. I only have 250 examples there, so I wanted to use 5-fold cross-validation on the time dimension: train on 200 time examples (plus the spatial data) and validate the tuned model on the remaining 50 time examples (plus spatial data). Cycling through those folds for training and validation would let me evaluate on every bit of my data and would ideally give a more robust estimate of how well the tuned model generalizes to unseen data at a different time. It's a bit convoluted, but I'm essentially trying to replicate the method in a recently published research article (Sect. 3b, if you're interested).
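
To make this concrete, here's a small sketch of how the 5 time-wise folds could be expanded into (train, validation) index pairs over the pooled time-by-space examples. It assumes the pooled array is ordered time-major; pooled_indices is just a hypothetical helper, not part of Dask-ML:

```python
import numpy as np
from sklearn.model_selection import KFold

n_time, n_space = 252, 3371  # dimensions described above


def pooled_indices(time_indices, n_space):
    # time index t owns pooled rows [t * n_space, (t + 1) * n_space)
    return np.concatenate(
        [np.arange(t * n_space, (t + 1) * n_space) for t in time_indices]
    )


custom_cv = [
    (pooled_indices(train_t, n_space), pooled_indices(val_t, n_space))
    for train_t, val_t in KFold(n_splits=5).split(np.arange(n_time))
]
```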

> I might be able to hack together a solution for your desired use case, but I'm not sure it's necessary.

Thanks, that's very nice! If you'd like to talk more through it, I'd be happy to schedule a brief Google Meet or something like that to make sure it's not a waste of your time.

stsievert commented 1 year ago

Thanks for the explanation @rochelle-worsnop! That makes sense. Your use case sounds very computationally constrained, especially given the number of hyperparameters (from prior experience, I might go with fewer).

Here's my hacky implementation:

Implementation of CVModel:

```python
from sklearn.base import BaseEstimator
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import KFold, train_test_split
from dask.distributed import Client, get_client


class CVModel(BaseEstimator):
    def __init__(self, cv=5, **kwargs):
        self.cv = cv
        self.kwargs = kwargs

    def _init(self):
        self._inited = True
        self.k_fold_ = KFold(n_splits=self.cv)
        self.models_ = [
            SGDRegressor(random_state=i, **self.kwargs) for i in range(self.cv)
        ]

    def partial_fit(self, X_train, y_train):
        if not hasattr(self, "_inited"):
            self._init()
        client = get_client()  # eek!
        futures = [
            client.submit(model.partial_fit, X_train[train_idx], y_train[train_idx])
            for model, (train_idx, test_idx) in zip(
                self.models_, self.k_fold_.split(X_train)
            )
        ]
        self.models_ = client.gather(futures)
        return self

    def score(self, X_val, y_val):
        if not hasattr(self, "_inited"):
            self._init()
        client = get_client()  # eek!
        futures = [
            client.submit(model.score, X_val[test_idx], y_val[test_idx])
            for model, (train_idx, test_idx) in zip(
                self.models_, self.k_fold_.split(X_val)
            )
        ]
        scores = client.gather(futures)
        return sum(scores) / len(scores)  # mean score for this model


if __name__ == "__main__":
    client = Client()
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, random_state=10)

    m = CVModel()
    m.partial_fit(X_train, y_train)
    print(m.score(X_val, y_val))
```

This implementation isn't integrated with HyperbandSearchCV. I think the biggest issue will be properly selecting the train/test splits, something I'm relatively unfamiliar with (I use one random train/test/val split all the time). I suspect the cross-validation will be easiest with a DataFrame.

rochelle-worsnop commented 1 year ago

Thanks, @stsievert!! This is really helpful. Is the idea that I would use this CVModel instead of HyperbandSearchCV, or try to incorporate it with HyperbandSearchCV? Would it be able to stop training poorly performing models?

stsievert commented 1 year ago

CVModel can be used with HyperbandSearchCV, in place of whatever model you're currently using (e.g., MLPClassifier). It's a hacky way to get started with cross-validation; it'll depend a lot on how the dataset is crafted and on the chunks of the Dask Array.
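
Roughly, the wiring could look like the sketch below (hypothetical and untested). Note that for the search to actually work, the hyperparameters being tuned would need to be explicit __init__ arguments on CVModel rather than hidden in **kwargs, so that scikit-learn's get_params/set_params can see them:

```python
# Hypothetical sketch: CVModel (from the snippet above) as the estimator inside
# HyperbandSearchCV, so each "model" Hyperband trains is really `cv` copies
# trained on different folds.
from scipy import stats
from dask_ml.model_selection import HyperbandSearchCV

search = HyperbandSearchCV(
    CVModel(cv=5),                             # assumes `alpha` is exposed as an explicit init arg
    {"alpha": stats.loguniform(1e-6, 1e-2)},   # forwarded to each underlying SGDRegressor
    max_iter=27,                               # computational budget per model
)
# search.fit(X, y)  # with X, y chunked so each partial_fit call sees one sensible batch
```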

rochelle-worsnop commented 1 year ago

Ah, I see. That makes sense. I appreciate your help! I'll see if I can get it implemented.

stsievert commented 1 year ago

How's it going @rochelle-worsnop?

rochelle-worsnop commented 1 year ago

Hi @stsievert, thanks for checking back. I had to go a different direction for this project, so I ended up not using Dask-ML's HyperbandSearchCV. I wasn't able to figure out how to implement it for my project, but I'll probably give it another try for an upcoming project where I think it will be useful as well. Thanks for all of your help so far!
