dask / dask-searchcv

dask-searchcv is now part of dask-ml: https://github.com/dask/dask-ml
BSD 3-Clause "New" or "Revised" License

Maybe add verbose parameter to "RandomizedSearchCV" ? #51

Closed MaxBenChrist closed 6 years ago

MaxBenChrist commented 7 years ago

I like the verbose parameter; it gives a feeling for how long the grid search will take.

If you point me in the right direction, I can also try to submit a PR. However, I do not have any experience with dask so far.

jcrist commented 7 years ago

If your only goal is to monitor progress, you can use either dask.diagnostics.ProgressBar for local schedulers (threads, processes) or the dashboard for the dask.distributed scheduler.

Example when using threads/processes:

In [1]: %paste
    from sklearn.datasets import load_digits
    from sklearn.svm import SVC
    import dask_searchcv as dcv
    import numpy as np

    digits = load_digits()

    param_space = {'C': np.logspace(-4, 4, 9),
                   'gamma': np.logspace(-4, 4, 9),
                   'class_weight': [None, 'balanced']}

    model = SVC(kernel='rbf')
    search = dcv.GridSearchCV(model, param_space, cv=3)

## -- End pasted text --

In [2]: from dask.diagnostics import ProgressBar

In [3]: with ProgressBar():
   ...:     search.fit(digits.data, digits.target)
   ...:
[###########                             ] | 28% Completed |  8.8s

Supporting the verbose parameter in general is trickier, but still doable. I have no plans to get to this soon though.

MaxBenChrist commented 7 years ago

Thank you for the snippet. That is exactly what I was looking for.

Regarding the verbose parameter. As a first measure: couldn't you enable the progress bar inside the fit call if verbose > 0?

For verbose=1, sklearn's grid search reports the number of finished fit calls (the total number is cv times n_iter), so the dask version with the progress bar would imitate that behaviour.

If I remember right, for verbose=2 sklearn also reports which parameter combinations have already finished. I guess this would be more work to implement with dask.
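For reference, scikit-learn's verbose=1 behaviour on a small toy search looks like this (the estimator and parameter grid here are just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# verbose=1 makes sklearn report the overall fit count up front,
# e.g. "Fitting 3 folds for each of 2 candidates, totalling 6 fits"
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.1, 1.0]}, cv=3, verbose=1)
search.fit(X, y)
```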

jcrist commented 7 years ago

Regarding the verbose parameter. As a first measure: couldn't you enable the progress bar inside the fit call if verbose > 0?

The trick is implementing in a way that works both distributed and local. Again, this is doable, but requires a bit of work. Since we already provide ways of monitoring progress both locally and distributed, I'm less likely to implement this in the near future. Of course PRs are welcome :).
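For the local-scheduler case, the suggested "progress bar if verbose > 0" wrapping could look something like this (a hypothetical sketch; `fit_with_progress` and `_maybe_progress` are not part of dask-searchcv, and this does not cover the distributed scheduler):

```python
from contextlib import contextmanager

from dask.diagnostics import ProgressBar

@contextmanager
def _maybe_progress(verbose):
    # hypothetical helper: show a local progress bar only when verbose > 0
    if verbose > 0:
        with ProgressBar():
            yield
    else:
        yield

def fit_with_progress(search, X, y, verbose=0):
    # wrap the (local-scheduler) fit call in an optional progress bar
    with _maybe_progress(verbose):
        return search.fit(X, y)
```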

MaxBenChrist commented 7 years ago

The trick is implementing in a way that works both distributed and local. Again, this is doable, but requires a bit of work. Since we already provide ways of monitoring progress both locally and distributed, I'm less likely to implement this in the near future. Of course PRs are welcome :).

Sure! Can you describe how you would implement that? Then I can see what I can do :).

jcrist commented 7 years ago

Can you describe how you would implement that?

My first thought would be to dispatch to either the local or the distributed progress bar, depending on the scheduler. There'd have to be a bit more work to get this to work properly for distributed - I'd probably switch _normalize_scheduler to also do the compute, something like:

def run_scheduler(dsk, keys, scheduler, n_jobs, verbose=0):
    """Determine which scheduler to use, and execute the graph, logging as appropriate"""
    pass
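Filled in, such a dispatcher might look like the following (a hypothetical sketch, not dask-searchcv's actual implementation; here `get` stands for whichever scheduler function `_normalize_scheduler` resolved to, and the bound-method check is just a heuristic for detecting a distributed Client):

```python
from dask.diagnostics import ProgressBar

def run_scheduler(dsk, keys, get, verbose=0):
    """Execute graph `dsk` for `keys` with scheduler function `get`,
    showing a local progress bar when verbose > 0 (hypothetical sketch)."""
    # Heuristic: a distributed Client exposes `get` as a bound method;
    # in that case progress is better monitored on the dashboard.
    is_distributed = hasattr(get, "__self__")
    if verbose > 0 and not is_distributed:
        with ProgressBar():
            return get(dsk, keys)
    return get(dsk, keys)
```

With the synchronous scheduler, for example, `run_scheduler(dsk, keys, dask.get, verbose=1)` would execute the graph locally under a progress bar.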

Looking at the verbose parameter for scikit-learn though, they implement a higher level of logging than just a progressbar. Since we try to imitate scikit-learn as much as possible, it might be confusing for users if our verbose parameter acted differently. Not sure.

Is there a reason the current methods of monitoring are insufficient for you?

MaxBenChrist commented 7 years ago

First of all, I am just starting to get familiar with dask and dask-searchcv. I use the grid search on a cluster and am still building up experience.

Is there a reason the current methods of monitoring are insufficient for you?

I have monitored the grid search with the dask bokeh dashboard. Under the tasks panel, I can see the status of the tasks (waiting, processing, etc.).

Now, let's say I use cv=5 and n_iter=100. That means I have to fit 500 models, but the number of tasks in the dashboard is higher than 500. So the dashboard also contains "supportive" tasks - probably result aggregation?

It would be nice to have a dashboard that shows x/500 models already fitted, like the verbose parameter in sklearn does.

jcrist commented 7 years ago

I'd argue that the task view provides a better estimate of time remaining than just x/500 models fit, as there's more to do than just fitting each model. It gives you a set of progress bars that indicate all the remaining work.

However, if you just want to see how many models you've fit/have left to fit, you can look for the tasks with label '{estimator_type}-fit-score' (for non-pipeline estimators) or 'score' (for pipelines). There will always be cv * n_iter (or cv * len(param_space) for GridSearchCV) of these, which you can use to track the number of models fit.
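The bookkeeping behind that is just the arithmetic from the earlier comment (a trivial illustration; the task label format is the one described above):

```python
# With cv folds and n_iter sampled candidates, the number of
# '<estimator>-fit-score' tasks in the dashboard is fixed at cv * n_iter,
# so counting completed tasks with that label gives "x / 500 models fit".
cv, n_iter = 5, 100
total_fit_tasks = cv * n_iter
print(total_fit_tasks)
```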

MaxBenChrist commented 6 years ago

Oops, I missed your answer @jcrist.

Now, having more experience with dask, I agree with you.

Should we close this issue?

jcrist commented 6 years ago

No problem. Closing.