If your only goal is to monitor progress, you can use either dask.diagnostics.ProgressBar
for local schedulers (threads, processes) or the dashboard for the dask.distributed
scheduler.
Example when using threads/processes:
In [1]: %paste
from sklearn.datasets import load_digits
from sklearn.svm import SVC
import dask_searchcv as dcv
import numpy as np
digits = load_digits()
param_space = {'C': np.logspace(-4, 4, 9),
               'gamma': np.logspace(-4, 4, 9),
               'class_weight': [None, 'balanced']}
model = SVC(kernel='rbf')
search = dcv.GridSearchCV(model, param_space, cv=3)
## -- End pasted text --
In [2]: from dask.diagnostics import ProgressBar
In [3]: with ProgressBar():
...: search.fit(digits.data, digits.target)
...:
[########### ] | 28% Completed | 8.8s
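The dashboard plays the same role for the dask.distributed scheduler. A minimal sketch, reusing the objects from the example above; whether the client is passed via a scheduler keyword or simply picked up as the default scheduler is an assumption to check against your installed version:

```python
from dask.distributed import Client

# Starting a Client launches a local cluster and serves the dashboard
# (by default at http://localhost:8787).
client = Client()

# Assumption: dask-searchcv accepts the client via the `scheduler` keyword
# (otherwise it should be picked up as the default scheduler).
search = dcv.GridSearchCV(model, param_space, cv=3, scheduler=client)
search.fit(digits.data, digits.target)  # monitor on the dashboard's task stream
```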
Supporting the verbose parameter in general is trickier, but still doable. I have no plans to get to this soon though.
Thank you for the snippet. That is exactly what I was looking for.
Regarding the verbose parameter: as a first measure, couldn't you add the progress bar inside the fit call if verbose > 0?

For verbose=1, the sklearn grid search reports the number of finished fit calls (the total number is cv times n_iter). So the dask version with the progress bar would imitate that behaviour.

If I remember right, for verbose=2 sklearn also reports which parameter combinations have already finished. I guess this would be more work to implement with dask.
Regarding the verbose parameter: as a first measure, couldn't you add the progress bar inside the fit call if verbose > 0?
The trick is implementing it in a way that works both distributed and local. Again, this is doable, but requires a bit of work. Since we already provide ways of monitoring progress both locally and distributed, I'm less likely to implement this in the near future. Of course PRs are welcome :).
Sure! Can you describe how you would implement that? Then I can see what I can do :).
Can you describe how you would implement that?
My first thought would be to dispatch either to the local or the distributed progress bar, depending on the scheduler. There'd have to be a bit more work to get this to work properly for distributed. I'd probably switch _normalize_scheduler to also do the compute, something like:
def run_scheduler(dsk, keys, scheduler, n_jobs, verbose=0):
    """Determine which scheduler to use, and execute the graph, logging as appropriate"""
    pass
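To make that concrete, here is a rough sketch of how such a helper could dispatch between the two progress bars. The signature, the duck-typing on a bound Client.get, and the use of verbose are illustrative assumptions, not existing dask-searchcv code:

```python
from dask.diagnostics import ProgressBar


def run_scheduler(dsk, keys, get, verbose=0):
    """Execute the graph with the chosen scheduler, showing progress when verbose."""
    # If `get` is a bound method of a distributed Client, recover the client.
    client = getattr(get, '__self__', None)
    if client is not None and hasattr(client, 'gather'):
        # Distributed: ask for futures instead of blocking, show the
        # distributed progress bar, then gather the results.
        from dask.distributed import progress
        futures = client.get(dsk, keys, sync=False)
        if verbose:
            progress(futures)
        return client.gather(futures)
    # Local threaded/multiprocessing scheduler: wrap the call in the local ProgressBar.
    if verbose:
        with ProgressBar():
            return get(dsk, keys)
    return get(dsk, keys)
```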
Looking at the verbose parameter for scikit-learn though, they implement a higher level of logging than just a progress bar. Since we try to imitate scikit-learn as much as possible, it might be confusing for users if our verbose parameter acted differently. Not sure.
Is there a reason the current methods of monitoring are insufficient for you?
First of all, I am just starting to get familiar with dask and dask-searchcv. I use the grid search on a cluster, and my experience is still building up.
Is there a reason the current methods of monitoring are insufficient for you?
I have monitored the grid search with the dask bokeh dashboard. Under the Tasks panel, I can see the status of the tasks (waiting, reading, etc.).

Now, let's say I use cv=5 and n_iter=100. This means that I have to fit 500 models, but the number of tasks in the dashboard is not 500 but higher. So the dashboard also contains "supportive" tasks, probably result aggregation?

It would be nice to have a dashboard that shows x/500 models already fitted, like the verbose parameter in sklearn does.
I'd argue that the task view provides a better estimate of the time remaining than just x/500 models fit, as there's more to do than just fitting each model. It gives you a set of progress bars that indicate all the remaining work.

However, if you just want to see how many models you've fit/have left to fit, you can look for the tasks with the label '{estimator_type}-fit-score' (for non-pipeline estimators) or 'score' (for pipelines). There will always be cv * n_iter (or cv * len(param_space) for GridSearchCV) of these, which you can use to track the number of models fit.
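For the GridSearchCV example from earlier in this thread, that count is easy to work out up front (a small illustration, not part of any API):

```python
from sklearn.model_selection import ParameterGrid

n_candidates = len(ParameterGrid(param_space))  # 9 C values * 9 gammas * 2 class weights = 162
total_fit_score_tasks = 3 * n_candidates        # cv=3 -> 486 '...-fit-score' tasks to look for
```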
Oops, I missed your answer, @jcrist.
Now, having more experience with dask, I agree with you.
Should we close this issue?
No problem. Closing.
I like the verbose parameter to get a feeling for how long the grid search will take. If you point me in the right direction, I can also try to submit a PR. However, I do not have any experience with dask so far.