EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Dask now uses njobs instead of all possible cores #1252

Closed · perib closed this 2 years ago

perib commented 2 years ago

What does this PR do?

Dask now uses n_jobs instead of all possible cores. In dask.compute, the number of workers is now set explicitly via num_workers=self.n_jobs.

Also removed the line self.dask_graphs_ = tmp_result_scores since this variable isn't actually used anywhere.
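For reference, dask.compute accepts a num_workers keyword that caps the scheduler's worker pool. This is a minimal standalone sketch of that call pattern, not TPOT source; the toy delayed tasks and the value 4 are placeholders:

import dask
from dask import delayed

# Toy delayed tasks standing in for TPOT's per-pipeline evaluations.
tasks = [delayed(pow)(i, 2) for i in range(8)]

# num_workers caps the scheduler's worker pool, analogous to num_workers=self.n_jobs.
results = dask.compute(*tasks, num_workers=4)
print(results)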

Where should the reviewer start?

How should this PR be tested?

est = tpot.TPOTClassifier(use_dask=True, n_jobs=1)
est = tpot.TPOTClassifier(use_dask=True, n_jobs=32)

Fit each estimator and observe the CPU usage and time to completion.
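A minimal end-to-end version of that check could look like the sketch below; the dataset, timing harness, and the small generations/population_size settings are illustrative and not part of the PR:

import time
import tpot
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

for n_jobs in (1, 32):
    # Small generations/population_size keep the comparison quick.
    est = tpot.TPOTClassifier(use_dask=True, n_jobs=n_jobs,
                              generations=2, population_size=10)
    start = time.time()
    est.fit(X, y)
    print(n_jobs, time.time() - start)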

Any background context you want to provide?

Note that within a dask Client context manager, the client's n_workers parameter is used rather than the n_jobs passed to TPOT.

What are the relevant issues?

#1223

Screenshots (if appropriate)

Questions:

It might be helpful to demonstrate an example using LocalCluster and the dask joblib backend. When running inside a Client context manager, the num_workers passed to dask.compute is actually ignored in favor of the LocalCluster parameters. Since this behavior is not obvious, I think it would be helpful to include it in the docs.

For example:

import time
import dask.distributed
from dask.distributed import LocalCluster, Client

import tpot

# X and y are assumed to be an existing feature matrix and label vector.
with LocalCluster(threads_per_worker=32, n_workers=1, processes=False) as cluster:
    with Client(cluster) as client:
        # performance_report writes an HTML report (dask-report.html by default).
        with dask.distributed.performance_report():
            start = time.time()
            est = tpot.TPOTClassifier(use_dask=True)
            est.fit(X, y)
            t1 = time.time() - start
            print(t1)
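For the docs, the joblib route mentioned above could be shown alongside the LocalCluster example. This is a rough sketch only; the cluster sizes are placeholders, and combining use_dask=True with the joblib "dask" backend is an assumption the docs example would need to confirm:

import joblib
import tpot
from dask.distributed import LocalCluster, Client

# X and y are assumed to be an existing feature matrix and label vector.
with LocalCluster(n_workers=4, threads_per_worker=1) as cluster:
    with Client(cluster) as client:
        # Route joblib-backed work onto the dask cluster instead of local workers.
        with joblib.parallel_backend("dask"):
            est = tpot.TPOTClassifier(use_dask=True)
            est.fit(X, y)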