Open jeweinberg opened 6 years ago
yes! i would love to add that for parallelizing the gridsearch and bootstrapping tasks.
i've only messed around with it a little bit with Multiprocessing (https://docs.python.org/2/library/multiprocessing.html) and Pathos (https://github.com/uqfoundation/pathos) but i've been running into problems getting the arguments to pickle/dill correctly, and other basic things.
i've left it on the back-burner for now, since i think there are bigger fish to fry, but if you are knowledgeable about it, i would certainly welcome a PR or example :)
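for context, the pickle/dill trouble usually comes down to what the stdlib pickler can and cannot serialize. a minimal illustration (not pyGAM code, just the general gotcha):

```python
import pickle

def double(x):
    # module-level functions pickle by reference, so this round-trips fine
    return x * 2

assert pickle.loads(pickle.dumps(double))(3) == 6

# lambdas and locally defined closures cannot be pickled by the stdlib,
# which is the usual stumbling block when handing them to multiprocessing;
# pathos (via dill) and joblib's loky backend (via cloudpickle) work
# around exactly this limitation.
try:
    pickle.dumps(lambda x: x * 2)
except (pickle.PicklingError, AttributeError, TypeError) as e:
    print("lambda is not picklable:", type(e).__name__)
```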
The concept is fairly simple: you just have a function that you pass through Parallel.
```python
>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=1)(delayed(sqrt)(i ** 2) for i in range(10))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```
for some reason i don't think i've tried joblib...
in practice i've run into implementation issues (some stemming from OSX) with the various start methods (fork, forkserver, spawn, etc.), with executing numpy code in the forked process, and with pickling the objects that get sent to the child processes...
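for what it's worth, the start method can be chosen explicitly per-pool via a context object; a minimal sketch (choosing "spawn" here just to sidestep the macOS fork/numpy issues, at the cost of re-importing the module in each child):

```python
import multiprocessing as mp

def square(x):
    # must be module-level so child processes can unpickle it
    return x * x

def run_pool():
    # "fork" is the historical default on Linux; "spawn" starts fresh
    # interpreters, which avoids fork-related deadlocks with numpy/BLAS
    # on macOS but requires everything sent to workers to be picklable
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        return pool.map(square, range(5))

if __name__ == "__main__":
    print(run_pool())  # [0, 1, 4, 9, 16]
```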
i'll give this a shot.
@dswah I highly recommend focusing on joblib for parallelism! It is supported by the whole pydata stack, and Dask is able to provide a backend for joblib. Ergo, using joblib gives pyGAM the ability to be distributed on thousands of worker nodes, making it a potential big data tool. Read this for more info. Joblib is also really easy to use for programmers, as @jeweinberg pointed out.
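to illustrate: the same Parallel/delayed code can be pointed at a Dask cluster just by switching the joblib backend. a sketch (the scheduler address is hypothetical, and the commented part assumes dask.distributed is installed):

```python
from math import sqrt
from joblib import Parallel, delayed, parallel_backend

# with plain joblib, the work runs in local worker processes/threads
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
print(results)

# the identical Parallel/delayed code can target a Dask cluster by
# switching the backend (hypothetical scheduler address):
#
#   from dask.distributed import Client
#   client = Client("scheduler-address:8786")
#   with parallel_backend("dask"):
#       Parallel()(delayed(sqrt)(i ** 2) for i in range(10))
```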
@jeweinberg @h4gen thanks for the tips.
i am adding joblib for concurrent execution and out-of-core learning.
do you all know if it is necessary to add the partial_fit() method for distributed fitting with dask?
@dswah It should not be necessary, as far as I understand it. The usage of Parallel and delayed from joblib should be sufficient when dask is used as the backend. Dask distributes the data as well as the jobs all by itself. dask devs may correct me if I am wrong.
Is there a plan to add joblib to the project? It would be nice to be able to set n_jobs for each of the algorithms, similar to sklearn.
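something like the sketch below is what an sklearn-style n_jobs could look like for the gridsearch. grid_search and score_fn here are hypothetical stand-ins, not pyGAM's actual API:

```python
from itertools import product
from joblib import Parallel, delayed

def grid_search(score_fn, grid, n_jobs=1):
    # hypothetical helper, not pyGAM's real gridsearch: evaluates
    # score_fn over every parameter combination in parallel and
    # returns the best (lowest-score) one; n_jobs follows the
    # sklearn convention, where n_jobs=-1 means "use all cores"
    keys = list(grid)
    combos = [dict(zip(keys, vals)) for vals in product(*grid.values())]
    scores = Parallel(n_jobs=n_jobs)(
        delayed(score_fn)(**params) for params in combos
    )
    return min(zip(scores, combos), key=lambda t: t[0])

def score(lam, n_splines):
    # toy objective, minimized at lam=1, n_splines=10
    return (lam - 1) ** 2 + abs(n_splines - 10)

best_score, best_params = grid_search(
    score, {"lam": [0.1, 1, 10], "n_splines": [5, 10, 20]}, n_jobs=2
)
print(best_params)  # {'lam': 1, 'n_splines': 10}
```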