dswah / pyGAM

[HELP REQUESTED] Generalized Additive Models in Python
https://pygam.readthedocs.io
Apache License 2.0

add joblib #124

Open jeweinberg opened 6 years ago

jeweinberg commented 6 years ago

Is there a plan to add joblib into the project? It would be nice to be able to set n_jobs for each of the algorithms similar to sklearn.

dswah commented 6 years ago

yes! i would love to add that for parallelizing the gridsearch and bootstrapping tasks.

i've only messed around a little bit with multiprocessing (https://docs.python.org/2/library/multiprocessing.html) and Pathos (https://github.com/uqfoundation/pathos), but i've been running into problems getting the arguments to pickle/dill correctly, and other basic things.

i've left it on the back-burner for now, since i think there are bigger fish to fry, but if you are knowledgeable about it, i would certainly welcome a PR or example :)
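
A minimal sketch of what a joblib-parallelized lam grid search could look like, assuming pyGAM's LinearGAM API and the GCV entry in a fitted model's statistics_ dict; the fit_one helper, the lam grid, and the toy data are purely illustrative and not pyGAM's built-in gridsearch:

```python
# Illustrative only: a hand-rolled parallel grid search over the smoothing
# parameter lam, fanned out with joblib. Not pyGAM's built-in gridsearch.
import numpy as np
from joblib import Parallel, delayed
from pygam import LinearGAM

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, (200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 200)

def fit_one(lam, X, y):
    # fit one candidate smoothing parameter and return its GCV score
    # (assumes pyGAM exposes GCV via the fitted model's statistics_ dict)
    gam = LinearGAM(lam=lam).fit(X, y)
    return lam, gam.statistics_['GCV']

lams = np.logspace(-3, 3, 11)
results = Parallel(n_jobs=-1)(delayed(fit_one)(lam, X, y) for lam in lams)
best_lam, best_gcv = min(results, key=lambda r: r[1])
```

With n_jobs=-1 joblib fans the candidate fits out across all available cores; the same pattern would apply to bootstrap resamples.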

jeweinberg commented 6 years ago

The concept is fairly simple. You just have a function that you pass through Parallel:

```python
from math import sqrt
from joblib import Parallel, delayed

Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(10))
# [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```

dswah commented 6 years ago

for some reason i don't think i've tried joblib...

in practice i've run into implementation issues (some stemming from OSX) with the various fork types (fork, forkserver, spawn, etc), executing numpy code in the forked process, and pickling the objects that get sent to the child processes...

i'll give this a shot.
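
As a hedged aside, joblib largely hides the fork/spawn and pickling details mentioned above. A small sketch, assuming a reasonably recent joblib where the default loky backend is available; the numpy workload is just a placeholder:

```python
# Sketch: selecting a joblib backend explicitly rather than managing
# fork/forkserver/spawn by hand. The 'loky' backend starts fresh worker
# processes (avoiding fork-related numpy issues on OSX) and handles
# pickling of the submitted callables and arguments.
import numpy as np
from joblib import Parallel, delayed

def heavy_numpy_task(seed):
    # placeholder for any CPU-bound numpy work
    rng = np.random.RandomState(seed)
    A = rng.rand(500, 500)
    return np.linalg.eigvalsh(A @ A.T).max()

results = Parallel(n_jobs=4, backend='loky')(
    delayed(heavy_numpy_task)(s) for s in range(8)
)
```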

h4gen commented 5 years ago

@dswah I highly recommend focusing on joblib for parallelism! It is supported by the whole pydata stack, and Dask is able to provide a backend for joblib. Ergo, using joblib gives pyGAM the ability to be distributed across thousands of worker nodes, making it a potential big data tool. Read this for more info. Joblib is also really easy to use for programmers, as @jeweinberg pointed out.

dswah commented 5 years ago

@jeweinberg @h4gen thanks for the tips.

i am adding joblib for concurrent execution and out-of-core learning.

do you all know if it is necessary to add the partial_fit() method for distributed fitting with dask?

h4gen commented 5 years ago

@dswah It should not be necessary as far as I understand it. Using Parallel and delayed from joblib should be sufficient when dask is used as the backend. Dask distributes the data as well as the jobs all by itself. The dask devs may correct me if I am wrong.
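
For reference, a small sketch of the pattern described above, assuming dask.distributed is installed; the local Client and the toy sqrt task are illustrative:

```python
# Sketch: running unchanged joblib code on Dask workers via the dask backend.
# Assumes dask.distributed is installed; a local Client is used here, but the
# same code could point at a remote scheduler.
from math import sqrt
from dask.distributed import Client
from joblib import Parallel, delayed, parallel_backend

client = Client()  # spins up a local cluster

with parallel_backend('dask'):
    # same Parallel/delayed code as before, now executed by Dask workers
    out = Parallel()(delayed(sqrt)(i ** 2) for i in range(10))
```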