dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License
892 stars 255 forks source link

Incremental SAGA #189

Open mrocklin opened 6 years ago

mrocklin commented 6 years ago

I believe that @fabianp has a proof of concept for an incremental SAGA implementation. @fabianp, can I ask you to push a bit of code, perhaps as a gist, and then link from issue? I would like to make sure that we don't lose this.

stsievert commented 6 years ago

@fabianp could you post that bit of code? I'd like to see.

fabianp commented 6 years ago

I uploaded it here: https://gist.github.com/fabianp/6c946859cf2ab6042ca036663aa17f25

stsievert commented 6 years ago

Thanks @fabianp!

mrocklin commented 6 years ago

I cleaned this up a bit and re-uploaded it here:

https://gist.github.com/3f64e82dd5aae56fedd0655cb05bf3ef

This is a bit more careful about calling dask.delayed everywhere. As a result we avoid accidentally putting concrete numpy arrays into dask objects (which would cause a lot of client-to-cluster communication) and avoid calling compute unnecessarily (which would cause a lot of cluster-to-client communication)

Things run decently well when using processes=False though poorly with multiple processs. This uncovered an issue with Dask/Numba integration here that I've raised here: https://github.com/numba/numba/issues/3026

mrocklin commented 6 years ago

Some tasks for future work:

  1. Dive into dask/numba issues (this is probably either on me or on numba devs)
  2. Do this again with scipy sparse matrices (I believe that this was @fabianp 's original implementation)
  3. Implement stopping criteria?
  4. Integrate this into the infrastructure expected by the rest of dask-ml. @TomAugspurger can you suggest what the right approach here is?
TomAugspurger commented 6 years ago

Integrate this into the infrastructure expected by the rest of dask-ml

We have a couple options. IMO, since this is a solver, it could go in dask-glm. We could maybe integrate with some of the Family stuff already there. There will be some infrastructure things like adding numba as a dependency that I'm happy to take care of. And then the estimators in dask_ml.linear_model (e.g. LogisticRegression) could have solver='saga'.

mrocklin commented 6 years ago

I wonder if @cicdw has any interest in looking at SAGA here and implementing it within dask-glm. I suspect that he's busy with other endeavors, but it never hurts to try a gentle nerd-snipe :)

cicdw commented 6 years ago

@mrocklin my interest is definitely piqued; let me take some time to review the prior art in this thread and see what I can put together.