Open mrocklin opened 6 years ago
@fabianp could you post that bit of code? I'd like to see.
I uploaded it here: https://gist.github.com/fabianp/6c946859cf2ab6042ca036663aa17f25
Thanks @fabianp!
I cleaned this up a bit and re-uploaded it here:
https://gist.github.com/3f64e82dd5aae56fedd0655cb05bf3ef
This is a bit more careful about calling dask.delayed everywhere. As a result we avoid accidentally putting concrete numpy arrays into dask objects (which would cause a lot of client-to-cluster communication) and avoid calling compute unnecessarily (which would cause a lot of cluster-to-client communication)
Things run decently well when using processes=False
though poorly with multiple processs. This uncovered an issue with Dask/Numba integration here that I've raised here: https://github.com/numba/numba/issues/3026
Some tasks for future work:
Integrate this into the infrastructure expected by the rest of dask-ml
We have a couple options. IMO, since this is a solver, it could go in dask-glm. We could maybe integrate with some of the Family stuff already there. There will be some infrastructure things like adding numba as a dependency that I'm happy to take care of. And then the estimators in dask_ml.linear_model
(e.g. LogisticRegression
) could have solver='saga'
.
I wonder if @cicdw has any interest in looking at SAGA here and implementing it within dask-glm. I suspect that he's busy with other endeavors, but it never hurts to try a gentle nerd-snipe :)
@mrocklin my interest is definitely piqued; let me take some time to review the prior art in this thread and see what I can put together.
I believe that @fabianp has a proof of concept for an incremental SAGA implementation. @fabianp, can I ask you to push a bit of code, perhaps as a gist, and then link from issue? I would like to make sure that we don't lose this.