dask / dask-glm


Asynchronous solvers #5

Open · mrocklin opened this issue 7 years ago

mrocklin commented 7 years ago

So far all of our solvers are synchronous. They compute full results in lock-step, for example alternating between performing a parallel mat-vec, doing a line search, and then doing another mat-vec. These algorithms are common on single-machine hardware but may not be ideal for larger clusters.

The distributed scheduler provides some decent capabilities for fully asynchronous computing, which may open us up to new algorithms. Are there asynchronous variants of some of these algorithms that might interest us?

Quick example of asynchronous code:

import random
from distributed import Client, as_completed

client = Client()

# load each chunk of data on the cluster as a future
data_futures = client.map(load_chunk, chunks)
params = {...}

# start an update task on a random sample of 100 data chunks
futures = client.map(compute_update, random.sample(data_futures, 100), **params)

ac = as_completed(futures)  # collection of running futures that yield in order of completion
for future in ac:
    update_info, score = future.result()
    if is_good(score):
        break
    update_params(params, update_info)
    new_future = client.submit(compute_update, random.choice(data_futures), **params)
    ac.add(new_future)

@hussainsultan @mcg1969 @jcrist @moody-marlin


mrocklin commented 7 years ago

In particular I was reading this paper on asynchronous SGD: https://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf

It's obviously focused on deep learning applications, but I thought that some of the ideas might carry over here as well. I'm unsure.
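To make the idea concrete, here is a minimal sketch (not dask-glm code) of what an asynchronous SGD loop in the spirit of that paper might look like with dask futures. The partial_gradient helper, the fixed step size, the update budget, and the load_chunk/chunks inputs reused from the example above are all illustrative assumptions:

import random
import numpy as np
from distributed import Client, as_completed

client = Client()

def partial_gradient(block, beta):
    # hypothetical helper: logistic-loss gradient on one (X, y) block
    X, y = block
    p = 1.0 / (1.0 + np.exp(-X.dot(beta)))
    return X.T.dot(p - y) / len(y)

blocks = client.map(load_chunk, chunks)   # load_chunk / chunks as in the example above
beta = np.zeros(2)
step = 0.1

futures = [client.submit(partial_gradient, b, beta) for b in blocks]

ac = as_completed(futures)
for n_updates, future in enumerate(ac):
    grad = future.result()
    beta = beta - step * grad             # apply the update as soon as any block finishes
    if n_updates >= 1000:                 # placeholder stopping rule
        break
    # resubmit a random block against the current (possibly stale elsewhere) parameters
    ac.add(client.submit(partial_gradient, random.choice(blocks), beta))

Each worker here computes its gradient against whatever parameters were current when its task was submitted, which is the stale-gradient tradeoff that asynchronous SGD accepts.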

mrocklin commented 7 years ago

To motivate this thinking, I'm trying to eliminate the white space in task profiles like the one attached (screenshot not reproduced here).

hussainsultan commented 7 years ago

This is an asynchronous variant of ADMM: http://jmlr.org/proceedings/papers/v32/zhange14.pdf However, I am not sure if we need it, since the global updates are the fastest step (unlike the line search step above).

mrocklin commented 7 years ago

We may still run into straggler issues where we have to wait for a few workers to finish the last pieces of data before we can synchronize updates and broadcast them out to start work everywhere again.

However, it's not clear that this will be a performance issue in practice.
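If it does become an issue, one possible mitigation, sketched below under the same hypothetical compute_update/data_futures setup as the first example (with an assumed merge_updates reduction), is to synchronize once a fraction of the block results has arrived rather than waiting for every worker:

from distributed import as_completed

def synchronized_update(client, data_futures, params, fraction=0.8):
    # submit one update task per data block
    futures = [client.submit(compute_update, b, **params) for b in data_futures]
    needed = int(fraction * len(futures))

    results = []
    for future in as_completed(futures):
        results.append(future.result())
        if len(results) >= needed:
            break                       # don't wait for the last few stragglers

    client.cancel(futures)              # drop whatever is still running
    return merge_updates(results)       # hypothetical reduction into new params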

mcg1969 commented 7 years ago

I would like to push back against this as a significant distraction from the core focus of this effort. We should not be moving towards a tradeoff between accuracy and speed at this point, which is exactly what a move to asynchronous approaches represents.

Besides: I'm not convinced yet that the white space above is due to fundamental limits in the algorithms. I'm waiting on some example notebooks from Chris to do some testing, but I have a suspicion there is an error in our existing design.

mrocklin commented 7 years ago

I agree that the whitespace above can probably be resolved through other methods. My main goal in this issue is to establish that this is a tool in our toolbox, should it become valuable. I'm fine waiting to think about this idea until synchronization costs become a performance issue.

mrocklin commented 7 years ago

"I'm waiting on some example notebooks from Chris to do some testing"

@mcg1969 in case this is useful, here is what I do (also from @moody-marlin):

import dask.array as da
import numpy as np
from dask import persist
from dask_glm.logistic import *
from dask_glm.utils import *

from distributed import Client
c = Client()

## size of problem (no. observations) and chunking
N = int(1e8)
chunks = int(1e6)
seed = 20009

X = da.random.random((N, 2), chunks=chunks)
y = make_y(X, beta=np.array([-1.5, 3]), chunks=chunks)

X, y = persist(X, y)

# newton(X, y)

bfgs(X, y, tol=1e-8)

Requires master versions of dask and dask-glm

mrocklin commented 7 years ago

"This is an asynchronous variant of ADMM: http://jmlr.org/proceedings/papers/v32/zhange14.pdf However, I am not sure if we need it, since the global updates are the fastest step (unlike the line search step above)."

That paper shows faster convergence than you would expect just from filling in the whitespace. I suspect we accelerate a bit because there is redundancy in the data itself, so many small, fast updates on partial datasets converge more quickly.

mrocklin commented 7 years ago

Anyway, writing the communication side of that algorithm looks more-or-less trivial from a dask.distributed perspective. I think it would be a fun demonstration. I'm not sure what the update functions should be though.
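For example, here is a rough sketch of that communication pattern (not the dask-glm solver): the driver holds the global consensus variable and folds in local block updates as they complete. The least-squares x-update, the plain averaging z-update, the fixed update budget, and the reuse of data_futures from the first example are all illustrative assumptions:

import numpy as np
from distributed import Client, as_completed

client = Client()

def local_admm_update(block, z, u, rho=1.0):
    # illustrative x-update for a least-squares block:
    #   x_i = argmin_b ||X_i b - y_i||^2 + (rho / 2) ||b - z + u||^2
    X, y = block
    A = X.T.dot(X) + rho * np.eye(X.shape[1])
    rhs = X.T.dot(y) + rho * (z - u)
    return np.linalg.solve(A, rhs)

d = 2                                        # number of features, as in the examples above
nblocks = len(data_futures)                  # data_futures as in the first example
z = np.zeros(d)                              # global consensus variable
x = [np.zeros(d) for _ in range(nblocks)]    # local primal variables
u = [np.zeros(d) for _ in range(nblocks)]    # scaled dual variables

futures = {client.submit(local_admm_update, blk, z, u[i]): i
           for i, blk in enumerate(data_futures)}

ac = as_completed(futures)
for n_updates, future in enumerate(ac):
    i = futures.pop(future)
    x[i] = future.result()
    u[i] = u[i] + x[i] - z
    z = np.mean(x, axis=0)                   # global update uses whichever x's are freshest
    if n_updates >= 1000:                    # placeholder stopping rule
        break
    new = client.submit(local_admm_update, data_futures[i], z, u[i])
    futures[new] = i
    ac.add(new)

A real solver would replace the fixed update budget with a proper convergence test.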

mrocklin commented 7 years ago

@moody-marlin any thoughts on solidifying the async admm solver in algorithms.py? Presumably this would need some hardening of the logic to support convergence checks and whatnot.
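For the convergence-check side, one option is the standard consensus-ADMM residual test (as in Boyd et al.), evaluated each time the global variable changes. A minimal sketch, with all names hypothetical:

import numpy as np

def admm_converged(x_blocks, u_blocks, z, z_old, rho=1.0, abstol=1e-4, reltol=1e-2):
    # x_blocks, u_blocks: lists of per-block primal / scaled dual variables
    # z, z_old: the current and previous global consensus variables
    x = np.stack(x_blocks)
    u = np.stack(u_blocks)
    nblocks, d = x.shape

    primal = np.linalg.norm(x - z)                             # distance of blocks from consensus
    dual = rho * np.sqrt(nblocks) * np.linalg.norm(z - z_old)  # how much z just moved

    eps_pri = np.sqrt(nblocks * d) * abstol + reltol * max(
        np.linalg.norm(x), np.sqrt(nblocks) * np.linalg.norm(z))
    eps_dual = np.sqrt(nblocks * d) * abstol + reltol * rho * np.linalg.norm(u)
    return primal <= eps_pri and dual <= eps_dual

In an asynchronous loop this would be checked after each global update, keeping z_old from the previous one, instead of a fixed update budget.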

cicdw commented 7 years ago

Yea that sounds like a good plan; I can look into this more on Thursday / Friday.