dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License
893 stars 255 forks source link

TruncatedSVD svd_flip creates single-chunked array #401

Open mrocklin opened 5 years ago

mrocklin commented 5 years ago

Currently we use this function on the outputs of svd

def svd_flip(u, v):
    u2, v2 = delayed(skm.svd_flip, nout=2)(u, v)
    u = da.from_delayed(u2, shape=u.shape, dtype=u.dtype)
    v = da.from_delayed(v2, shape=v.shape, dtype=v.dtype)
    return u, v

This forces potentially large (I think?) multi-chunked arrays into a single chunk

I wonder if the actual sklearn function might work on its own. It doesn't appear to use any functionality outside of dask.array (assuming that functions like np.sign and np.argmax function as ufuncs)

def svd_flip(u, v, u_based_decision=True):
    """Sign correction to ensure deterministic output from SVD.

    Adjusts the columns of u and the rows of v such that the loadings in the
    columns in u that are largest in absolute value are always positive.

    Parameters
    ----------
    u : ndarray
        u and v are the output of `linalg.svd` or
        `sklearn.utils.extmath.randomized_svd`, with matching inner dimensions
        so one can compute `np.dot(u * s, v)`.

    v : ndarray
        u and v are the output of `linalg.svd` or
        `sklearn.utils.extmath.randomized_svd`, with matching inner dimensions
        so one can compute `np.dot(u * s, v)`.

    u_based_decision : boolean, (default=True)
        If True, use the columns of u as the basis for sign flipping.
        Otherwise, use the rows of v. The choice of which variable to base the
        decision on is generally algorithm dependent.

    Returns
    -------
    u_adjusted, v_adjusted : arrays with the same dimensions as the input.

    """
    if u_based_decision:
        # columns of u, rows of v
        max_abs_cols = np.argmax(np.abs(u), axis=0)
        signs = np.sign(u[max_abs_cols, xrange(u.shape[1])])
        u *= signs
        v *= signs[:, np.newaxis]
    else:
        # rows of v, columns of u
        max_abs_rows = np.argmax(np.abs(v), axis=1)
        signs = np.sign(v[xrange(v.shape[0]), max_abs_rows])
        u *= signs
        v *= signs[:, np.newaxis]
    return u, v

Or maybe this is unnecessary?

TomAugspurger commented 5 years ago

Copying the scikit-learn implementation seems fine. I was probably just trying to reduce maintenance, but not realizing the the delayed(svd_flip) would create a 1-block dask array.

mrocklin commented 5 years ago

Calling a delayed function on any dask collection will eventually force that collection into a concrete version of itself (a numpy array or pandas dataframe) so that it can call the function on it.

On Thu, Oct 11, 2018 at 4:20 PM Tom Augspurger notifications@github.com wrote:

Copying the scikit-learn implementation seems fine. I was probably just trying to reduce maintenance, but not realizing the the delayed(svd_flip) would create a 1-block dask array.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dask/dask-ml/issues/401#issuecomment-429104072, or mute the thread https://github.com/notifications/unsubscribe-auth/AASszNsPCEY8gVizHZ-S8kEKHPlJMOU7ks5uj6fxgaJpZM4XX_Wh .

TomAugspurger commented 5 years ago

Tried this briefly.

signs = da.sign(u[max_abs_cols, list(range(u.shape[1]))])

because

   1241         if any(isinstance(i, Array) and i.dtype.kind in 'iu' for i in index2):
-> 1242             self, index2 = slice_with_int_dask_array(self, index2)
   1243         if any(isinstance(i, Array) and i.dtype == bool for i in index2):
   1244             self, index2 = slice_with_bool_dask_array(self, index2)

~/sandbox/dask/dask/array/slicing.py in slice_with_int_dask_array(x, index)
    890     ]
    891     if sum(fancy_indexes) > 1:
--> 892         raise NotImplementedError("Don't yet support nd fancy indexing)")
    893
    894     out_index = []

NotImplementedError: Don't yet support nd fancy indexing)

Haven't tried looking for alternative ways of doing this.

mrocklin commented 5 years ago

You can probably replace this with

signs = da.sign(u[max_abs_cols, :u.shape[1]])
TomAugspurger commented 5 years ago

Is this called "pointwise indexing?"

ipdb> x[max_abs_cols, [0, 1]]
array([ 0.11476366, -0.09881881])
ipdb> x[max_abs_cols, :2]
array([[ 0.11476366, -0.00279077],
       [ 0.03486335, -0.09881881]])

we want the first kind. Which should be equivalent to slicing and then doing diag, I think

ipdb> np.diag(x[max_abs_cols])
array([ 0.11476366, -0.09881881])
mrocklin commented 5 years ago

You might want x.vindex[...]