dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Fitting dask_ml.decomposition.PCA returns TypeError: sum() got an unexpected keyword argument 'keepdims' #601

Open DoDzilla-ai opened 4 years ago

DoDzilla-ai commented 4 years ago

My data is a scipy.sparse.csr_matrix. To convert it to a dask.array, I send the data to the cluster with client.scatter and then call dask.array.from_delayed. Finally, I call fit, which raises this error: TypeError: sum() got an unexpected keyword argument 'keepdims'. Below you can find information about the variables used in the code, the code itself, and the full traceback. I will try to add a minimal working example.

Variable Information:

X before conversion to dask array:

>>> type(X)
<class 'scipy.sparse.csr.csr_matrix'>
>>> X.shape
(10000, 1000)

X_distributed:

>>>X_distributed
<Future: finished, type: scipy.csr_matrix, key: csr_matrix-a1d662ba4773c868422a2a23905fe4f3>

X after conversion to dask array:

>>> X
dask.array<from-value, shape=(10000, 1000), dtype=float64, chunksize=(10000, 1000), chunktype=numpy.ndarray>
>>> type(X)
<class 'dask.array.core.Array'>
>>> X.shape
(10000, 1000)
>>> type(X.compute())
<class 'scipy.sparse.csr.csr_matrix'>

Code

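>>> import dask.array as da
>>> from dask_ml.decomposition import PCA as DaskPca
>>> # Do some stuff here.
>>> dask_pca = DaskPca(n_components=None)
>>> # self.client is the distributed Client; scatter returns a Future
>>> X_distributed = self.client.scatter(X)
>>> # Wrap the Future as a single-chunk dask array (the chunk is still a csr_matrix)
>>> X = da.from_delayed(X_distributed, shape=X.shape, dtype=float)
>>> dask_pca.fit(X=X)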

Full traceback

  File "/home/dodzilla/my_project/components_with_adapter/dimension_reduction/reduction_adapter.py", line 101, in fit_transform
    dask_pca.fit(X=X)
  File "/home/dodzilla/.local/lib/python3.6/site-packages/dask_ml/decomposition/pca.py", line 190, in fit
    self._fit(X)
  File "/home/dodzilla/.local/lib/python3.6/site-packages/dask_ml/decomposition/pca.py", line 325, in _fit
    singular_values,
  File "/home/dodzilla/.local/lib/python3.6/site-packages/dask/base.py", line 436, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/dodzilla/.local/lib/python3.6/site-packages/distributed/client.py", line 2573, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/home/dodzilla/.local/lib/python3.6/site-packages/distributed/client.py", line 1873, in gather
    asynchronous=asynchronous,
  File "/home/dodzilla/.local/lib/python3.6/site-packages/distributed/client.py", line 768, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/home/dodzilla/.local/lib/python3.6/site-packages/distributed/utils.py", line 334, in sync
    raise exc.with_traceback(tb)
  File "/home/dodzilla/.local/lib/python3.6/site-packages/distributed/utils.py", line 318, in f
    result[0] = yield future
  File "/home/dodzilla/.local/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/dodzilla/.local/lib/python3.6/site-packages/distributed/client.py", line 1729, in _gather
    raise exception.with_traceback(traceback)
  File "/home/dodzilla/.local/lib/python3.6/site-packages/dask/optimization.py", line 982, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/home/dodzilla/.local/lib/python3.6/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/home/dodzilla/.local/lib/python3.6/site-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/home/dodzilla/.local/lib/python3.6/site-packages/dask/utils.py", line 29, in apply
    return func(*args, **kwargs)
  File "/home/dodzilla/.local/lib/python3.6/site-packages/dask/array/reductions.py", line 539, in mean_chunk
    total = sum(x, dtype=dtype, **kwargs)
  File "<__array_function__ internals>", line 6, in sum
  File "/home/dodzilla/.local/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2182, in sum
    initial=initial, where=where)
  File "/home/dodzilla/.local/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return reduction(axis=axis, dtype=dtype, out=out, **passkwargs)
TypeError: sum() got an unexpected keyword argument 'keepdims'

TomAugspurger commented 4 years ago

scipy.sparse matrices don't support the ndarray interface, so many dask.array methods don't work with them. A simpler example:

In [10]: X = scipy.sparse.eye(10, format='csr')

In [11]: dX = da.from_array(X)

In [12]: dX.sum()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-fa1407bd43c5> in <module>
----> 1 dX.sum()

~/sandbox/dask/dask/array/core.py in sum(self, axis, dtype, keepdims, split_every, out)
   1960             keepdims=keepdims,
   1961             split_every=split_every,
-> 1962             out=out,
   1963         )
   1964

~/sandbox/dask/dask/array/reductions.py in sum(a, axis, dtype, keepdims, split_every, out)
    338         dtype=dtype,
    339         split_every=split_every,
--> 340         out=out,
    341     )
    342     return result

~/sandbox/dask/dask/array/reductions.py in reduction(x, chunk, aggregate, axis, keepdims, dtype, split_every, combine, name, out, concatenate, output_size, meta)
    155     # The dtype of `tmp` doesn't actually matter, and may be incorrect.
    156     tmp = blockwise(
--> 157         chunk, inds, x, inds, axis=axis, keepdims=True, dtype=dtype or float
    158     )
    159     tmp._chunks = tuple(

~/sandbox/dask/dask/array/blockwise.py in blockwise(func, out_ind, name, token, dtype, adjust_chunks, new_axes, align_arrays, concatenate, meta, *args, **kwargs)
    231         from .utils import compute_meta
    232
--> 233         meta = compute_meta(func, dtype, *args[::2], **kwargs)
    234     if meta is not None:
    235         return Array(graph, out, chunks, meta=meta)

~/sandbox/dask/dask/array/utils.py in compute_meta(func, _dtype, *args, **kwargs)
    125                 if has_keyword(func, "computing_meta"):
    126                     kwargs_meta["computing_meta"] = True
--> 127                 meta = func(*args_meta, **kwargs_meta)
    128             except TypeError as e:
    129                 if (

<__array_function__ internals> in sum(*args, **kwargs)

~/Envs/dask-dev/lib/python3.7/site-packages/numpy/core/fromnumeric.py in sum(a, axis, dtype, out, keepdims, initial, where)
   2227
   2228     return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
-> 2229                           initial=initial, where=where)
   2230
   2231

~/Envs/dask-dev/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
     84             # support a dtype.
     85             if dtype is not None:
---> 86                 return reduction(axis=axis, dtype=dtype, out=out, **passkwargs)
     87             else:
     88                 return reduction(axis=axis, out=out, **passkwargs)

TypeError: sum() got an unexpected keyword argument 'keepdims'

You might be able to use the pydata/sparse library instead. I don't think there's anything for dask-ml to do here.

DoDzilla-ai commented 4 years ago

Well, at least the error message can be a little bit informative. An error like this is already raised when the input data is a scipy.sparse.csr_matrix: TypeError: Cannot fit PCA on sparse 'X'. I spent 1-2 hours trying to figure out the TypeError: sum() got an unexpected keyword argument 'keepdims'. PS: I am a newbie. Sorry...

TomAugspurger commented 4 years ago

Well, at least the error message can be a little bit informative

Perhaps, though I don't recall if we can always distinguish between a Dask Array backed by scipy.sparse matrices and a Dask Array backed by a sparse ndarray. Is this something you're interested in investigating further? The Array._meta attribute may have the information we need.
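One possible shape for such a check is sketched below. The helper name `raise_if_scipy_sparse` and its message are hypothetical and not part of dask-ml; the idea is that it could be called on an array's `_meta` (or on a concrete chunk) before fitting. Note that in the report above the array repr shows chunktype=numpy.ndarray even though compute() returns a csr_matrix, so the metadata alone may not catch arrays built with from_delayed.

```python
import numpy as np
import scipy.sparse


def raise_if_scipy_sparse(chunk_like):
    """Hypothetical helper: fail early with a readable message.

    `chunk_like` is assumed to be an array's `_meta` or one of its chunks.
    """
    if scipy.sparse.issparse(chunk_like):
        raise TypeError(
            "Cannot fit PCA on a Dask Array backed by scipy.sparse chunks; "
            f"got chunk type {type(chunk_like).__name__}."
        )


# A NumPy-backed chunk passes silently.
raise_if_scipy_sparse(np.empty((0, 0)))

# A scipy.sparse chunk triggers the informative error.
try:
    raise_if_scipy_sparse(scipy.sparse.eye(3, format="csr"))
except TypeError as exc:
    message = str(exc)
    print(message)
```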