Open cicdw opened 7 years ago
If the question is, how do I concatenate a column of ones, then the answer is to use the da.concatenate
function.
In [2]: import dask.array as da
In [3]: x = da.random.random((5, 2), chunks=(2, 2))
In [4]: o = da.ones((x.shape[0], 1), chunks=(x.chunks[0], (1,)))
In [5]: z = da.concatenate([x, o], axis=1)
In [6]: z.compute()
Out[6]:
array([[ 0.16174789, 0.06872224, 1. ],
[ 0.01018076, 0.68570003, 1. ],
[ 0.31238221, 0.91503403, 1. ],
[ 0.90225416, 0.04750495, 1. ],
[ 0.98440154, 0.22888387, 1. ]])
Naive attempt at using this to add an intercept makes admm
choke:
X = da.random.random((100, 2), chunks=(50,2))
y = make_y(X, beta=np.array([-1.0, 2]), chunks=(50,))
o = da.ones((X.shape[0], 1), chunks=(X.chunks[0], (1,)))
z = da.concatenate([X, o], axis=1)
admm(z, y)
...
ValueError: shapes (50,1) and (3,) not aligned: 1 (dim 1) != 3 (dim 0)
Traceback
File "algorithms.py", line 199, in wrapped
return func(beta, X, y) + (rho / 2) * np.dot(beta - z + u,
File "families.py", line 17, in pointwise_loss
Xbeta = X.dot(beta)
This could be an issue with how admm
assumes the data is chunked, or there might be another way we should handle intercepts?
These lines are problematic
XD = X.to_delayed().flatten().tolist()
yD = y.to_delayed().flatten().tolist()
I recommend first rechunking these arrays to have only a single chunk along columns
In [8]: z
Out[8]: dask.array<concate..., shape=(100, 3), dtype=float64, chunksize=(50, 2)>
In [9]: z.rechunk((None, z.shape[1]))
Out[9]: dask.array<rechunk..., shape=(100, 3), dtype=float64, chunksize=(50, 3)>
What is the best way to handle intercepts?
Right now, the algorithms assume the user creates a column of
1
s in their dask array, à lastatsmodels
. However, sometimes it's convenient to have afit_intercept
option similar toscikit-learn
. Having this option set toTrue
will require a step which appends a column of 1's to the user-supplied dask array, but it won't be as simple as the correspondingnumpy
case.@mrocklin