How to handle intercept term?

dask / dask-glm

BSD 3-Clause "New" or "Revised" License

75 stars 46 forks source link

How to handle intercept term? #13

Open cicdw opened 7 years ago

cicdw commented 7 years ago

What is the best way to handle intercepts?

Right now, the algorithms assume the user creates a column of 1s in their dask array, à la statsmodels. However, sometimes it's convenient to have a fit_intercept option similar to scikit-learn. Having this option set to True will require a step which appends a column of 1's to the user-supplied dask array, but it won't be as simple as the corresponding numpy case.

@mrocklin

mrocklin commented 7 years ago

If the question is, how do I concatenate a column of ones, then the answer is to use the da.concatenate function.

In [2]: import dask.array as da

In [3]: x = da.random.random((5, 2), chunks=(2, 2))

In [4]: o = da.ones((x.shape[0], 1), chunks=(x.chunks[0], (1,)))

In [5]: z = da.concatenate([x, o], axis=1)

In [6]: z.compute()
Out[6]: 
array([[ 0.16174789,  0.06872224,  1.        ],
       [ 0.01018076,  0.68570003,  1.        ],
       [ 0.31238221,  0.91503403,  1.        ],
       [ 0.90225416,  0.04750495,  1.        ],
       [ 0.98440154,  0.22888387,  1.        ]])

cicdw commented 7 years ago

Naive attempt at using this to add an intercept makes admm choke:

X = da.random.random((100, 2), chunks=(50,2))
y = make_y(X, beta=np.array([-1.0, 2]), chunks=(50,))
o = da.ones((X.shape[0], 1), chunks=(X.chunks[0], (1,)))
z = da.concatenate([X, o], axis=1)
admm(z, y)
...
ValueError: shapes (50,1) and (3,) not aligned: 1 (dim 1) != 3 (dim 0)

Traceback

  File "algorithms.py", line 199, in wrapped
    return func(beta, X, y) + (rho / 2) * np.dot(beta - z + u,
  File "families.py", line 17, in pointwise_loss
    Xbeta = X.dot(beta)

This could be an issue with how admm assumes the data is chunked, or there might be another way we should handle intercepts?

mrocklin commented 7 years ago

These lines are problematic

XD = X.to_delayed().flatten().tolist()
yD = y.to_delayed().flatten().tolist()

I recommend first rechunking these arrays to have only a single chunk along columns

In [8]: z
Out[8]: dask.array<concate..., shape=(100, 3), dtype=float64, chunksize=(50, 2)>

In [9]: z.rechunk((None, z.shape[1]))
Out[9]: dask.array<rechunk..., shape=(100, 3), dtype=float64, chunksize=(50, 3)>