TomAugspurger closed this issue 7 years ago.
This is a good find; philosophical question: is this an issue for ADMM or for the Logistic family? I imagine we could control the overflow from within the gradient / function calls there and leave ADMM untouched.
> This is a good find; philosophical question: is this an issue for ADMM or for the Logistic family?
Oh good call. The overflow is happening in the exponential of the loglike step.
Tried fixing this over here; the change prevents the overflow.
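For context, here is a minimal sketch (assuming NumPy; not necessarily the exact change in that branch) of how the overflow in the logistic log-likelihood can be avoided by never forming the exponential of large values directly:

```python
import numpy as np

def logistic_loglike(Xbeta, y):
    """Logistic negative log-likelihood, computed without overflow.

    The naive form np.log1p(np.exp(Xbeta)) overflows for large Xbeta;
    np.logaddexp(0, Xbeta) evaluates log(1 + exp(Xbeta)) without ever
    materializing exp(Xbeta).
    """
    return np.sum(np.logaddexp(0, Xbeta)) - np.dot(y, Xbeta)

# Values that would overflow np.exp stay finite here.
Xbeta = np.array([-800.0, 0.0, 800.0])
y = np.array([0.0, 1.0, 1.0])
print(logistic_loglike(Xbeta, y))  # finite, no overflow warning
```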
The default penalty for scikit-learn's `LogisticRegression` is actually `l2`, so in your experiment above the two algorithms are solving different problems (our default penalty for `admm` is `l1`). (I imagine that happened because you switched out of our API, which has the same defaults.)
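To make the comparison apples-to-apples, scikit-learn's penalty can be set to `l1` explicitly. A small sketch (the data, solver, and `C` are illustrative choices, not values from this thread):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the dataset used in the experiment above.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# penalty='l1' matches the default regularizer of our admm implementation;
# liblinear is one of the scikit-learn solvers that supports an l1 penalty.
skl = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
skl.fit(X, y)
print(skl.coef_)
```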
However, after accounting for this the two algorithms are still giving different results.
I believe this is because somewhere under the hood, scikit-learn's `fit` normalizes the columns of X, and then un-normalizes the coefficient estimates at the end. This is probably a good idea for us, too; I might add that functionality to my branch before I PR -- I'm sure there are some performance implications of doing that that I should consider first.
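A rough sketch of that normalize/un-normalize pattern, assuming NumPy (the `solve` argument is a hypothetical stand-in for any of our solvers, not an existing API):

```python
import numpy as np

def fit_normalized(solve, X, y):
    """Fit on column-scaled X, then map coefficients back to the original scale.

    `solve` is a hypothetical callable (X, y) -> beta. Scaling column j by
    1/s_j means the returned coefficients must be divided by s_j to apply
    to the unscaled features.
    """
    scale = X.std(axis=0)
    scale[scale == 0] = 1.0            # leave constant columns untouched
    beta_scaled = solve(X / scale, y)  # solve the better-conditioned problem
    return beta_scaled / scale         # un-normalize the coefficient estimates
```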
(edited to not require the new API)
Probably other algorithms too. I see that `proximal_grad` does handle it correctly. This little script compares the fit for scikit-learn `LogisticRegression` and our `admm` against a no-overflow baseline, using a copy of the dataset with `n` values in a range that will overflow when passed through the sigmoid function.

Outputs:
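As a purely hypothetical sketch of that comparison (not the actual script or its outputs), assuming NumPy and scikit-learn: scale a copy of the features so that `X @ beta` is large enough to overflow a naive sigmoid, then compare the fitted coefficients against the baseline fit:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_overflow = X * 1e4  # large values so exp(X @ beta) overflows in a naive sigmoid

for name, data in [("baseline", X), ("overflow-prone", X_overflow)]:
    coef = LogisticRegression(penalty="l1", solver="liblinear").fit(data, y).coef_
    print(name, coef)
```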