cicdw opened this issue 7 years ago
From the perspective of someone who is not intimately familiar with this field, it would be useful to see an example algorithm for a particular case (perhaps un-regularized logistic regression using gradient descent) and then highlight which elements of the code we would change if we were to add L1 or L2 regularization, switch out logistic for some other GLM family, or replace gradient descent with something else.
If it is possible to create an example where those options are swappable, then that exercise may help to inform a general solution. I'm talking well outside of my expertise though.
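To make the exercise concrete, here is a rough, untested sketch in plain numpy; the helper `soft_threshold` mentioned in a comment is hypothetical, and the inline comments mark the pieces that would be swapped out:

```python
import numpy as np

def fit_logistic_gd(X, y, step=0.1, max_iter=500, tol=1e-8):
    """Un-regularized logistic regression via gradient descent (y in {0, 1})."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1 / (1 + np.exp(-X.dot(beta)))   # <- mean function: swap for another GLM family
        grad = X.T.dot(p - y) / len(y)       # <- gradient of the negative log-likelihood
        # grad += lam * beta                                   # <- L2 regularization would be added here
        # beta_new = soft_threshold(beta - step * grad, lam)   # <- L1 via a proximal step instead
        beta_new = beta - step * grad        # <- replace this update to swap out gradient descent
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```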
Regarding scikit-learn, is it possible to modestly extend the fit/predict API with some other verb to provide inferential information?
should we expose an API to access the underlying algorithms or should we bake them into the various specific GLM families we want to consider?
I generally try to build concrete things first and then generalize them. However it requires some finesse to build the concrete thing quickly enough to get valuable user feedback while also not painting yourself into a corner to the point where you can't refactor later. Sorry for the platitude; I have nothing concrete to provide here.
@moody-marlin Thanks for raising this.
However, we don't want to adhere to this too strongly (especially in output), because we started with the goal of intelligently toggling between inferential and predictive applications.
What type of outputs do you expect that would be different from scikit-learn's?
@mrocklin Thanks for the feedback, that's a good exercise that I can work on writing up.
@hussainsultan I would expect output more similar to statsmodels, so something like a model object that contains p-values, AIC, BIC, Somers' D, adjusted R^2, outlier / residual tests, etc. Some of these values are still meaningful for regularized problems (e.g., R^2) while others are not (e.g., p-values).
I'd have to cast my vote for something that hews closer to a statsmodels approach than a scikit-learn one, both in terms of its API and the types of outputs it provides.
In terms of what is exposed and where, here is where my thoughts currently stand: https://github.com/moody-marlin/dask-glm/blob/api_build/dask_glm/logistic.py
cc: @mrocklin @jcrist @hussainsultan
Throwing Chris' example into a comment here for discussion
class LogisticModel(object):

    def hessian(self): ...
    def gradient(self): ...
    def loglike(self): ...

    def fit(self, tol=1e-8, warm_start=None, random_state=None, max_iter=250): ...

    def _initialize(self): ...
    def _check_inputs(self): ...
    def _set_intercept(self): ...

    def __init__(self, X, y, reg=None, lamduh=0, fit_intercept=True): ...
I'm not qualified to judge on most of this but I'm curious what some of these methods do. Do all models implement a hessian? Or is this just used during solving if it's implemented?
Is there a way to satisfy both the sklearn and statsmodels APIs? Roughly what is the statsmodels API?
Are the solvers built here tailored to Logistic regression or are they more general?
All good questions; the current solvers which are immediately generalizable to other models are `gradient_descent`, `admm`, and `newton`, by replacing function / gradient / hessian evaluations with calls to the above methods, and by replacing the call to `shrinkage` in `admm` with a call to a `proximal_operator`, which can be used for any regularizer.
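For concreteness, something like this toy Newton iteration (illustrative only, written against hypothetical `gradient(beta)` / `hessian(beta)` callables rather than the exact method signatures above) shows why swapping families doesn't touch the solver:

```python
import numpy as np

def newton(gradient, hessian, beta0, max_iter=50, tol=1e-8):
    """Toy Newton solver: only touches the model through its gradient/hessian."""
    beta = beta0
    for _ in range(max_iter):
        step = np.linalg.solve(hessian(beta), gradient(beta))
        beta_new = beta - step
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```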
In my mind, there are two big differences between sklearn and statsmodels:

1. Sklearn is initialized with algorithm parameters, and the data is passed in the `.fit()` call. Statsmodels, on the other hand, is initialized with data and then the call to `.fit()` accepts algorithm parameters.
2. Sklearn's output is essentially the fitted `coefs`, and a few things about convergence. Statsmodels outputs a model object with many new attributes, including various inferential statistics and statistical tests, along with a `.summary()` method to see all this.

In my mind, the organization of arguments in statsmodels makes more sense, but it could be that sklearn's interface is more amenable to pipelines and deployment?
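Roughly, the two calling conventions look like this (standard usage of both libraries, with `X` and `y` assumed to be in-memory arrays):

```python
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm

# sklearn: algorithm parameters at construction, data at .fit(), fitted
# attributes (coef_, n_iter_, ...) attached to the estimator afterwards
clf = LogisticRegression(C=1.0, tol=1e-8).fit(X, y)
print(clf.coef_)

# statsmodels: data at construction, algorithm parameters at .fit(),
# a results object with inferential statistics and a .summary() method
result = sm.Logit(y, X).fit(method='newton', maxiter=50)
print(result.summary())
```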
Re: the methods, the loglike / gradient / hessian methods will be used both for fitting the model parameters and for producing output statistics. `_initialize()` will be used for creating initializations for the algorithms (there are some smarter ways than just all zeros that could lead to quicker convergence). `_check_inputs` will check that the data is chunked appropriately, and will handle differences between dask dataframes vs. dask arrays. `_set_intercept` will simply add a column of all 1's to the input data.
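For example, `_set_intercept` could be as small as something like this for a dask array input (an assumed implementation, not the actual code):

```python
import dask.array as da

def _set_intercept(X):
    """Append a column of ones so the model fits an intercept term."""
    ones = da.ones((X.shape[0], 1), chunks=(X.chunks[0], 1))
    return da.concatenate([X, ones], axis=1)
```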
OK... these are my thoughts on how to make this work with the sklearn interface:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.extmath import safe_sparse_dot

import dask_glm


def grad(w, X, y, lamduh):
    # compute the objective value `out` and its gradient `grad` here
    ...
    return out, grad


class Logistic(BaseEstimator, ClassifierMixin):

    def __init__(self, tol=1e-8, lamduh=0.1):
        # sklearn convention: only store the parameters here
        self.tol = tol
        self.lamduh = lamduh

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.beta = self._solver(X, y)
        return self

    def predict(self, X):
        return np.sign(safe_sparse_dot(X, self.beta))

    def _solver(self, X, y):
        """This can be configurable based on methods in logistic.py; similarly,
        we could add other methods like `def _hessian`."""
        return dask_glm.proximal_gradient(X, y)

    def summary(self):
        pass
The reason I like this is that it will work with sklearn as long as all of our methods work with numpy as well as dask.array.
Also, if it were a dask.array, we could possibly use dask-learn pipelines.
@jcrist thoughts on above?
Could you provide an example of a feature we gain by inheriting directly from sklearn? I don't clearly see what these sklearn classes provide us.
Scikit-learn doesn't enforce inheriting from a base class, rather it requires a duck-typed interface. However, all of scikit-learn is written expecting immediate array-likes, so I'm not sure if dask-glm classes can really be mixed with scikit-learn stuff directly (at least when passing in dask-arrays). That said, I think matching the interface is a good idea, but not necessarily subclassing (could go either way here).
The required interface is described in detail here: http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator.
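In other words, something like the following satisfies the contract without inheriting from sklearn's base classes (a minimal illustration; the class name is made up):

```python
class DuckLogistic(object):
    """Duck-typed sklearn-style estimator: no inheritance required."""

    def __init__(self, lamduh=0.1, tol=1e-8):
        # only store the parameters; no validation or work in __init__
        self.lamduh = lamduh
        self.tol = tol

    def get_params(self, deep=True):
        return {'lamduh': self.lamduh, 'tol': self.tol}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y):
        # ... call into dask-glm here, set the fitted coefficients ...
        return self

    def predict(self, X):
        ...
```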
also, if it were a dask.array we could possibly use dask-learn pipelines.
Those have been removed, as they were difficult to maintain, and not that useful in practice. That package now only implements model-parallelism, accelerating grid-search/randomized-search across small data. dask-glm wouldn't really fit with the old pipelines anyway, as it evaluates immediately, while those were lazy.
Thanks @jcrist, I wasn't aware that duck typing was sufficient for the sklearn interface. This is super helpful!
@moody-marlin I am happy with what you proposed, with the slight modification that we try to adhere to the sklearn API. I think this is general enough that it will allow for the things that you wanted to do. One example that I could think of was using sklearn pipelines, which will work for numpy inputs: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html What do you think?
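e.g. something like this should just work for numpy inputs, assuming the `Logistic` estimator sketched above (illustrative, untested; `X_train` / `y_train` are placeholders):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scale', StandardScaler()),
                 ('logistic', Logistic(lamduh=0.1))])
pipe.fit(X_train, y_train)   # fine as long as Logistic.fit accepts numpy arrays
preds = pipe.predict(X_test)
```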
class LogisticModel(object):

    def hessian(self): ...
    def gradient(self): ...
    def loglike(self): ...

    def fit(self, tol=1e-8, warm_start=None, random_state=None, max_iter=250): ...

    def _initialize(self): ...
    def _check_inputs(self): ...
    def _set_intercept(self): ...

    def __init__(self, X, y, reg=None, lamduh=0, fit_intercept=True): ...
This strikes me as an inappropriate mixture of categories. Methods like `gradient` and `hessian` refer, I assume, to the ability to compute individual values and derivatives of a particular function. And of course, not every model we'd like to build fitting algorithms for is differentiable (e.g., regularized models). I know I'm late to the party, but is there a technical reason we need to conflate functions and models?
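Something along these lines is what I have in mind (purely a sketch; all names are hypothetical):

```python
class LogisticLoss(object):
    """The differentiable piece: knows how to evaluate and differentiate itself."""
    def func(self, beta, X, y): ...
    def grad(self, beta, X, y): ...
    def hessian(self, beta, X, y): ...


class Model(object):
    """The model merely composes a loss with an (optionally non-smooth) regularizer."""
    def __init__(self, loss, regularizer=None):
        self.loss = loss                # e.g. LogisticLoss()
        self.regularizer = regularizer  # e.g. an L1 penalty, which has no hessian
```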
I did a bit of rough tinkering to hook up statsmodels and dask / dask-glm here. The two biggest tasks were getting the statsmodels Model class to accept dask arrays (that version is hacky: no error checks, no patsy), and using dask-glm in the `.fit` step.
The most awkward part was statsmodels' `.fit` methods (at least the likelihood-based ones) using something like `scipy.optimize.minimize`, which takes a function, starting parameters, and maybe a gradient or hessian. Contrast that with dask-glm, which takes an `X`, `y`, and `family`.
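Roughly, the two shapes of the hand-off (the dask-glm signature here is from memory, so treat it as an assumption; `neg_loglike`, `score`, `start_params`, `X`, `y` stand in for the obvious pieces):

```python
from scipy.optimize import minimize
from dask_glm import algorithms, families

# statsmodels-style: minimize a scalar function of the parameters
res = minimize(neg_loglike, start_params, jac=score)

# dask-glm-style: hand over the data (and a family), get coefficients back
beta = algorithms.admm(X, y, family=families.Logistic, lamduh=0.1)
```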
I haven't really thought through whether dask-glm splitting the optimization part off from the data part is even desirable from dask-glm's point of view. Just wanted to share some feedback.
This notebook also ran fine on a distributed system (I think), at least locally. You might want to add a `.persist()` at the end of the `from_array` calls.
In general I like that this seems to be on a path towards figuring out a good hand-off point between the projects.
# This section is a bit awkward. From statsmodels POV, dask-glm would
# provide an optimizer f: llf, score -> opt
optimizer = self._get_optimizer()()
cc @moody-marlin thoughts on this? @TomAugspurger is this the full expected interface below?
class DaskOptimizer(Optimizer):
    def _fit(self, f, score, start_params, fargs, kwargs, hessian,
             method, disp, maxiter, callback, retall, full_output):
        from dask_glm.algorithms import admm
        return admm(self.exog, self.endog)
Here you hard code Logistic. Is there a general way to do this within statsmodels? Or is this patsy?
class DaskGLM(GLM):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
from dask_glm.families import Logistic
self.family = Logistic
is this the full expected interface below?
IIRC, this isn't perfectly unified across all of statsmodels. Certain fit methods / optimizers have their own arguments. I was meaning to look more closely but ran out of time.
Here you hard code Logistic. Is there a general way to do this within statsmodels? Or is this patsy?
You provide a family argument when you instantiate the model:
sm.GLM(data.endog, data.exog, family=sm.families.Gamma())
I haven't looked, but I assume that statsmodels' version has all the info we need to look up the corresponding dask-glm version.
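i.e., something like this lookup, assuming dask-glm grows families with matching names (only `Logistic` is referenced above; `Normal` / `Poisson` here are assumptions):

```python
import statsmodels.api as sm
from dask_glm import families

FAMILY_MAP = {
    sm.families.Binomial: families.Logistic,
    sm.families.Gaussian: families.Normal,   # assumed to exist
    sm.families.Poisson: families.Poisson,   # assumed to exist
}

def to_dask_glm_family(sm_family):
    return FAMILY_MAP[type(sm_family)]
```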
@TomAugspurger what do you think is the right way forward here? Is this worth pursuing? What are things that should be worked on on the dask-glm side? Are there open questions for the statsmodels community?
I'll do some more experimentation, but it'll be a bit.
Are there open questions for the statsmodels community?
I'd like to get a more cleaned-up version of dask + statsmodels working, essentially to replace https://github.com/statsmodels/statsmodels/blob/master/examples/notebooks/distributed_estimation.ipynb. I'd prefer statsmodels users use dask for larger datasets, rather than writing their own larger-than-memory handling.
What are things that should be worked on on the dask-glm side?
Not really sure at this point. I still don't have a good feeling for how much additional work separating the optimizer would be. Should have some time in a few weeks.
If you're interested, I've taken a stab at this in https://github.com/dask/dask-glm/pull/40
One of the biggest hanging issues is to organize / discuss what the API will look like for `dask_glm` and what sorts of things we expose. This decision impacts algorithm structure going forward (as we incorporate other GLM families and regularizers), so I will need some help focusing the following discussion on specific action items. My initial thoughts were:

- Aim for generality so users can create their own GLM families / regularizers: My initial thought along these lines was to have an `Optimizer(**initialization_args, **numerical_algo_args, **regularization_args)` class that can be inherited by any class that implements `func(), grad(), hessian()` methods (a rough sketch is at the end of this post). This would allow us to focus on algorithm development separately from model specification and add additional model types (Normal, Poisson, etc.) easily. The user then calls `Logistic(X, y, **kwargs)` to initialize a model, and then `.fit(**algo_args)` calls on the methods from `Optimizer`. This can get hairy, and could possibly prevent further optimizations that exploit specific structure inherent in the models.
- Hard code the models users can choose from: Per my conversation with @jcrist, we might not want to aim for extreme flexibility in the model API, as most people probably don't need / want to define their own GLM families or their own regularizers. Relatedly, we might want to implement the algorithms specifically for each family / regularizer.
- Regularized vs. unregularized models: How do we manage the 'switch' from regularized to unregularized models? Should it be something as simple as `model = LogisticModel(X, y, reg='l2')` and the user chooses from a pre-defined list of regularizers, or should we allow for `model = LogisticModel(X, y, reg=reg_func)`, where `reg_func()` is any regularization function on the parameters, allowing for customizations such as grouped lasso, non-negativity, etc.? Again, I was hoping to keep things as general as possible so that the user can mix and match, but on the algorithm side this can get difficult, especially if we want speed and accuracy.
- Inherit from scikit-learn: Per my conversation with @hussainsultan, we might want to try and stick as closely as possible to `scikit-learn`'s API, as this is common enough to be comfortable for most people. However, we don't want to adhere to this too strongly (especially in output), because we started with the goal of intelligently toggling between inferential and predictive applications (thus requiring things like inferential stats that `scikit-learn` currently doesn't include).

Ultimately, I think my thought process boils down to: should we expose an API to access the underlying algorithms or should we bake them into the various specific GLM families we want to consider?
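For the first option, the structure I have in mind is roughly the following (all names hypothetical; just a sketch):

```python
class Optimizer(object):
    """Holds algorithm / regularization parameters; fit() only ever touches
    self.func / self.grad / self.hessian, never a specific family."""

    def __init__(self, X, y, reg=None, lamduh=0, **numerical_algo_args):
        self.X, self.y = X, y
        self.reg, self.lamduh = reg, lamduh
        self.algo_args = numerical_algo_args

    def fit(self, **algo_args):
        # gradient descent / newton / admm written against the three methods below
        ...


class Logistic(Optimizer):
    def func(self, beta): ...      # negative log-likelihood
    def grad(self, beta): ...
    def hessian(self, beta): ...
```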
cc: @mrocklin @mcg1969