cicdw opened this issue 7 years ago
From the perspective of someone who is not intimately familiar with this field, it would be useful to see an example algorithm for a particular case (perhaps un-regularized logistic regression using gradient descent) and then highlight which elements of the code we would change if we were to add L1 or L2 regularization, switch out logistic for some other GLM family, or replace gradient descent with something else.
If it is possible to create an example where those options are swappable, then that exercise may help to inform a general solution. I'm talking well outside of my expertise though.
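To make the exercise concrete, here is a rough, untested sketch in plain numpy; the helper `soft_threshold` mentioned in a comment is hypothetical, and the inline comments mark the pieces that would be swapped out:

```python
import numpy as np

def fit_logistic_gd(X, y, step=0.1, max_iter=500, tol=1e-8):
    """Un-regularized logistic regression via gradient descent (y in {0, 1})."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1 / (1 + np.exp(-X.dot(beta)))   # <- mean function: swap for another GLM family
        grad = X.T.dot(p - y) / len(y)       # <- gradient of the negative log-likelihood
        # grad += lam * beta                                   # <- L2 regularization would be added here
        # beta_new = soft_threshold(beta - step * grad, lam)   # <- L1 via a proximal step instead
        beta_new = beta - step * grad        # <- replace this update to swap out gradient descent
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```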
Regarding scikit-learn, is it possible to modestly extend the fit/predict API with some other verb to provide inferential information?
should we expose an API to access the underlying algorithms or should we bake them into the various specific GLM families we want to consider?
I generally try to build concrete things first and then generalize them. However it requires some finesse to build the concrete thing quickly enough to get valuable user feedback while also not painting yourself into a corner to the point where you can't refactor later. Sorry for the platitude; I have nothing concrete to provide here.
@moody-marlin Thanks for raising this.
However, we don't want to adhere to this too strongly (especially in output), because we started with the goal of intelligently toggling between inferential and predictive applications.
What type of outputs do you expect that would be different from scikit-learn's?
@mrocklin Thanks for the feedback, that's a good exercise that I can work on writing up.
@hussainsultan I would expect output more similar to statsmodels, so something like a model object that contains p-values, AIC, BIC, Somers' D, adjusted R^2, outlier / residual tests, etc. Some of these values are still meaningful for regularized problems (e.g., R^2) while others are not (e.g., p-values).
I'd have to cast my vote for something that hews closer to a statsmodels approach than a scikit-learn one, both in terms of its API and the types of outputs it provides.
In terms of what is exposed and where, here is where my thoughts currently stand: https://github.com/moody-marlin/dask-glm/blob/api_build/dask_glm/logistic.py
cc: @mrocklin @jcrist @hussainsultan
Throwing Chris' example into a comment here for discussion
class LogisticModel(object):

    def hessian(self): ...
    def gradient(self): ...
    def loglike(self): ...

    def fit(self, tol=1e-8, warm_start=None, random_state=None, max_iter=250): ...

    def _initialize(self): ...
    def _check_inputs(self): ...
    def _set_intercept(self): ...

    def __init__(self, X, y, reg=None, lamduh=0, fit_intercept=True): ...
I'm not qualified to judge on most of this but I'm curious what some of these methods do. Do all models implement a hessian? Or is this just used during solving if it's implemented?
Is there a way to satisfy both the sklearn and statsmodels APIs? Roughly what is the statsmodels API?
Are the solvers built here tailored to Logistic regression or are they more general?
All good questions; the current solvers which are immediately generalizable to other models are `gradient_descent`, `admm`, and `newton`, by replacing function / gradient / hessian evaluations with calls to the above methods, and by replacing the call to `shrinkage` in `admm` with a call to a `proximal_operator`, which can be used for any regularizer.
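For concreteness, something like this toy Newton iteration (illustrative only, written against hypothetical `gradient(beta)` / `hessian(beta)` callables rather than the exact method signatures above) shows why swapping families doesn't touch the solver:

```python
import numpy as np

def newton(gradient, hessian, beta0, max_iter=50, tol=1e-8):
    """Toy Newton solver: only touches the model through its gradient/hessian."""
    beta = beta0
    for _ in range(max_iter):
        step = np.linalg.solve(hessian(beta), gradient(beta))
        beta_new = beta - step
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```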
In my mind, there are two big differences between sklearn and statsmodels:

1. Sklearn is initialized with algorithm parameters, and the data is passed in the `.fit()` call. Statsmodels, on the other hand, is initialized with data and then the call to `.fit()` accepts algorithm parameters.
2. Sklearn's output is essentially the fitted `coefs`, and a few things about convergence. Statsmodels outputs a model object with many new attributes, including various inferential statistics and statistical tests, along with a `.summary()` method to see all this.

In my mind, the organization of arguments in statsmodels makes more sense, but it could be that sklearn's interface is more amenable to pipelines and deployment?
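Roughly, the two calling conventions look like this (standard usage of both libraries, with `X` and `y` assumed to be in-memory arrays):

```python
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm

# sklearn: algorithm parameters at construction, data at .fit(), fitted
# attributes (coef_, n_iter_, ...) attached to the estimator afterwards
clf = LogisticRegression(C=1.0, tol=1e-8).fit(X, y)
print(clf.coef_)

# statsmodels: data at construction, algorithm parameters at .fit(),
# a results object with inferential statistics and a .summary() method
result = sm.Logit(y, X).fit(method='newton', maxiter=50)
print(result.summary())
```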
Re: the methods, the loglike / gradient / hessian methods will be used both for fitting the model parameters and for producing output statistics. `_initialize()` will be used for creating initializations for the algorithms (there are some smarter ways than just all zeros that could lead to quicker convergence). `_check_inputs` will check that the data is chunked appropriately, and will handle differences between dask dataframes vs. dask arrays. `_set_intercept` will simply add a column of all 1's to the input data.
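For example, `_set_intercept` could be as small as something like this for a dask array input (an assumed implementation, not the actual code):

```python
import dask.array as da

def _set_intercept(X):
    """Append a column of ones so the model fits an intercept term."""
    ones = da.ones((X.shape[0], 1), chunks=(X.chunks[0], 1))
    return da.concatenate([X, ones], axis=1)
```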
OK... these are my thoughts on how to make this work with the sklearn interface:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.extmath import safe_sparse_dot

import dask_glm


def grad(w, X, y, lamduh):
    # compute the objective value `out` and its gradient `grad` here
    ...
    return out, grad


class Logistic(BaseEstimator, ClassifierMixin):

    def __init__(self, tol=1e-8, lamduh=0.1):
        # sklearn convention: only store the parameters here
        self.tol = tol
        self.lamduh = lamduh

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.beta = self._solver(X, y)
        return self

    def predict(self, X):
        return np.sign(safe_sparse_dot(X, self.beta))

    def _solver(self, X, y):
        """This can be configurable based on methods in logistic.py; similarly,
        we could add other methods like `def _hessian`."""
        return dask_glm.proximal_gradient(X, y)

    def summary(self):
        pass
The reason I like this is that it will work with sklearn as long as all of our methods work with numpy as well as dask.array.
Also, if it were a dask.array, we could possibly use dask-learn pipelines.
@jcrist thoughts on above?
Could you provide an example of a feature we gain by inheriting directly from sklearn? I don't clearly see what these sklearn classes provide us.
Scikit-learn doesn't enforce inheriting from a base class, rather it requires a duck-typed interface. However, all of scikit-learn is written expecting immediate array-likes, so I'm not sure if dask-glm classes can really be mixed with scikit-learn stuff directly (at least when passing in dask-arrays). That said, I think matching the interface is a good idea, but not necessarily subclassing (could go either way here).
The required interface is described in detail here: http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator.
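In other words, something like the following satisfies the contract without inheriting from sklearn's base classes (a minimal illustration; the class name is made up):

```python
class DuckLogistic(object):
    """Duck-typed sklearn-style estimator: no inheritance required."""

    def __init__(self, lamduh=0.1, tol=1e-8):
        # only store the parameters; no validation or work in __init__
        self.lamduh = lamduh
        self.tol = tol

    def get_params(self, deep=True):
        return {'lamduh': self.lamduh, 'tol': self.tol}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y):
        # ... call into dask-glm here, set the fitted coefficients ...
        return self

    def predict(self, X):
        ...
```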
also, if it were a dask.array we could possibly use dask-learn pipelines.
Those have been removed, as they were difficult to maintain, and not that useful in practice. That package now only implements model-parallelism, accelerating grid-search/randomized-search across small data. dask-glm wouldn't really fit with the old pipelines anyway, as it evaluates immediately, while those were lazy.
Thanks @jcrist, I wasn't aware that duck typing was sufficient for the sklearn interface. This is super helpful!
@moody-marlin I am happy with what you proposed, with the slight modification that we try to adhere to the sklearn API. I think this is general enough that it will allow for the things that you wanted to do. One example that I could think of was using sklearn pipelines, which will work for numpy inputs: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html What do you think?
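e.g. something like this should just work for numpy inputs, assuming the `Logistic` estimator sketched above (illustrative, untested; `X_train` / `y_train` are placeholders):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scale', StandardScaler()),
                 ('logistic', Logistic(lamduh=0.1))])
pipe.fit(X_train, y_train)   # fine as long as Logistic.fit accepts numpy arrays
preds = pipe.predict(X_test)
```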
class LogisticModel(object):

    def hessian(self): ...
    def gradient(self): ...
    def loglike(self): ...

    def fit(self, tol=1e-8, warm_start=None, random_state=None, max_iter=250): ...

    def _initialize(self): ...
    def _check_inputs(self): ...
    def _set_intercept(self): ...

    def __init__(self, X, y, reg=None, lamduh=0, fit_intercept=True): ...
This strikes me as an inappropriate mixture of categories. Methods like `gradient` and `hessian` refer, I assume, to the ability to compute individual values and derivatives of a particular function. And of course, not every model we'd like to build fitting algorithms for is differentiable (e.g., regularized models). I know I'm late to the party, but is there a technical reason we need to conflate functions and models?
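Something along these lines is what I have in mind (purely a sketch; all names are hypothetical):

```python
class LogisticLoss(object):
    """The differentiable piece: knows how to evaluate and differentiate itself."""
    def func(self, beta, X, y): ...
    def grad(self, beta, X, y): ...
    def hessian(self, beta, X, y): ...


class Model(object):
    """The model merely composes a loss with an (optionally non-smooth) regularizer."""
    def __init__(self, loss, regularizer=None):
        self.loss = loss                # e.g. LogisticLoss()
        self.regularizer = regularizer  # e.g. an L1 penalty, which has no hessian
```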
I did a bit of rough tinkering to hook up statsmodels and dask / dask-glm here. The two biggest tasks were getting the statsmodels Model class to accept dask arrays (that version is hacky: no error checks, no patsy), and using dask-glm in the `.fit` step.
The most awkward part was statsmodels' `.fit` methods (at least the likelihood-based ones) using something like `scipy.optimize.minimize`, which takes a function, starting parameters, and maybe a gradient or hessian. Contrast that with dask-glm, which takes an `X`, `y`, and `family`.
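Roughly, the two shapes of the hand-off (the dask-glm signature here is from memory, so treat it as an assumption; `neg_loglike`, `score`, `start_params`, `X`, `y` stand in for the obvious pieces):

```python
from scipy.optimize import minimize
from dask_glm import algorithms, families

# statsmodels-style: minimize a scalar function of the parameters
res = minimize(neg_loglike, start_params, jac=score)

# dask-glm-style: hand over the data (and a family), get coefficients back
beta = algorithms.admm(X, y, family=families.Logistic, lamduh=0.1)
```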
I haven't really thought through whether dask-glm splitting the optimization part off from the data part is even desirable from dask-glm's point of view. Just wanted to share some feedback.
This notebook also ran fine on a distributed system (I think), at least locally. You might want to add a `.persist()` at the end of the `from_array` calls.
In general I like that this seems to be on a path towards figuring out a good hand-off point between the projects.
# This section is a bit awkward. From statsmodels POV, dask-glm would
# provide an optimizer f: llf, score -> opt
optimizer = self._get_optimizer()()
cc @moody-marlin thoughts on this? @TomAugspurger is this the full expected interface below?
class DaskOptimizer(Optimizer):
    def _fit(self, f, score, start_params, fargs, kwargs, hessian,
             method, disp, maxiter, callback, retall, full_output):
        from dask_glm.algorithms import admm
        return admm(self.exog, self.endog)
Here you hard code Logistic. Is there a general way to do this within statsmodels? Or is this patsy?
class DaskGLM(GLM):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
from dask_glm.families import Logistic
self.family = Logistic
is this the full expected interface below?
IIRC, this isn't perfectly unified across all of statsmodels. Certain fit methods / optimizers have their own arguments. I was meaning to look more closely but ran out of time.
Here you hard code Logistic. Is there a general way to do this within statsmodels? Or is this patsy?
You provide a family argument when you instantiate the model:
sm.GLM(data.endog, data.exog, family=sm.families.Gamma())
I haven't looked, but I assume that statsmodels' version has all the info we need to look up the corresponding dask-glm version.
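i.e., something like this lookup, assuming dask-glm grows families with matching names (only `Logistic` is referenced above; `Normal` / `Poisson` here are assumptions):

```python
import statsmodels.api as sm
from dask_glm import families

FAMILY_MAP = {
    sm.families.Binomial: families.Logistic,
    sm.families.Gaussian: families.Normal,   # assumed to exist
    sm.families.Poisson: families.Poisson,   # assumed to exist
}

def to_dask_glm_family(sm_family):
    return FAMILY_MAP[type(sm_family)]
```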
@TomAugspurger what do you think is the right way forward here? Is this worth pursuing? What are things that should be worked on on the dask-glm side? Are there open questions for the statsmodels community?
I'll do some more experimentation, but it'll be a bit.
Are there open questions for the statsmodels community?
I'd like to get a more cleaned-up version of dask + statsmodels working, essentially to replace https://github.com/statsmodels/statsmodels/blob/master/examples/notebooks/distributed_estimation.ipynb. I'd prefer statsmodels users use dask for larger datasets, rather than writing their own larger-than-memory handling.
What are things that should be worked on on the dask-glm side?
Not really sure at this point. I still don't have a good feeling for how much additional work separating the optimizer would be. Should have some time in a few weeks.
If you're interested, I've taken a stab at this in https://github.com/dask/dask-glm/pull/40
One of the biggest hanging issues is to organize / discuss what the API will look like for `dask_glm` and what sorts of things we expose. This decision impacts algorithm structure going forward (as we incorporate other GLM families and regularizers), so I will need some help focusing the following discussion on specific action items. My initial thoughts were:

- Aim for generality so users can create their own GLM families / regularizers: My initial thought along these lines was to have an `Optimizer(**initialization_args, **numerical_algo_args, **regularization_args)` class that can be inherited by any class that implements `func(), grad(), hessian()` methods (a rough sketch is at the end of this post). This would allow us to focus on algorithm development separately from model specification and add additional model types (Normal, Poisson, etc.) easily. The user then calls `Logistic(X, y, **kwargs)` to initialize a model, and then `.fit(**algo_args)` calls on the methods from `Optimizer`. This can get hairy, and could possibly prevent further optimizations that exploit specific structure inherent in the models.
- Hard code the models users can choose from: Per my conversation with @jcrist, we might not want to aim for extreme flexibility in the model API, as most people probably don't need / want to define their own GLM families or their own regularizers. Relatedly, we might want to implement the algorithms specifically for each family / regularizer.
- Regularized vs. unregularized models: How do we manage the 'switch' from regularized to unregularized models? Should it be something as simple as `model = LogisticModel(X, y, reg='l2')` and the user chooses from a pre-defined list of regularizers, or should we allow for `model = LogisticModel(X, y, reg=reg_func)`, where `reg_func()` is any regularization function on the parameters, allowing for customizations such as grouped lasso, non-negativity, etc.? Again, I was hoping to keep things as general as possible so that the user can mix and match, but on the algorithm side this can get difficult, especially if we want speed and accuracy.
- Inherit from scikit-learn: Per my conversation with @hussainsultan, we might want to try and stick as closely as possible to `scikit-learn`'s API, as this is common enough to be comfortable for most people. However, we don't want to adhere to this too strongly (especially in output), because we started with the goal of intelligently toggling between inferential and predictive applications (thus requiring things like inferential stats that `scikit-learn` currently doesn't include).

Ultimately, I think my thought process boils down to: should we expose an API to access the underlying algorithms or should we bake them into the various specific GLM families we want to consider?
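For the first option, the structure I have in mind is roughly the following (all names hypothetical; just a sketch):

```python
class Optimizer(object):
    """Holds algorithm / regularization parameters; fit() only ever touches
    self.func / self.grad / self.hessian, never a specific family."""

    def __init__(self, X, y, reg=None, lamduh=0, **numerical_algo_args):
        self.X, self.y = X, y
        self.reg, self.lamduh = reg, lamduh
        self.algo_args = numerical_algo_args

    def fit(self, **algo_args):
        # gradient descent / newton / admm written against the three methods below
        ...


class Logistic(Optimizer):
    def func(self, beta): ...      # negative log-likelihood
    def grad(self, beta): ...
    def hessian(self, beta): ...
```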
cc: @mrocklin @mcg1969