JuliaStats / Roadmap.jl

A centralized location for planning the direction of JuliaStats

StatisticalModel interface #4

Open johnmyleswhite opened 10 years ago

johnmyleswhite commented 10 years ago

StatsBase.jl defines a type called StatisticalModel. This type (and possibly a few others) should define the interface for a broad class of models in statistics and machine learning. Just as the Distributions package has given us a really powerful set of basic methods that we can expect all distributions to implement, I think StatisticalModel can do this as well.

So I'd like us to start to agree on which methods we expect to be implemented. It would also be good to articulate abstractions we'd like to see, even if we don't have a clear interface in mind yet.

Before starting the debate, I'd like to point out that I think the interface for StatisticalModel is currently excessively focused on models that use linear predictors. Linear predictor models should probably occupy a lower position in the interface/type hierarchy, because many statistical models will not implement methods like coef or coeftable.

As we discuss, some useful sources of ideas can be found in:

Right now, StatisticalModel provides the following interface:

And the lower type, RegressionModel <: StatisticalModel defines these methods:
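From memory -- the authoritative lists live in StatsBase itself, so treat these as approximate:

# StatisticalModel (approximately): coef, coeftable, confint, deviance,
#   loglikelihood, nobs, stderr, vcov
# RegressionModel <: StatisticalModel additionally (approximately): fit, predict,
#   residuals, model_response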

I think we should refine this specification considerably. Our interface should not, for example, impose frequentist ideas like confidence intervals on every statistical model. It's entirely reasonable for models to only define credible intervals -- and I think it's not a great idea if confint returns credible intervals since that constitutes a multiple dispatch pun.

Our interface should also make it easier to capture methods like multidimensional scaling, principal components analysis and other forms of dimensionality reduction. And it should capture predictive models like support vector machines, which are not probabilistic models, but are clearly statistical.

andreasnoack commented 10 years ago

A few comments for now. The present structure was implemented because I wanted the generic functions defined somewhere upstream, and the type hierarchy was of lower priority. However, I did think a little bit about the two implemented levels.

Generally, I think that defining a fine grained type system for statistical models could be a difficult exercise and I am not sure the gains will be big, but I hope I am wrong.

I don't think you are right when you claim that linear predictor models are the right level for coef. coef must be defined for something more general, like a semiparametric model. When I added StatisticalModel I think I had something like semiparametric models in mind. If non-parametric models are needed we have to make a distinction, but I am not sure which generics we could define for a StatisticalModel then.

Regarding the Bayesian/frequentist discussion, I would prefer a strict frequentist solution ;-), but I admit that if we want to make this thing move, we have to be pragmatic. As I said in the beginning, the useful thing is to have the generics defined upstream, and therefore confint should be defined upstream and Bayesians are then free not to use it. Generally, I think the generics should return an error for upstream types, acting like abstract methods.

Finally, we need to figure out what a statistical model is. In my world, it is a collection of probability measures and therefore it is difficult for me to think of a statistical model that is not probabilistic, but your definition seems to be different.

nalimilan commented 10 years ago

I agree that some models may not want to implement confint(); returning an error is perfectly fine in that case. Another function can be added for credible intervals, and not all models must implement it either.

I wonder whether there really is an interesting common denominator between all the possible models. Maybe it would make more sense to have different sub-types of models, like ParametricModel, to leave non-parametric models aside. That said, as in the case of confint(), I don't think it hurts not to implement some methods -- the question is, is such an abstraction really useful?

lucasb-eyer commented 10 years ago

I think this is a very hard task and we are not the first to tackle it, so we should definitely look at what others have done as John suggested.

The only real common denominator (greatest common divisor, actually :)) I can see is the one the scikit-learn guys took: fit (or learn or train) and predict. Everything else varies a lot amongst models. One could think of categories (or rather "tags"/interfaces), such as 'supervised', 'unsupervised', 'statistical', 'probabilistic', ... I think creating a full type hierarchy is too prone to overfitting. Also, where would we draw the line? Is anything that at least two models could provide enough to include? I think that would become a monster.

johnmyleswhite commented 10 years ago

I agree with most of the pushback to my original point.

We don't need a full type hierarchy with every possible distinction being enforced through the type system. But I do think we should try to separate out methods that are always applicable to every model versus those that require additional assumptions. I'd be happy with something closer to Base's iterable protocol.

Classification models, for example, might offer a protocol with methods like classprobs to output per-class probabilities for class assignment. This is something you'd like from most classification methods, so it would be nice to say that most models will implement it.

That said, I think @lucasb-eyer is right that fit is potentially the only truly universal method across all models. After fit, I'd say that predict is universal if you're willing to use that term for unsupervised learning. I'd personally prefer using a term like transform for those cases, but don't feel very strongly.

More generally, I'd argue that statistical models should all implement something like fit, predict and transform. Parametric models can supplement this with parameters, confint, credint, vcov, etc. We'd have to decide what those methods should typically return for non-parametric methods, but so many methods we'll want are parametric that the parametric case seems worth getting right first.
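To make the layering concrete, here is a rough sketch -- not a settled design, and parameters and credint are made-up names -- of how the generics could be declared upstream so that packages only add methods for the levels they support:

# universal verbs: every statistical model adds methods to these
function fit end        # fit(SomeModel, data...) -> fitted model
function predict end    # predict(model, newdata) -> predicted responses/labels
function transform end  # transform(model, newdata) -> e.g. a lower-dimensional representation

# extras that only parametric models would implement
function parameters end # parameters(model) -> estimated parameters
function credint end    # credint(model) -> credible intervals (Bayesian models)
# confint(model) and vcov(model) would live at this same level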

Part of what I don't like about coef is that it's not obvious why you'd use it to refer to the parameters of models that are clearly not coefficients. What would coef do for a multi-layer neural network, for example?

The value of standardization, if we're going to do much of it, will also depend on how reliable things will be. Right now StatsBase really just defines the names of generic functions, but we might want the argument types and order to be consistent across models as well. For example, should predict(model, input_vector) exist for all models?

For future reference, my personal sense is that statistical models are models used by statisticians and applied to data. This includes probabilistic models, but statistical models are a superset of probabilistic models. MDS and SVMs are two good examples of dimensionality reduction and classification tools, respectively, that do not define measures, even though they can be analyzed with reference to them.

For those interested in the value of generic interfaces, the caret package in R is a really powerful example of what a consistent interface can offer. That package makes it very easy to try out many different models without worrying about the exact interface offered by each model. @lindahua's code for EM also shows off the power of generic code for distributions.

lucasb-eyer commented 10 years ago

I agree with all of this and like the protocols idea, as long as these stay rather thin. For example, I don't see much more than classprobs going into a generic classifier interface. Even that is already nontrivial (but possible) for SVMs and k-NN algorithms.

If we choose fit as the method name, how do we actually go about dispatch? Do we add a dummy parameter modeltype to all fit methods: model = fit(SVMFit, data, labels)? From a user's perspective, I prefer each method having a constructor of its own name: model = svm(data, labels), as long as everybody agrees on the main parameters. But then again, model = fit(svm(), data, labels) is not that bad and allows for fancy online stuff like

forest = loadforest("faces.forest")
fit(forest, more_data, more_labels)

To further extend on "everybody agrees on the main parameters", not only should the main parameters' order be consistent, but also their layout: all methods which take a "list of points" should be able to handle a dataframe and a matrix with the points stored as rows. For the matrix case, it might be better performance-wise for some algorithms to take the data in columns rather than rows, but that would be confusing for a user trying out various models. The point you made somewhere else about all iterative methods naming their iteration cap consistently, e.g. maxiter, is relevant here, too.

I agree that transform is more natural, at least for dimensionality reduction, though for generic programming I'd say the line between applying that and multivariate regression is a fine one. The thing with transform is that this word might be used by many, many other packages (graphs, 3d models/matrices, ...), while I have yet to come up with a non-stats/ml package that would reasonably want to use predict. Though it's kind of sad we even have to think about this.

To the merits of a generic interface, I'd like to add that generic fit and predict are already enough to write generic parameter search using cross-validation, leave-one-out and friends, as well as one-vs-all and one-vs-one multiclass extensions to binary classifiers. We might also want to recommend that models implement Julia's serialization interface, though that might be an orthogonal issue.
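For instance, a k-fold cross-validator needs nothing but those two verbs (sketch only; evaluate_cv and the rows-as-samples layout are my own assumptions, not an agreed interface):

function evaluate_cv(M, X, y, loss; k = 5)
    n = size(X, 1)                        # assumes one sample per row
    scores = Float64[]
    for i in 1:k
        test = collect(i:k:n)             # simple interleaved folds
        train = setdiff(1:n, test)
        model = fit(M, X[train, :], y[train])   # generic fit ...
        yhat = predict(model, X[test, :])       # ... and generic predict
        push!(scores, loss(y[test], yhat))
    end
    return sum(scores) / length(scores)         # mean loss across folds
end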

lindahua commented 10 years ago

My experience with machine learning research has taught me that it is very difficult to come up with a "standard interface" that fits all (or most) statistical models, because of the vast diversity of the models.

However, I do see the value of a consistent interface especially for writing generic algorithms.

I think a reasonable approach is to have a guideline in interface design, which may include a list of suggested method names, the order of parameters, the names of keyword arguments & their semantics.

lindahua commented 10 years ago

Generally, APIs are quite different for generative and regression models:

Generative models can be considered as distributions in a general sense. The API for such models would be similar to what we have for distributions. For example, you can fit them to a set of observed data and then evaluate logpdf over a given set of samples. However, people often do not directly use generative models to make predictions. Therefore, predict should not be a required method for such models.

Regression models are usually trained on pairs of observations, say (x, y). Such models can be used to predict y based on x. When y is a class label or a set of labels, such models are often referred to as discriminative models -- in this sense, I think discriminative models are just a special case here. In general, such models should implement both fit and predict.

An important lesson that I learned in implementing the MixtureModels package is that there is no general way to know the number of samples when given a set of observations -- this number, however, is necessary in implementing most algorithms. For example, each sample can be a scalar, a vector, an image, a graph, etc., and correspondingly, a set of samples can be organized as a vector, a matrix, a 3-dimensional array, or something even more complex. To facilitate generic algorithms, it is advisable to suggest that each model implement a method numsamples, with an interface numsamples(model, data).
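A sketch of what I have in mind (the container conventions below are just one possibility, not a prescription):

numsamples(model, data::AbstractVector) = length(data)    # one scalar/object per element
numsamples(model, data::AbstractMatrix) = size(data, 2)   # one sample per column
# models over richer observations (images, graphs, ...) would add their own methods,
# e.g. numsamples(model::MyGraphModel, data::Vector{MyGraph}) = length(data)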

Here is a type hierarchy that may roughly capture these concepts:

abstract StatisticalModel

abstract GenerativeModel <: StatisticalModel
abstract Distribution <: GenerativeModel

abstract RegressionModel <: StatisticalModel
abstract DiscriminativeModel <: RegressionModel
abstract Classifier <: DiscriminativeModel
johnmyleswhite commented 10 years ago

I agree with the spirit of @lindahua's proposed hierarchy. When we discussed the GenerativeModel type before, I had the impression people felt it was so similar to Distribution that we might not need both. What will distinguish GenerativeModel from Distribution?

I'm a little surprised to see Classifier is an ancestor of RegressionModel. Why? In this type hierarchy, where do logistic regression and SVM classification fall?

I agree with Dahua's suggestion that a guideline might be the best way to keep interfaces similar across models without having to create a type hierarchy that's hard to get right and could be like a straitjacket if gotten wrong.

lindahua commented 10 years ago

I think in this hierarchy Classifier is a descendant of RegressionModel.

lucasb-eyer commented 10 years ago

(Sorry, this has become rather long.)

I much prefer the idea of a guideline of functions to implement, and I am opposed to a deep type hierarchy as they usually lead to deeply rooted problems down the road.

Unfortunately, I don't agree with @lindahua's hierarchy, because:

  1. A discriminative model is not a special case of a regression model. See, e.g., the introduction of the paper by A. Ng and M. Jordan, which makes this very clear.
  2. A classifier doesn't have to be discriminative. One can learn a generative model of the joint P(x,y), apply Bayes' rule to get to the posterior P(y|x) and then classify the datapoint x to the label y maximizing the posterior. E.g. the naive Bayes classifier does that.

In my world view, which might be skewed, classifier/regressor and generative/discriminative are two orthogonal attributes, and can thus not be shoehorned into a tree of types. And while I agree classifiers can be seen as a special case of regression (maybe that's what @lindahua meant?), I'm not sure that always works out/makes sense in general.

I'd rather look at verbs (functions) instead of nouns (types) and, more specifically, how a generic algorithm would want to use those verbs on different models. I'll use as examples a generic k-fold cross-validator (kCV) and a generic one-vs-all (OVA) multiclass classifier.

The kCV should be able to take as input any type of model which provides a method for the fit and predict functions, and call those.

The OVA classifier should be able to take as input any type of classifier, be it generative or discriminative, but no regressor. The difference being only that the outputs of predict are either Symbols or a set of probabilities.
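As a sketch (fit_ova, predict_ova and the classprob convention below are made up for illustration), the OVA wrapper only needs the classifier verbs:

# train one binary classifier of type C per class; positives are that class, negatives the rest
function fit_ova(C, X, y)
    return [(label, fit(C, X, y .== label)) for label in unique(y)]
end

# assumed convention: classprob on a binary classifier returns [p(negative), p(positive)]
function predict_ova(models, x)
    scores = [classprob(m, x)[2] for (label, m) in models]
    return models[findmax(scores)[2]][1]   # label of the most confident binary model
end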

@lindahua has a good point about generative models: some might just be fit and then sampled from (using a method of rand?), but never used to predict. Thus, predict is not something to be required by all model types.

Maybe I've been too spoiled by Python and Go's duck typing (a blessing when coming from C++), but I don't see a type tree here. Is there any way in Julia for a type to have two parents, i.e. to build a type graph? If so, I could see (pseudocode):

abstract StatisticalModel

abstract GenerativeModel <: StatisticalModel
abstract DiscriminativeModel <: StatisticalModel

abstract Classifier <: StatisticalModel
abstract Regressor <: StatisticalModel

NaiveBayesClassifier <: Union(GenerativeModel, Classifier)
LogisticRegression <: Union(DiscriminativeModel, Classifier)
GaussianMixtureRegression <: Union(GenerativeModel, Regressor)

Stepping back a bit, as I understood, the main reason for even coming up with a hierarchy is to be able to define a set of functions (in StatsBase) which models should implement methods of, so that generic code can use multiple models implementing said functions without having Julia whine about a name clash, right?

Thus on the one hand, we want as few functions as possible, because most models will not implement most functions (and a model author would rather be confronted with a small list than a huge one), but on the other hand we want to cover every possible function two models might have in common, so as to make it possible to use both models in generic code?

If Julia did automatically merge two same-named functions declared in two separate packages into being the methods of one function, we wouldn't need to define any type hierarchy and functions, except maybe for StatisticalModel, and let type inference do its thing in the generic functions. Am I missing something?

lucasb-eyer commented 10 years ago

@lindahua: I don't understand your statement about numsamples. Care to elaborate or give me pointers?

lindahua commented 10 years ago

@lucasb-eyer I understand all these points about how one can learn a generative model and apply Bayes' rule to perform classification, as well as all the academic discussion.

The distinction between generative model & discriminative models in my previous post is completely from the standpoint of programming interface.

Here, a regression model is broadly defined as any model concerned with the relation between two variables x and y, where y can be a class label, a sequence of labels, a real-valued output, a segmentation map, etc. So the API for such a model might look like:

model = fit(x, y, ....)
y = predict(model, x)

Therefore, naive Bayes, logistic regression, or even a CRF can be categorized under this very general umbrella.

The entire type hierarchy is entirely based on programming API (e.g. what methods would be implemented by that model). No matter how different Naive Bayes and Logistic regression are from a math standpoint, they do share a similar interface for training and testing (as above), so I consider them under the same abstract type.

Also, I have to point out that you can't write something like:

LogisticRegression <: Union(DiscriminativeModel, Classifier)

This is simply not supported.

lucasb-eyer commented 10 years ago

Thanks @lindahua I now understand the idea behind your proposed hierarchy better, but I'm still left with a few uncertainties about it:

Finally, my guess for Classifier being a subtype of RegressionModel is that the latter would provide the predict function, which Classifier would implement to return class probabilities, but in addition Classifier would provide a classify function which returns discrete values.

(I made up the union syntax to illustrate my point, hence the "pseudocode".)

lindahua commented 10 years ago

Let me make my thoughts about the type hierarchy more explicit here.

First, we will have an abstract type StatisticalModel that covers anything related to statistical modeling (both generative & discriminative models as well as other related stuff). Due to the vast diversity of statistical models, there shouldn't be too many required methods at this level. The only method that I think most statistical models may implement is fit, which is used to construct a statistical model from given observations (and possibly other prior knowledge). The interface would look like:

fit(M, data; ...)

Here M is the model type, and data is the dataset. Of course, one can provide additional options if needed. The sets of options are probably different for different model types. However, it would be useful to standardize a list of names for commonly used options, e.g. weights, maxiter, tol, etc.
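For example (purely illustrative; MyMixtureModel and w are hypothetical names), a call following this convention might read:

model = fit(MyMixtureModel, data; weights = w, maxiter = 200, tol = 1.0e-6)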

Under the StatisticalModel type, we may have GenerativeModel and RegressionModel. The distinction between these two kinds of models is purely based on programming interface, and therefore might be different from the distinction in the academic literature (which is more from a mathematical standpoint).

To be specific, GenerativeModel is mainly for describing a set of observed data. From a pure mathematical perspective, there are no conceptual differences between GenerativeModel and Distribution. However, these two terms are often used in quite different contexts in practice, and the formulations are usually expressed in quite different ways. Distributions are usually used in cases where the underlying space is relatively simple (e.g. a finite set, a real space, or a vector space), and a pdf can typically be given in analytic form. Generative models are more often expressed as a process of generation -- typically variables/objects of multiple types are generated in this process. Typical examples of generative models include Markov random fields and Bayesian networks.

For some generative models, evaluating the probability density exactly can be extremely difficult. Take a Markov random field for example: evaluating the normalization constant involves integrating over all possible assignments of the variables, whose complexity grows exponentially as the size of the field increases. For such models, it is often enough to evaluate the probability density up to a scale. It is not easy to specify an interface that is universally applicable to all kinds of generative models. However, introducing such an abstract type allows different subtypes to be introduced under this umbrella, and then an interface specification can be explored for each subtype. A preliminary thought on this would be

abstract GenerativeModel <: StatisticalModel
abstract Distribution <: GenerativeModel
abstract BayesianNetwork <: GenerativeModel
abstract MarkovRandomField <: GenerativeModel

RegressionModel targets a quite different family of models, whose main purpose is prediction. For example, given input signals, predict the output response; or given features, predict the class label. For more complex applications (e.g. a conditional random field), we are given a more complex entity (e.g. an image, document, or video) and need to jointly make a series of related predictions. When the prediction assigns class labels, we may consider it a DiscriminativeModel. The type hierarchy thus may be outlined as

abstract RegressionModel <: StatisticalModel
abstract DiscriminativeModel <: RegressionModel
abstract Classifier <: DiscriminativeModel   # predict a single class label

# predict multiple labels jointly over some entity
abstract ConditionalRandomField <: DiscriminativeModel

Let's talk a little bit more about classifiers. The basic requirement for a classifier type should be to predict the class given the observed features. The interface for this purpose would look like:

# let c be a classifier
numclasses(c)  # --> the number of classes
predict(c, x)   # --> output the predicted label
classprob(c, x)  # --> output a probability vector p(y|x)

Naive Bayes is a good example of using a generative model for a discriminative purpose. It would be desirable to have NaiveBayes inherit from both GenerativeModel and DiscriminativeModel, but the lack of support for multiple inheritance makes it impossible to do so. Until this happens, a feasible workaround would be using the adapter pattern, as follows:

type NaiveBayesModel <: BayesianNetwork
    ...
end
type NaiveBayesClassifier <: Classifier
     model::NaiveBayesModel
end

In this way, you can implement a Naive Bayes model as a generative model, while providing a wrapper to endow it with classification capability. This may not be the most beautiful interface design, but I believe it is reasonable given the language restriction.
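For instance, the wrapper could simply forward the classifier generics to the wrapped generative model (a sketch only; posterior is a hypothetical method on the Naive Bayes model, not an agreed generic):

classprob(c::NaiveBayesClassifier, x) = posterior(c.model, x)        # p(y|x) from the generative model
predict(c::NaiveBayesClassifier, x) = findmax(classprob(c, x))[2]    # index of the most probable class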

lucasb-eyer commented 10 years ago

Nicely explained; I now agree with everything you said and also with your proposed hierarchy. Let's hope for multiple (abstract) inheritance, so we don't need too many crutches.

The next step (if others agree too) is to define which functions the different models should provide. I would be in favor of starting with as minimal a set as possible and then seeing what else there may be as we adapt more and more packages to this interface.

One doubt which came to my mind is whether predict(c::Classifier, x) should return classes or class probabilities? In the former case classifiers would provide an additional classprob(c, x) and in the latter an additional classify(c, x). Thoughts?

b-k commented 10 years ago

Hi.

I'm the author of the Apophenia library of stats functions that got mentioned in the Julia.stats mailing list a week or two ago. Although it's still a draft and has some known bugs, I have a paper describing my take on exactly the problem of developing an object describing statistical models, at http://ben.klemens.org/pdfs/klemens-modelcats.pdf . Since I've already put a lot of thought into the question you're discussing and have already generated a working library around a consistent model object, I thought I'd share my own experience.

As discussed by a few people in this thread, a casual survey of the world of stats/ML quickly turns up some clear categorizations: regressions, distributions, and classifiers 'feel' different and tend to have different applications. So it seems intuitive that one would have different subclasses for them. I wound up not doing this, for two reasons:

--First, these distinctions are social constructs, not laws of mathematics. What started me on this journey was that I wanted to test hypotheses using a microsimulation, and the microsim tools didn't have many stats methods as part of their assumed workflow, and the stats tools didn't even think about microsims as valid models. Extensibility is about writing for use cases that nobody thought of yet, and the frameworks I could find weren't really extensible to my use case.

So, a few people in the thread above asked questions of the form 'should all models M implement method X?', and the answer I shoot for with Apophenia's model object is yes for all values of (M, X) that have any mathematical validity at all.

--Second, having a hierarchy tended to not really help in implementation, because the nature of statistical models is that, outside of very generic methods like maximum likelihood, some MCMCs, or bootstrap, every operation for every model is entirely different. So I would define an abstract class, and then write a specific function for every single element of that abstract class, for every new model. Now and then I found identical behaviours: some utility functions for instrumental variables are just pointers to the same functions for OLS, but you can't even use the same predict function for OLS and IV because the standard errors are different. Perhaps somebody more experienced with building type hierarchies would get more mileage out of them than I.

Above, some people mentioned that even within one genre, it is hard to find commonalities across methods. I think a lot of this is also a social, not mathematical problem. Everybody pitches their method or model by saying that it is entirely new and different and how can you possibly compare it to the competitors, but I think of it as my job as the person implementing a framework to ignore the window dressing and find the fundamental commonalities.

Searching the mathematically-oriented literature for such commonalities, we find a consensus that statistical models can all be described as a parameterized likelihood function that maps from data and parameters to a nonnegative real value (see my above-linked paper for citations). This consensus among the theorists rings true to me: if a statistical model of any sort doesn't have a likelihood function, then it's not sufficiently specified. That includes a lot of things where the textbook exposition doesn't write down the likelihood explicitly, including some classifiers, smoothers, simulations, 'nonparametric' models, etc.

A lot of design decisions in Apophenia build on this. In the language here, the root of the hierarchy has only a likelihood function, and all model objects have to somehow implement it. Even if your model is nothing but a random number generator (i.e., it is a simulation), we can at least fake a likelihood function by making a million random draws, smoothing them, and using that probability mass function as a proxy.
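In Julia terms, that fake-likelihood trick is roughly the following (the function names and the crude window-based smoothing are mine, just to show the shape of the idea):

# approximate a density at x from nothing but random draws of the model
function approx_density(draw, x; ndraws = 100_000, width = 0.05)
    samples = [draw() for _ in 1:ndraws]
    # fraction of draws falling within a small window around x, normalized by its width
    return count(s -> abs(s - x) <= width / 2, samples) / (ndraws * width)
end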

Here at version 0.998, the apop_model object consists of:

-- a pointer to the data
-- parameters, in a numeric grid
-- a list of private sets of non-numeric or otherwise structured parameters or settings
-- a few utility functions (prep, print)
-- a set of functions that are likelihood equivalents (log_likelihood, p, RNG draw, CDF)
-- an estimate routine.

I had a colleague (one who spends her life doing method-of-moments estimations) who argued that even the estimate routine shouldn't be part of the model object. Practical utility won over theory on this.

Everything else is implemented via dispatch functions and vtables. For example, the Bayesian updating function checks whether the input models are on the table of conjugate distributions, need to be sent to MCMC, or otherwise handled. It does the vtable lookup by a hash of the likelihood-equivalents.

So, that's how I did it, and some of the rationale. Statistics is a fundamentally subjective field, so there are a lot of ways to do it, but I hope my discussion is useful to you as you design Julia's model object.

B

papamarkou commented 10 years ago

It took me a while to find the right location for this issue (!) but better late than never!

I would like to ask where you think that MCMCModel would fit in the StatisticalModel hierarchy, as I am not quite sure about this yet. For example, MCMCModel usually relies on a Distribution, on which it then calls logpdf(), gradloglik() (both already implemented in Distributions), and depending on the choice of MCMC sampler it may even call hessianloglik() or tensorloglik() (not implemented yet in Distributions). The distribution that is passed to the MCMCModel is either a density from Distributions or a user-defined one. I am passing along this information so that we can then jointly think on the placing of MCMCModel in the hierarchy.

So far we have MCMCLikModel as a subtype of MCMCModel in the MCMC package, which takes the log-likelihood as the distribution (and its derivatives as explained above). There have been many requests from users of the MCMC package to offer a more flexible model interface.

@simonbyrne due to your familiarity with MCMC, I am copying you in to comment if you have any further suggestions on this ;) Everyone's feedback is welcome.

lindahua commented 10 years ago

The more I think about these designs, the more I am feeling that the lack of multiple inheritance (from abstract types) makes it very difficult & awkward for orthogonal interfaces to interact.

papamarkou commented 10 years ago

Is multiple inheritance from abstract types a far-fetched reality for Julia, or is it a feature that could be easily added?

papamarkou commented 10 years ago

I made some first steps today towards making MCMC more flexible to interact with Distributions. We can now pass distributions as models, I will document how to do this. It needs a little more work before it is finished.