Scope of Distributions.jl

andreasnoack commented 10 years ago

I was about to import loglikelihood from Stats to avoid the conflict, but I thought it might be better to have a discussion about the scope of this package. In particular, I would like to discuss where to draw the line between Stats and Distributions. For me, the natural ordering is to let Stats depend on Distributions and to keep all functionality related to statistical analysis in either Stats or a package that depends on Stats, i.e. to have no statistical functionality in Distributions at all. As a consequence, I think things such as estimate, loglikelihood and kde belong in Stats instead of Distributions. Otherwise we might just as well abandon Stats and pull all that stuff into Distributions. The current state is kind of messy. Please share your thoughts.

StefanKarpinski commented 10 years ago

That division seems quite sensible to me. Fortunately, moving that stuff over should be fairly trivial, no?

johnmyleswhite commented 10 years ago

I agree with that division: estimate and loglikelihood should be methods of StatisticalModel.

That said, I thought we had previously said we wanted Stats and Distributions to merge in the long run, but that's probably a ways away.

The original reason we kept kde here is that we had thought we'd let you specify a distribution to use as the kernel. I think @simonbyrne has thought a bit about how to do that, but it's not trivial.

simonbyrne commented 10 years ago

Yeah, I have some code somewhere, I'll see if I can find it.

I think keeping Stats and Distributions separate makes sense for the time being. I'm not sure what the best layout is, but I think making a distinction between Distribution and StatisticalModel is a good idea in the long run, even if it seems a bit excessive at the moment.

johnmyleswhite commented 10 years ago

I think it's a really important distinction: even now, I'd say that our discriminative models (like GLMs) are examples of StatisticalModel, but not of Distribution. I've often thought we might include a type alias along the lines of typealias GenerativeModel Distribution.

StefanKarpinski commented 10 years ago

If they're the same thing, then why not just use the name Distribution everywhere instead of having a type alias?

lindahua commented 10 years ago

I have been traveling these days. I just took a look, and it seems there's been a lot going on here.

I think it is useful to have both StatisticalModel and Distribution, with the latter being a sub-type of the former.

@johnmyleswhite: are there any differences (in terms of programming interface) between generative models and discriminative models? If there are no clear distinctions in this respect, I think we don't have to introduce GenerativeModel and DiscriminativeModel.

johnmyleswhite commented 10 years ago

The main difference I have in mind is the existence of unconditional rand: rand(d) should work for every distribution object, but you'd have to type something like rand(d, X) for non-generative models.
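A minimal sketch of that difference, written in current Julia syntax rather than the syntax of the time. The types ToyNormal and ToyLinearModel are hypothetical stand-ins invented for illustration; they are not part of Distributions.jl or Stats.

```julia
using Random

# Hypothetical toy types, assumed for illustration only.
struct ToyNormal            # a distribution: fully generative
    μ::Float64
    σ::Float64
end

struct ToyLinearModel       # a discriminative model: describes y only given X
    β::Vector{Float64}
    σ::Float64
end

# Unconditional sampling: the distribution object is all that is needed.
Base.rand(rng::AbstractRNG, d::ToyNormal) = d.μ + d.σ * randn(rng)
Base.rand(d::ToyNormal) = rand(Random.default_rng(), d)

# Conditional sampling: the covariates X must be supplied by the caller.
function Base.rand(rng::AbstractRNG, m::ToyLinearModel, X::AbstractMatrix)
    return X * m.β .+ m.σ .* randn(rng, size(X, 1))
end
Base.rand(m::ToyLinearModel, X::AbstractMatrix) = rand(Random.default_rng(), m, X)

d = ToyNormal(0.0, 1.0)
m = ToyLinearModel([1.0, -2.0], 0.5)
X = [1.0 0.0; 0.0 1.0; 1.0 1.0]

y0 = rand(d)       # one draw, no inputs required
y = rand(m, X)     # one draw per row of X
```

In this sketch there is deliberately no rand(m) method for the linear model, which is the interface gap a GenerativeModel alias would be documenting.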

May not be worth having.

andreasnoack commented 10 years ago

@lindahua Letting Distribution be a subtype of StatisticalModel is not possible with my proposal. I introduced StatisticalModel in Stats in order to define abstract statistical methods upstream. Hence the present StatisticalModel in Stats is really a fit, i.e. statistical model + data. We could rename StatisticalModel to StatisticalFit, but I wonder if it is important in practice to introduce StatisticalModel as a supertype of Distribution.

My thought was that you can use probability distributions for many things that are not statistical models, and therefore it might be better to have a clean Distributions package without statistical stuff. The cost is that Distribution cannot be a subtype of StatisticalModel.
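One way to picture the "fit = statistical model + data" reading is the sketch below, in current Julia syntax. ToyExp, logpdf and this StatisticalFit struct are hypothetical names assumed for illustration, not the actual definitions in Stats or Distributions.

```julia
# Hypothetical toy types, assumed for illustration only.

# "Distributions side": a pure probability object that knows nothing about data.
struct ToyExp
    rate::Float64
end
logpdf(d::ToyExp, x::Real) = log(d.rate) - d.rate * x

# "Stats side": a fit pairs a model with the data it was applied to.
struct StatisticalFit{M}
    model::M
    data::Vector{Float64}
end

# Statistical functionality is defined on the fit, not on the distribution.
loglikelihood(f::StatisticalFit) = sum(x -> logpdf(f.model, x), f.data)

f = StatisticalFit(ToyExp(1.0), [1.0, 2.0])
ll = loglikelihood(f)
```

Under this layout the dependency runs one way only: the code on the "Stats side" uses ToyExp, while ToyExp needs nothing from it.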

simonbyrne commented 10 years ago

I'm not so keen on making Distribution a subtype of StatisticalModel either, as I think Distributions.jl should be more-or-less standalone.

One option would be to define a parametric type, say ParametricModel{D}, which could wrap a Distribution type into a model. We could also define constrained types, for when some parameters are known or otherwise restricted.
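A rough sketch of what such a wrapper might look like, in current Julia syntax. Only the name ParametricModel{D} comes from the comment above; ToyDistribution, ToyExponential and the fit method are assumptions made up for this example.

```julia
# Hypothetical toy types, assumed for illustration only.
abstract type ToyDistribution end

struct ToyExponential <: ToyDistribution
    rate::Float64
end

# The model wraps a distribution *type* (a family), not a particular instance.
struct ParametricModel{D<:ToyDistribution} end
ParametricModel(::Type{D}) where {D<:ToyDistribution} = ParametricModel{D}()

# Estimation lives on the model; the result is a plain distribution again.
# For the exponential family the maximum likelihood rate is 1 / mean(x).
function fit(::ParametricModel{ToyExponential}, x::AbstractVector{<:Real})
    return ToyExponential(length(x) / sum(x))
end

model = ParametricModel(ToyExponential)
fitted = fit(model, [1.0, 2.0, 3.0])
```

The point of the wrapper in this sketch is that ToyExponential itself never needs to know about fitting, so the distribution types can stay standalone.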

johnmyleswhite commented 10 years ago

Let me argue for the potential virtue of a larger hierarchy of types. The main goal I have in mind is ensuring that all packages that involve statistical modeling of some sort adopt a standard interface. The place you end up in the hierarchy then tells you how much of the interface you're committed to implementing.

For instance, I'd imagine something like StatisticalModel > ProbabilisticModel > Distribution as a descent of types with progressively greater expectations.

In this conception, a StatisticalModel guarantees that you can call fit and predict and possibly something like cost. This would let you put SVMs into the hierarchy without having to force probabilistic interpretations on them.

Below, you'd have models with explicit probabilistic interpretations, like logistic regression, that implement things like loglikelihood in addition to predict, fit, etc.

Finally, you'd have models with full distributional interpretations, like Ising models or the gamma distribution, that implement things like rand.

Getting this kind of hierarchy of interfaces right is very tricky, but I think it would make Julia a really powerful language for writing out abstract formulations of methods that apply to all probabilistic models, for example.
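The proposed descent of types could be sketched roughly as follows, using current Julia syntax. The three abstract type names come from the comment above; ToyBernoulli and the aic helper are hypothetical, and defining Distribution locally assumes Distributions.jl is not loaded in the same session.

```julia
# Hypothetical sketch; each level promises more of the interface.
abstract type StatisticalModel end                        # fit, predict
abstract type ProbabilisticModel <: StatisticalModel end  # adds loglikelihood
abstract type Distribution <: ProbabilisticModel end      # adds rand

# Generic function with no methods yet; subtypes add their own.
function loglikelihood end

# A toy distribution at the bottom of the hierarchy.
struct ToyBernoulli <: Distribution
    p::Float64
end

loglikelihood(d::ToyBernoulli, xs) = sum(x -> x ? log(d.p) : log1p(-d.p), xs)
Base.rand(d::ToyBernoulli) = rand() < d.p

# An abstract method written once against ProbabilisticModel; it works for
# anything below that level, distributions included. k is the parameter count.
aic(m::ProbabilisticModel, xs, k::Integer) = 2k - 2 * loglikelihood(m, xs)

b = ToyBernoulli(0.5)
score = aic(b, [true, false], 1)
```

Something like an SVM would subtype StatisticalModel directly in this sketch, so calling aic on it would fail at dispatch rather than requiring a forced probabilistic interpretation.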

lindahua commented 10 years ago

In the long run, it will be useful to develop an interface framework such that distributions and other sorts of probabilistic models can be used through a consistent API.

The example I have in mind is mixture models, where each component can be a distribution or something else (e.g. a logistic regression model). Having Distribution as a sub-type of StatisticalModel is one way to achieve this, although to me ProbabilisticModel is actually a more fitting name for this purpose.

I am open to other suggestions that may help to achieve similar goals.

johnmyleswhite commented 10 years ago

I like ProbabilisticModel a lot as a name. I also think we should have something that occupies a position above that in the type system, which deals with cases like multidimensional scaling and SVMs.

andreasnoack commented 10 years ago

Does it add functionality to include the distributions in that hierarchy? It might have some aesthetic qualities, but would you be able to take advantage of the hierarchy relative to having the abstract statistical methods separated out? The cost of your type model is that Distributions cannot be atomic.

Disagreements over design and naming might also be more frequent if all statistical/ML infrastructure is combined within the Distributions package. There are different literatures and it might be a lot of work to figure out a type model that fits all needs.

johnmyleswhite commented 10 years ago

But Distributions isn't going to be atomic. Stats and Distributions are going to be merged as soon as we can cache code for packages.

We can handle the work of providing a sufficiently expressive framework incrementally. For now we'll use something like the family of methods that R provides: predict, fit. As conflicts come up, we can decide whether to modify things or simply say that the framework can't support them.

andreasnoack commented 10 years ago

"Stats and Distributions are going to be merged as soon as we can cache code for packages."

I must have missed the thread in which that was decided. Can you point me to it, just to avoid unnecessary repetition of arguments?

johnmyleswhite commented 10 years ago

We largely agreed on this in https://github.com/JuliaLang/julia/issues/4168, although we held off because of licensing issues. If there are strong objections we can of course not go that route.

lindahua commented 9 years ago

The scope of this package seems to have been quite stable.

Please feel free to reopen if you feel that we need further discussion.
