andreasnoack closed this issue 9 years ago
That division seems quite sensible to me. Fortunately, moving that stuff over should be fairly trivial, no?
I agree with that division: `estimate` and `loglikelihood` should be methods of `StatisticalModel`.
That said, I thought we had previously said we wanted Stats and Distributions to merge in the long run, but that's probably a ways away.
The original reason we kept `kde` here is that we had thought we'd let you specify a distribution to use as the kernel. I think @simonbyrne has thought a bit about how to do that, but it's not trivial.
Yeah, I have some code somewhere, I'll see if I can find it.
I think keeping Stats and Distributions separate makes sense for the time being. I'm not sure what the best layout is, but I think making a distinction between `Distribution` and `StatisticalModel` is a good idea in the long run, even if it seems a bit excessive at the moment.
I think it's a really important distinction: even now, I'd say that our discriminative models (like GLMs) are examples of `StatisticalModel`, but not of `Distribution`. I've often thought we might include a type alias such as `typealias GenerativeModel Distribution`.
If they're the same thing, then why not just use the name Distribution everywhere instead of having a type alias?
I have been traveling these days. I just took a look, and it seems there's been a lot going on here.
I think it is useful to have both `StatisticalModel` and `Distribution`, with the latter being a subtype of the former.
@johnmyleswhite: are there any differences (in terms of programming interface) between generative models and discriminative models? If there are no clear distinctions in this respect, I think we don't have to introduce `GenerativeModel` and `DiscriminativeModel`.
The main difference I have in mind is the existence of unconditional `rand`: `rand(d)` should work for every distribution object, but you'd have to type something like `rand(d, X)` for non-generative models.
May not be worth having.
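To make the `rand(d)` versus `rand(d, X)` distinction concrete, here is a minimal sketch in current Julia syntax. The types `ToyNormal` and `ToyLinearModel` are hypothetical stand-ins invented for illustration; they are not part of Distributions, Stats, or GLM.

```julia
# Hypothetical toy types, for illustration only.

# Generative: fully specifies how to draw a sample.
struct ToyNormal
    mu::Float64
    sigma::Float64
end

# Discriminative: only describes the response given inputs X.
struct ToyLinearModel
    beta::Vector{Float64}
    sigma::Float64
end

# Unconditional sampling: no inputs are needed.
Base.rand(d::ToyNormal) = d.mu + d.sigma * randn()

# Conditional sampling: one response per row of the design matrix X.
function Base.rand(m::ToyLinearModel, X::AbstractMatrix)
    return X * m.beta .+ m.sigma .* randn(size(X, 1))
end

d = ToyNormal(0.0, 1.0)
x = rand(d)                      # a single Float64 draw

m = ToyLinearModel([1.0, 2.0], 0.1)
X = [1.0 0.0; 0.0 1.0; 1.0 1.0]
y = rand(m, X)                   # a vector with one draw per row of X
```

Under this reading, the only interface difference is whether sampling needs covariates as an extra argument.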
@lindahua Letting `Distribution` be a subtype of `StatisticalModel` is not possible with my proposal. I introduced `StatisticalModel` in `Stats` in order to define abstract statistical methods upstream. Hence the present `StatisticalModel` in `Stats` is really a fit = statistical model + data. We could change the name of `StatisticalModel` to `StatisticalFit`, but I wonder if it is important in practice to introduce `StatisticalModel` as a supertype of `Distribution`.
My thought was that you can use probability distributions for many things that are not statistical models and therefore it might be better to have a clean `Distributions` package without statistical stuff. The cost is that `Distribution` cannot be a subtype of `StatisticalModel`.
I'm not so keen on making `Distribution` a subtype of `StatisticalModel` either, as I think Distributions.jl should be more-or-less standalone.
One option would be to define a parametric type, say `ParametricModel{D}`, which could wrap a `Distribution` type into a model. We could also define constrained types, for when some parameters are known or otherwise restricted.
Let me argue for the potential virtue of a larger hierarchy of types. The main goal I have in mind is ensuring that all packages that involve statistical modeling of some sort adopt a standard interface. The place you end up in the hierarchy then tells you how much of the interface you're committed to implementing.
For instance, I'd imagine something like `StatisticalModel > ProbabilisticModel > Distribution` as a descent of types with progressively greater expectations.
In this conception, a `StatisticalModel` guarantees that you can call `fit` and `predict` and possibly something like `cost`. This would let you put SVM's into a hierarchy without having to force probabilistic interpretations on them.
Below, you'd have models with explicit probabilistic interpretations, like logistic regression, that implement things like `loglikelihood` in addition to `predict`, `fit`, etc.
Finally, you'd have models with full distributional interpretations, like Ising models or the gamma distribution, that implement things like `rand`.
Getting this kind of hierarchy of interfaces right is very tricky, but I think it would make Julia a really powerful language for writing out abstract formulations of methods that apply to all probabilistic models, for example.
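The proposed descent of types could be sketched roughly as follows, in current Julia syntax. The abstract type names come from the discussion above; `ToyBernoulli` and the `aic` helper are hypothetical and only show how a method written once against `ProbabilisticModel` would apply to anything lower in the hierarchy.

```julia
# Hypothetical sketch of the proposed hierarchy; not an existing package API.
abstract type StatisticalModel end                        # expected: fit, predict
abstract type ProbabilisticModel <: StatisticalModel end  # additionally: loglikelihood
abstract type Distribution <: ProbabilisticModel end      # additionally: rand

# A toy model with a full distributional interpretation.
struct ToyBernoulli <: Distribution
    p::Float64
end

# Log-likelihood of a collection of Bool observations.
loglikelihood(d::ToyBernoulli, xs) = sum(x ? log(d.p) : log(1 - d.p) for x in xs)

# Unconditional sampling, available because this is a Distribution.
Base.rand(d::ToyBernoulli) = rand() < d.p

# Written once against the abstract type: it relies only on loglikelihood,
# so it works for any ProbabilisticModel that implements that method.
function aic(m::ProbabilisticModel, data, nparams::Integer)
    return 2 * nparams - 2 * loglikelihood(m, data)
end

b = ToyBernoulli(0.5)
score = aic(b, [true, false], 1)
```

A model such as an SVM would sit directly under `StatisticalModel` in this scheme and would simply not be accepted by `aic`.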
In the long run, it will be useful to develop an interface framework such that distributions and other sorts of probabilistic models can be used through a consistent API.
The example I have in mind is mixture models, where each component can be a distribution or something else (e.g. a logistic regression model). Having `Distribution` as a subtype of `StatisticalModel` is one way to achieve this; to me, `ProbabilisticModel` is actually a more proper name for that purpose.
I am open to other suggestions that may help to achieve similar goals.
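As a rough illustration of the mixture-model use case, the sketch below (current Julia syntax) lets a mixture hold any component that implements a shared method. The `density` function and the `ToyGaussian`, `ToyUniform`, and `Mixture` types are hypothetical names chosen for this example; both components here are simple densities rather than something like a logistic regression.

```julia
# Hypothetical sketch; names are illustrative only.
abstract type ProbabilisticModel end

struct ToyGaussian <: ProbabilisticModel
    mu::Float64
    sigma::Float64
end

struct ToyUniform <: ProbabilisticModel
    lo::Float64
    hi::Float64
end

# The shared interface every component is assumed to implement.
density(c::ToyGaussian, x) = exp(-0.5 * ((x - c.mu) / c.sigma)^2) / (c.sigma * sqrt(2pi))
density(c::ToyUniform, x) = c.lo <= x <= c.hi ? 1 / (c.hi - c.lo) : 0.0

# The mixture only needs its components to support density.
struct Mixture <: ProbabilisticModel
    components::Vector{ProbabilisticModel}
    weights::Vector{Float64}
end

# Weighted sum of the component densities.
function density(m::Mixture, x)
    return sum(w * density(c, x) for (c, w) in zip(m.components, m.weights))
end

mix = Mixture(ProbabilisticModel[ToyGaussian(0.0, 1.0), ToyUniform(0.0, 2.0)], [0.5, 0.5])
value = density(mix, 0.0)
```

Because `Mixture` is itself a `ProbabilisticModel` here, mixtures could in principle be nested as components of other mixtures.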
I like `ProbabilisticModel` a lot as a name. I also think we should have something that occupies a position above that in the type system, which deals with cases like multidimensional scaling and SVM's.
Does it add functionality to include the distributions in that hierarchy? It might have some aesthetic qualities, but would you be able to take advantage of the hierarchy relative to having the abstract statistical methods separated out? The cost of your type model is that Distributions cannot be atomic.
Disagreements over design and naming might also be more frequent if all statistical/ML infrastructure is combined within the Distributions package. There are different literatures and it might be a lot of work to figure out a type model that fits all needs.
But Distributions isn't going to be atomic. Stats and Distributions are going to be merged as soon as we can cache code for packages.
We can handle the work of providing a sufficiently expressive framework incrementally. For now we'll use something like the family of methods that R provides: predict, fit. As conflicts come up, we can decide whether to modify things or simply say that the framework can't support them.
> Stats and Distributions are going to be merged as soon as we can cache code for packages.
I have missed the thread in which it was decided. Can you point me to it, just to avoid unnecessary repetition of arguments?
We largely agreed on this in https://github.com/JuliaLang/julia/issues/4168, although we held off because of licensing issues. If there are strong objections we can of course not go that route.
The scope of this package seems to have been quite stable.
Please feel free to reopen if you feel that we need further discussion.
I was about to import `loglikelihood` from Stats to avoid the conflict, but I thought it might be better to have a discussion about the scope of this package. In particular, I would like to discuss where to draw the line between Stats and Distributions. For me, the natural ordering here is to let Stats depend on Distributions and have all functionality related to statistical analysis in either Stats or a package that depends on Stats, i.e. not have statistical functionality in Distributions at all. As a consequence I think such things as `estimate`, `loglikelihood` and `kde` belong in Stats instead of Distributions. Otherwise we might just abandon Stats and pull all that stuff into Distributions. The current state is kind of messy. Please share your thoughts.