The boundary between Stats.jl & Distributions.jl

lindahua commented 10 years ago

Currently, Stats.jl and Distributions.jl are two major packages of the statistics ecosystem. Since the two packages are closely related, questions will arise about what goes where.

Here, I just propose a guideline. We can then discuss & revise it.

The role of Stats.jl is to provide basic support for statistical computation, which may include the following aspects:
- computing statistics (e.g. counts, means, correlations, histograms) over samples
- sampling from a population
- empirical estimation (e.g. kernel density estimation)
- other computational tools that can be used to support the implementation of statistics-related functions

The role of Distributions.jl is to provide a hierarchical type system of distributions, which may include:
- construction of distributions
- computing statistics over a distribution (e.g. the mean of a distribution)
- sampling from a distribution
- estimating a distribution (e.g. MLE and MAP)
- computing posterior distributions

Generally, a function should go to Distributions.jl when it is directly pertinent to a specific kind of distribution; otherwise it should go to Stats.jl or another more appropriate package.

Under this guideline, the functions that sample directly from a population (without involving any distribution types) should be moved to Stats.jl.

There are a lot of applications that only need to deal with sample statistics or do some simple sampling (e.g. with/without replacement). Such programs would then not have to import the Distributions package, which is quite heavy now.
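As a rough illustration of the proposed split, here is a minimal sketch. It assumes the current StatsBase.jl (the name Stats.jl was given later in this thread) and Distributions.jl are installed; the variable names and data are made up for the example.

```julia
using Random
using Statistics      # mean over a sample (standard library)
using StatsBase       # sample, countmap: no distribution types involved
using Distributions   # Normal, fit_mle: functions tied to distribution types

Random.seed!(1)

# Sample side: everything operates on plain arrays.
data = [2.0, 4.0, 4.0, 5.0, 7.0]
sample_mean = mean(data)          # statistic computed over a sample
counts = countmap(data)           # value => number of occurrences

population = collect(1:10)
draw = sample(population, 3; replace=false)   # sampling without replacement

# Distribution side: everything dispatches on a distribution type.
d = Normal(0.0, 1.0)
dist_mean = mean(d)               # statistic of the distribution itself
x = rand(d, 1000)                 # sampling from a distribution
fitted = fit_mle(Normal, x)       # estimating a distribution by MLE
```

A program that needs only the first half would load StatsBase alone and skip Distributions.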

cc: @johnmyleswhite @dmbates @andreasnoackjensen @simonbyrne

quinnj commented 10 years ago

Not sure if it would be helpful, but it may make sense to make Distributions a meta-package and split the different families of distributions into separate packages. That way, if another package only needs to depend on univariates, it can require just that portion. This may be more long-term and obviously involves a lot of work, so just a thought.

johnmyleswhite commented 10 years ago

I'm happy with this proposed division.

andreasnoack commented 10 years ago

Maybe this can be considered a disagreement on the meaning of the name Stats. Is a Stats package about calculating quantities (statistics) from data only, or should Stats be a package for doing statistical inference? The latter is not really feasible without probability distributions, and therefore @lindahua's proposal is that the package for statistical inference is Distributions, which might not be the most expected name. I vote for having statistical inference tools in Stats, but you already know that. Let's see if we can get other people's opinions.

johnmyleswhite commented 10 years ago

@andreasnoackjensen: I fully agree with your taste in naming the overarching package Stats. My agreement with @lindahua's proposal is based entirely on his desire to keep libraries small and orthogonal.

Although it might be even worse, I would be happy renaming the current Stats to StatsBasic and then having Stats load both StatsBasic and Distributions. In the end, I do want us to provide a simple monolithic package that pulls in all of this along with GLM. I'm just trying to respect the desire of others to keep each section of statistical functionality compartmentalized, so that they can avoid loading the big monolithic package.

lindahua commented 10 years ago

My proposal is to keep this package as a minimal one that provides basic support for statistics (instead of covering everything related to stats). I do agree that the choice of Stats as the package name was unfortunate and tends to cause confusion, as people may think that this package covers a much broader scope.

I basically agree with John's plan to rename this package to something more appropriate (personally, I think StatsBase would be slightly better than StatsBasic though) and to have a meta-package called Statistics that includes several relevant packages.

I strongly believe that having a minimal common core in a package (whatever we name it) would be beneficial in the long run as the ecosystem evolves.

cc: @StefanKarpinski @ViralBShah

lindahua commented 10 years ago

I would also like to add that, in the current state, importing a big package is very time-consuming. On my MacBook Pro (with a pretty powerful Core i7), it takes 4.5 seconds to import Distributions, while it takes 0.5 seconds to import Stats.

There are a lot of applications that require just a small bit of the statistical functionality (e.g. sample). For such applications, I don't feel like loading the entire Distributions module. That's one of the reasons I propose to move sample to this package.

On the other hand, this package is also a good place to put common names that other packages can import & extend.
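The "common names" idea can be sketched in plain Julia without any packages. Everything below is hypothetical and only illustrates the pattern: a small base module declares a generic function, and a downstream module imports that name and adds a method for its own type. The names BaseNamesSketch, loglik_sketch and CoinModel are invented for this example and do not come from Stats.jl.

```julia
# Hypothetical minimal "base" module: it only declares the shared name.
module BaseNamesSketch
export loglik_sketch
# Generic function with no methods yet; downstream modules add them.
function loglik_sketch end
end

# Hypothetical downstream module that extends the shared name.
module ModelA
import ..BaseNamesSketch: loglik_sketch

struct CoinModel
    p::Float64   # probability of heads
end

# Log-likelihood of observing `heads` heads and `tails` tails.
loglik_sketch(m::CoinModel, heads::Int, tails::Int) =
    heads * log(m.p) + tails * log(1 - m.p)
end

using .BaseNamesSketch
ll = loglik_sketch(ModelA.CoinModel(0.5), 1, 1)
```

Because both modules refer to the same function object, a user who loads only the base module and one downstream module gets a single, non-conflicting loglik_sketch.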

johnmyleswhite commented 10 years ago

Is there anyone opposed to renaming Stats to StatsBase and then adding a new Stats metapackage? I'd like to make this change sooner rather than later since it will break so many people's systems.

From my perspective, the only bad thing about the Stats metapackage is that it won't work quite the way I'd like, since it's quite hard to inject using Foo calls into a caller's module.

StefanKarpinski commented 10 years ago

I'm still not 100% clear on the role of Stats/StatsBase, so I'm having a hard time telling if the name is good or not.

johnmyleswhite commented 10 years ago

StatsBase would only contain "simple" statistical calculations that don't depend on Distributions, DataArrays, DataFrames, etc... Things like mode, quantile, etc.

Stats would offer a "full-fledged" statistical toolkit including StatsBase, Distributions, DataArrays, DataFrames, GLM and other packages that we bless. Essentially Stats would define our default packages for statistics.
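One way such an umbrella package could expose its components is by re-exporting their names. The sketch below uses only plain Julia modules with invented names (CoreSketch, DistSketch, UmbrellaSketch) standing in for StatsBase, Distributions and Stats; it is an assumption about how a metapackage might be wired, not how Stats was actually implemented.

```julia
# Stand-in for a small core package.
module CoreSketch
export midrange_sketch
midrange_sketch(x) = (minimum(x) + maximum(x)) / 2
end

# Stand-in for a package built around distribution types.
module DistSketch
export UniformSketch, mean_sketch
struct UniformSketch
    a::Float64
    b::Float64
end
mean_sketch(d::UniformSketch) = (d.a + d.b) / 2
end

# Stand-in for the umbrella package: load both and re-export their names.
module UmbrellaSketch
using ..CoreSketch, ..DistSketch
export midrange_sketch, UniformSketch, mean_sketch
end

using .UmbrellaSketch
mid = midrange_sketch([1, 5, 3])
mu = mean_sketch(UniformSketch(0.0, 2.0))
```

The cost of this approach is that the umbrella module has to list every re-exported name by hand, or generate the list with a macro.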

StefanKarpinski commented 10 years ago

I'm just wondering if these shouldn't go into Base or Distributions or something. It seems like not a ton of stuff.

lindahua commented 10 years ago

@andreasnoackjensen Are you ok with the idea of StatsBase for basic support and Stats.jl for providing full-fledged statistics?

The sooner we move forward, the better. Downstream packages will probably rely on this.

andreasnoack commented 10 years ago

Yes. Let's do that. It is time to settle this one. The stuff in Stats can go into StatsBase for now, and then we can consider whether it can be transferred to Base later on, as @StefanKarpinski proposed.

RegressionsModel and StatisticalModel or something similar should then be part of the grand type system of Distributions together with the generic functions in statsmodels.

Finally, the loading time of Distributions has been a concern. I think that we should consider having a DistributionsPlatin or ExoticDistributions package and then shrinking Distributions to the basic stuff. Much of statistics can be done with the normal and χ-squared.

StefanKarpinski commented 10 years ago

What's the cutoff between Distributions and ExoticDistributions? Is the worry that there will be too much code for computing the cdf, pdf, etc. of these exotic distributions?

andreasnoack commented 10 years ago

In order to do even the most basic statistics you'll have to load Distributions, which is quite slow to load. If the loading time can be brought down by moving inversewishart and friends to the platinum (not the Danish platin) edition, then a split might be worth it. The loading time is not a big annoyance to me, but it is slow, and @lindahua also seemed to be considering the slow loading time when moving some stuff into Stats.

johnmyleswhite commented 10 years ago

My worry about trying to split Distributions and ExoticDistributions is that the dividing line is going to be hard to agree on. For the work I do, the χ-squared distribution isn't relevant, but the Dirichlet distribution is essential. I suspect people with frequentist statistical backgrounds would have the exact opposite preferences.

StefanKarpinski commented 10 years ago

Given that we're precompiling Base Julia already, I don't think we're all that far off from package loading not being quite so slow. If splitting Distributions is just a matter of startup latency, I think it's definitely a premature optimization.

andreasnoack commented 10 years ago

Fine with me. Then let's just forget about splitting Distributions and focus on the Stats reorganisation.

johnmyleswhite commented 10 years ago

Ok. Are we all happy with the names Stats and StatsBase?

StefanKarpinski commented 10 years ago

StatsCore somehow appeals to me more, though I can't put my finger on a good reason why.

ViralBShah commented 10 years ago

I would go with StatsBase, drawing a parallel with Julia's Base.

johnmyleswhite commented 10 years ago

Well, let's pick something tomorrow and make this happen. Then we can create the new Stats package that will take 60 seconds to load, as an incentive to get more static compilation happening.

jpfairbanks commented 10 years ago

I also like StatsCore or StatsCommon over StatsBase, but I like that StatsBase matches Julia's Base. I also worry that segregating the exotic distributions into their own package contributes to the bias towards using simple distributions just because they are convenient, much as most chalkboard examples use the Gaussian distribution because it is familiar.

lindahua commented 10 years ago

+1 for StatsBase. The other two names actually sound good, but they are not as accurate in summarizing the contents of this package.

nalimilan commented 10 years ago

StatsBase sounds good.

johnmyleswhite commented 10 years ago

I'm going to move ahead and do this tonight. I'll send out an e-mail with a warning once it happens.

johnmyleswhite commented 10 years ago

This is done. It seemed best to make all changes at once, so I renamed the module, GitHub repo and the METADATA entry.

johnmyleswhite commented 10 years ago

Should we close this?

lindahua commented 10 years ago

I think this has been settled.

JuliaStats / Roadmap.jl

The boundary between Stats.jl & Distributions.jl #2