Closed: lindahua closed this issue 10 years ago
Not sure if it would be helpful, but it may make sense to make Distributions a meta-package and split the different families of distributions into separate packages. That way, if another package only needs to depend on univariates, it can require just that portion. This may be more long-term and obviously involves a lot of work, so just a thought.
I'm happy with this proposed division.
Maybe this can be considered a disagreement on the meaning of the name `Stats`. Is a `Stats` package about calculating quantities (statistics) from data only, or should `Stats` be a package for doing statistical inference? The latter is not really feasible without probability distributions, and therefore @lindahua's proposal is that the package for statistical inference is `Distributions`, which might not be the most expected name. I vote for having statistical inference tools in `Stats`, but you already know that. Let's see if we can get other people's opinions.
@andreasnoackjensen: I fully agree with your taste in naming the overarching package `Stats`. My agreement with @lindahua's proposal is based entirely on his desire to keep libraries small and orthogonal.
Although it might be even worse, I would be happy renaming the current `Stats` to `StatsBasic` and then having `Stats` load both `StatsBasic` and `Distributions`. In the end, I do want us to provide a simple monolithic package that pulls in all of this along with `GLM`. I'm just trying to respect the desire of others to keep each section of statistical functionality compartmentalized, so that they can avoid loading the big monolithic package.
My proposal is to keep this package as a minimal one that provides basic support for statistics (instead of covering everything related to stats). I do agree that the choice of `Stats` as the package name was unfortunate and tends to cause confusion, as people may think that this package covers a much broader scope.
I basically agree with John's plan to rename this package to something more appropriate (personally, I think `StatsBase` would be slightly better than `StatsBasic`, though) and to have a meta-package called `Statistics` that includes several relevant packages.
I strongly believe that having a minimal common core in a package (whatever we name it) would be beneficial in the long run as the ecosystem evolves.
cc: @StefanKarpinski @ViralBShah
I would also like to add that, in the current state, importing a big package is very time-consuming. On my MacBook Pro (with a pretty powerful Core i7), it takes 4.5 seconds to import Distributions, while it takes 0.5 seconds to import Stats.
There are a lot of applications that require just a small bit of the statistical functionality (e.g. `sample`). For such applications, I don't feel like loading the entire Distributions module. That's one of the reasons that I propose to move `sample` to this package.
On the other hand, this package is also a good place to put common names that other packages can import & extend.
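The "common names" idea above can be illustrated with a minimal sketch. The module names `CoreSketch` and `ModelSketch` and the type `MeanModel` are hypothetical and stand in for a small core package and a downstream package; this is not the actual code of any package discussed here.

```julia
# Hypothetical core package: it only declares the shared generic function.
module CoreSketch
export fit
function fit end   # a generic function with no methods yet
end

# Hypothetical downstream package: it imports the shared name and adds a method.
module ModelSketch
import ..CoreSketch: fit

struct MeanModel
    value::Float64
end

# Extend the core function for this package's own type.
fit(::Type{MeanModel}, data::AbstractVector{<:Real}) = MeanModel(sum(data) / length(data))
end

using .CoreSketch
m = fit(ModelSketch.MeanModel, [1.0, 2.0, 3.0])
```

Because both packages extend the same function object, callers only need the lightweight core package to get the name, and no package has to depend on another heavy one just to share a verb.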
Is there anyone opposed to renaming Stats to StatsBase and then adding a new Stats metapackage? I'd like to make this change sooner rather than later since it will break so many people's systems.
From my perspective, the only bad thing about the Stats metapackage is that it won't work quite the way I'd like, since it's quite hard to inject `using Foo` calls into a caller's module.
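One way a metapackage can approximate this is to re-export names by hand instead of injecting `using` calls into the caller. The sketch below uses local stand-in modules (`StatsBaseSketch`, `DistributionsSketch`, `StatsSketch`) with made-up functions; it is an assumption about how such a wrapper could look, not how the packages in this thread are implemented.

```julia
# Stand-in for a small core package (hypothetical contents).
module StatsBaseSketch
export midrange
midrange(xs) = (minimum(xs) + maximum(xs)) / 2
end

# Stand-in for a heavier package (hypothetical contents).
module DistributionsSketch
export uniform_pdf
uniform_pdf(x) = 0 <= x <= 1 ? 1.0 : 0.0
end

# Stand-in metapackage: loads both and re-exports their names explicitly.
module StatsSketch
using ..StatsBaseSketch, ..DistributionsSketch
export midrange, uniform_pdf
end

# A single `using` now brings in names from both underlying modules.
using .StatsSketch
a = midrange([1, 5])
b = uniform_pdf(0.5)
```

The cost is that the export list has to be kept in sync manually; the Reexport.jl package provides an `@reexport` macro that automates this pattern.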
I'm still not 100% clear on the role of Stats/StatsBase, so I'm having a hard time telling if the name is good or not.
StatsBase would only contain "simple" statistical calculations that don't depend on Distributions, DataArrays, DataFrames, etc. Things like `mode`, `quantile`, etc.
Stats would offer a "full-fledged" statistical toolkit including StatsBase, Distributions, DataArrays, DataFrames, GLM and other packages that we bless. Essentially Stats would define our default packages for statistics.
I'm just wondering if these shouldn't go into Base or Distributions or something. It seems like not a ton of stuff.
@andreasnoackjensen Are you ok with the idea of `StatsBase` for basic support and `Stats.jl` for providing full-fledged statistics?
The sooner we move forward, the better. Downstream packages will probably rely on this.
Yes. Let's do that. It is time to settle this one. The stuff in `Stats` can go into `StatsBase` for now, and then we can consider whether it can be transferred to Base later on, as @StefanKarpinski proposed.
`RegressionsModel` and `StatisticalModel` or something similar should then be part of the grand type system of Distributions, together with the generic functions in statsmodels.
Finally, the loading time of `Distributions` has been a concern. I think that we should consider having a `DistributionsPlatin` or `ExoticDistributions` package and then shrinking `Distributions` to the basic stuff. Much of statistics can be done with just the normal and χ-squared distributions.
What's the cutoff between Distributions and ExoticDistributions? Is the worry that there will be too much code for computing the cdf, pdf, etc. of these exotic distributions?
In order to do even the most basic statistics you'll have to load `Distributions`, which is quite slow to load. If the loading time can be brought down by moving `inversewishart` and friends to the platinum (not the Danish platin) edition, then a split might be worth it. The loading time is not a big annoyance to me, but it is slow, and @lindahua also seemed to be considering the slow loading time when moving some stuff into `Stats`.
My worry about trying to split Distributions and ExoticDistributions is that the dividing line is going to be hard to agree on. For the work I do, the χ-squared distribution isn't relevant, but the Dirichlet distribution is essential. I suspect people with frequentist statistical backgrounds would have the exact opposite preferences.
Given that we're already precompiling Base Julia, I don't think we're all that far off from package loading not being quite so slow. If splitting Distributions is just a matter of startup latency, I think it's definitely a premature optimization.
Fine with me. Then let's just forget about splitting Distributions and focus on the `Stats` reorganisation.
Ok. Are we all happy with the names `Stats` and `StatsBase`?
I don't know why, but `StatsCore` somehow appeals to me more. I can't put my finger on a good reason, though.
I would go with `StatsBase`, drawing a parallel with Julia's `Base`.
Well, let's pick something tomorrow and make this happen. Then we can create the new Stats package that will take 60 seconds to load as incentive to get more static compilation happening.
I also like `StatsCore` or `StatsCommon` over `StatsBase`, but I like that `StatsBase` matches Julia's `Base`. I also worry that segregating the exotic distributions into their own package contributes to the bias towards using simple distributions just because of convenience, similar to how most chalkboard examples use the Gaussian distribution because it is familiar.
+1 for StatsBase. The other two names actually sound good -- but not as accurate in terms of summarizing the contents of this package.
`StatsBase` sounds good.
I'm going to move ahead and do this tonight. I'll send out an e-mail with a warning once it happens.
This is done. It seemed best to make all changes at once, so I renamed the module, GitHub repo and the METADATA entry.
Should we close this?
I think this has been settled.
Currently, Stats.jl and Distributions.jl are two major packages of the statistics ecosystem. Since these two packages are closely related, questions will arise about what goes where.
Here, I just propose a guideline. We can then discuss & revise it.
Generally, a function should go to Distributions.jl when it is directly pertinent to a specific kind of distribution; otherwise it should go to Stats.jl or another more appropriate package.
Under this guideline, the function to sample directly from a population (without involving any distribution types) should be moved to Stats.jl.
There are a lot of applications that only need to deal with sample statistics or just do some simple sampling (e.g. with/without replacement). Such programs would then not have to import the Distributions package, which is quite heavy now.
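As a rough illustration of how little such simple sampling needs, here is a minimal sketch that uses only Julia's `Random` standard library. The helper names `sample_with_replacement` and `sample_without_replacement` are hypothetical and are not the `sample` function discussed in this issue.

```julia
using Random  # standard library; no distribution types involved

# Hypothetical helper: draw n items with replacement.
sample_with_replacement(rng, population, n) = rand(rng, population, n)

# Hypothetical helper: draw n items without replacement by taking
# the first n entries of a random permutation of the indices.
function sample_without_replacement(rng, population, n)
    n <= length(population) ||
        throw(ArgumentError("cannot draw more items than the population holds"))
    return population[randperm(rng, length(population))[1:n]]
end

rng = MersenneTwister(42)      # seeded generator for reproducibility
population = collect(1:10)
with_rep = sample_with_replacement(rng, population, 5)
without_rep = sample_without_replacement(rng, population, 5)
```

Neither helper touches a distribution type, which is the argument for keeping this kind of function in the lighter package.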
cc: @johnmyleswhite @dmbates @andreasnoackjensen @simonbyrne