JuliaStats / StatsKit.jl

Convenience meta-package to load essential packages for statistics

Add essential packages for statistics #4

Closed nalimilan closed 5 years ago

nalimilan commented 6 years ago

This makes the package useful again.

ChrisRackauckas commented 6 years ago

ManifoldLearning.jl should be forked into JuliaStats, master tagged, and added here IMO.

andreasnoack commented 6 years ago

@nalimilan Great that you are pushing this. It will make things much more user friendly. One of the open questions is how documentation should be handled. Maybe examples of analyses that use functionality across several packages would be useful, leaving the actual API documentation to the individual packages.

> ManifoldLearning.jl should be forked into JuliaStats, master tagged, and added here IMO.

I'm not sure if it is an obvious candidate for inclusion here. I think the idea here is to cover the standard stuff that you'd see in stats courses.

ChrisRackauckas commented 6 years ago

Are manifold-based methods and t-SNE not in standard stats courses by now? I wouldn't be able to find a stats-based computational bio course without them.

nalimilan commented 6 years ago

> One of the open questions is how documentation should be handled. Maybe examples of analyses that use functionality across several packages would be useful, leaving the actual API documentation to the individual packages.

Yes, that's a difficult question. Maybe the ideal would be to have a tutorial exposing the most common features of each domain, and redirecting to packages for more details. But that's a lot of work. So maybe we can just start with links to the package's manuals on each line?

Regarding ManifoldLearning.jl, I have no idea what it is so I can't really say. One good criterion would be whether other statistical environments provide it by default.

andreasnoack commented 6 years ago

> Are manifold-based methods and t-SNE not in standard stats courses by now? I wouldn't be able to find a stats-based computational bio course without them.

Computational bio is not statistics. At least not the flavors of it that I've seen.

ChrisRackauckas commented 6 years ago

Alright, I'll leave it alone. The best solution down the line is probably to add that stuff to MultivariateStats.jl, which has the other half of the commonly used dimensionality reduction methods.

The others that come to mind for me are LOESS.jl and Bootstrap.jl. At least to me, anything further is probably "specialized" and those are sitting right on the cutoff line.

> Computational bio is not statistics. At least not the flavors of it that I've seen.

There are tons of flavors, to the point where computational/systems biology needs a word in front of it to really be descriptive.

nalimilan commented 6 years ago

Good point, I've added Bootstrap and Loess. I missed the latter because it's not listed on the website; we should update it (and remove unmaintained packages). Also, shouldn't Loess be renamed to LOESS?

rofinn commented 6 years ago

Should we include RDatasets? I know people who use that for their demos.

gragusa commented 6 years ago

There is CovarianceMatrices.jl. I am working on making it generic, but as it is, it is a nice complement to GLM.jl (in certain fields, these variances are the standard ones).

ararslan commented 6 years ago

Want Jackknife?

mkborregaard commented 6 years ago

Great list. What about MixedModels? Would be nice to have that really integrated into the ecosystem here. In ecology at least nobody seems to do a GLM without random effects these days.

mkborregaard commented 6 years ago

variance, bias and estimator must be defined in Stats packages other than Jackknife, right? Shouldn't it be extending those functions with new methods? (sorry if this is out of place)

ararslan commented 6 years ago

Jackknife doesn't export anything, so you have to call them as Jackknife.variance, etc.
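
For readers unfamiliar with what jackknife bias and variance estimates compute, here is a minimal leave-one-out sketch in plain Python. It is not the Jackknife.jl API; the function name `jackknife` and its return shape are hypothetical and chosen only for illustration.

```python
from statistics import mean


def jackknife(data, estimator):
    """Leave-one-out jackknife estimates of bias and variance.

    `data` is a list of observations and `estimator` maps a list to a number.
    Both the name and the (bias, variance) return shape are illustrative.
    """
    n = len(data)
    full_estimate = estimator(data)
    # Recompute the estimator n times, each time dropping one observation.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = sum(leave_one_out) / n
    # Standard jackknife formulas for bias and variance.
    bias = (n - 1) * (loo_mean - full_estimate)
    variance = (n - 1) / n * sum((t - loo_mean) ** 2 for t in leave_one_out)
    return bias, variance


bias, variance = jackknife([1.0, 2.0, 3.0, 4.0, 5.0], mean)
```

For the sample mean, the jackknife bias is zero and the jackknife variance coincides with the usual s²/n, which makes the mean a convenient sanity check.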

ararslan commented 6 years ago

Btw we may want to do some serious cleanup and dedicated maintenance if we're going to fully endorse all of these packages. While I think most are fine, I don't know that anybody really tends to MultivariateStats these days.

mkborregaard commented 6 years ago

Such an important package though.

ChrisRackauckas commented 6 years ago

Yeah, it's chicken and egg. I think you put it in so that it has to be maintained. FWIW it's already widely used and right now it works. Maybe it just hasn't been touched because it's working just fine. But yes,

> Such an important package though.

It has a lot of stuff in there, but at least PCA is pretty standard in most toolkits.

nalimilan commented 6 years ago

A few comments:

dmbates commented 6 years ago

I'm happy to reconcile the exported MixedModels.bootstrap function with the Bootstrap package. I'd actually forgotten that there was a Bootstrap package.

ararslan commented 6 years ago

Want MultivariateTests? It'd just have to be registered first.

nalimilan commented 6 years ago

Yeah, but why not add these tests to HypothesisTests instead?

ararslan commented 6 years ago

Yeah, I suppose they would work just fine there, good point. They were originally separate because it started as a project for my master's program. :stuck_out_tongue:

matthieugomez commented 6 years ago

I think the list should be shorter rather than longer. IMO, only packages that have proved themselves useful/popular with end-users should be in this list. Otherwise, this list may give the impression that a lot of things are "done" in Julia, which is not true and which potentially stifles innovation.

mkborregaard commented 6 years ago

Which packages above do not fulfill these criteria in your opinion?

matthieugomez commented 6 years ago

I do not really know a lot of these packages. But it just seems safer to me to start with a small list of packages, and then expand it, rather than removing existing functionalities. The very short list I have in mind would look something like CategoricalArrays, CSV, DataFrames, Distances, Distributions, StatsBase, StatsModels, GLM, and maybe Clustering, TimeSeries, MultivariateStats, HypothesisTests, MixedModels

nalimilan commented 6 years ago

So basically you object to Bootstrap, KernelDensity, Loess, Jackknife and CovarianceMatrices? Care to elaborate on why?

nalimilan commented 6 years ago

Are KDE.jl and LOESS.jl reasonably complete enough to be worth including?

nignatiadis commented 6 years ago

I think that MultipleTesting.jl should be included in the list of essential packages as well! Both the Benjamini-Hochberg procedure and the related Storey procedure are ubiquitous in high-throughput studies. R provides some of that functionality through the p.adjust function, which is probably one of the most commonly used ones. Also, MultipleTesting.jl is lightweight and would not introduce additional dependencies (and the implementations are thorough and well-tested).
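
To make concrete what a Benjamini-Hochberg adjustment does, here is a minimal sketch in plain Python. It is not the MultipleTesting.jl implementation; the function name `benjamini_hochberg` is hypothetical, and the sketch assumes the input is a list of valid p-values in [0, 1].

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Illustrative sketch only; assumes every entry is a valid p-value in [0, 1].
    """
    m = len(pvalues)
    # Visit p-values from largest to smallest so a running minimum can
    # enforce monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for k, idx in enumerate(order):
        rank = m - k  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each p-value is scaled by m divided by its ascending rank, and the running minimum keeps the adjusted values non-decreasing in the original p-values; this is the same adjustment the "BH" method of R's p.adjust applies.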

cc @juliangehring

@nalimilan For what it is worth, if by KDE.jl you mean KernelDensity.jl then whenever I needed it, it has been useful and it seems to have the (basic) required functionality (and I think a nonparametric density estimator falls within the "essential" category).

mkborregaard commented 5 years ago

Hi, I'm just curious where I can learn about the plans for this package?

nalimilan commented 5 years ago

I'm not aware of any plans besides this PR. I think we should make a decision and merge it.

andreasnoack commented 5 years ago

Let's merge what's here now. We can adjust later if needed. I'll do it tomorrow if nobody objects.

nalimilan commented 5 years ago

If anybody thinks a package should be added or removed from the list, please file a new issue.