JuliaStats / Distributions.jl

A Julia package for probability distributions and associated functions.
Other
1.1k stars 414 forks source link

Separate multivariate distributions into a separate package? #658

Open ararslan opened 7 years ago

ararslan commented 7 years ago

Discussed elsewhere, opening an issue to track it.

Since they have a slightly different interface and could benefit from different optimizations, @andreasnoack has suggested we separate multivariate distributions (and probably mixture models as well) into a separate package and keep Distributions.jl for univariate distributions.

rofinn commented 7 years ago

Do they need to be in separate packages though or would submodules suffice for now?

ararslan commented 7 years ago

Is Pkg set up to fetch submodules when updating?

rofinn commented 7 years ago

Sorry, I meant a julia submodule, just to help organizing things better before making a bunch of repos. I imagine once they're sufficiently isolated we'll be able to make Distributions.jl a wrapper package that re-exports things from the other APIs (e.g., AbstractDistributions.jl, UnivariateDistributions.jl, MultivariateDistributions.jl)

mschauer commented 7 years ago

At least for the Gaussian distribution, I would like to not separate the one and the n-dimensional case into different objects and hard-code various implementation details (isotropic or not). I'd rather give μ::T and σ or σσ' and provide implementations which regardless whether μ is a number or a vector or a fixed vector - as long as norm(σ\(x-μ))^2 is meaningful - dispatch to do the right thing. Compare with https://github.com/mschauer/Bridge.jl/blob/master/src/gaussian.jl

simonbyrne commented 7 years ago

@mschauer The mathematician in me is somewhat sympathetic to the idea, but not sure how feasible and practical it actually is. For examplecdf is only really applicable to univariate distributions (yes, it can be defined for multivariate distributions, but it is typically too difficult to compute outside of simple cases).

Can you expand on your use case? I'm not sure what your example is trying to show.

andreasnoack commented 7 years ago

I guess that we could just overload cdf for the cases that can be supported?

mschauer commented 7 years ago

My code to simulate linear stochastic processes is indifferent about the underlying type say as long as randn(T) gives a Gaussian increment.

In general, if I can write randn(T) and σ*randn(T) + μ, I would like Normal(zero(T)) and Normal(μ, σ*σ') being able to reflect the known properties the corresponding distribution. This is not a case where lets say generality contradicts usefulness.

Example:

using Bridge, StaticArrays
Ts = [Float64, Complex{Float64}, SVector{2,Float64}]
Ps = [Wiener{T}() for T in Ts]
Xs = [sample([0.0, 0.5, 1.0], P)[end] for P in Ps]

gives

 1.0 => -0.36573           
 1.0 => -0.17426-0.234315im
 1.0 => [1.07265, -0.13575]

Suddenly in https://github.com/mschauer/Bridge.jl/blob/master/src/linpro.jl#L32 I cannot rely on Distributions to give the law of the increment.

simonbyrne commented 7 years ago

Thanks, that was useful, though I'm not sure it is really feasible in the current structure, e.g. we would lose the property that Normal <: UnivariateDistribution.

What would be the downside of simply dispatching on the appropriate distribution type, e.g. Normal, MvNormal, ComplexNormal, MatrixNormal (the latter 2 don't exist yet, but could in future).

simonbyrne commented 7 years ago

Actually ComplexNormal and MatrixNormal are good counter-examples: both are actually 3-parameter distributions: ComplexNormal has a mean, covariance and a pseudo-covariance, and MatrixNormal has a mean and left and right covariance matrices.

mschauer commented 7 years ago

One would really focus on the case σ*randn(T) + μ, which for complex values makes the mixing pseudo-covariance zero.

matbesancon commented 5 years ago

I find it practical to have both in the same package, is this still an open discussion?

andreasnoack commented 5 years ago

Since nothing has happened, it should be fine to continue the discussion. While it is practical, it comes at a significant price.

  1. Many packages have Arpack as an unnecessary dependency because the multivariate distributions use PDMats which depends on Arpack.
  2. The multivariate distributions don't work with StaticArrays although it would be much better in most applications. Part of the reason is that we don't want to add another fairly compilation heavy dependency to Distributions.
mschauer commented 5 years ago

Static arrays can be supported to a large extend without depending on the StaticArray package I suppose.

matbesancon commented 5 years ago

Yes using the right abstractions could let us avoid depending on anything heavy. Lots of PRs lately have been abound correct parametrization of types to increase interop. Maybe we could do something similar for PDMats? One pain point on this is that the PDMats package defines the interface for PD matrices, but also the heavy Arpack-dependent implementation

mschauer commented 5 years ago

The PDMats dependence could be replaced with an AbstractPDMats, at the cost of users of MvNormals providing actual factorizations for the covariance matrices themselves, which are then required to define unwhiten etc, as in https://github.com/mschauer/GaussianDistributions.jl (I am sorry for linking this repo so often in our discussions, honestly it is mostly a sketch.)

matbesancon commented 5 years ago

Agreed, couldn't AbstracPDMats be defined in a separate package from the implementation? Or even better, moved to LinearAlgebra (not sure how possible this would be). Removing a heavy and not necessary dependency would greatly lighten Distributions

andreasnoack commented 5 years ago

Actually, I don't think we need a special type for this. PDness is a property of the values anyway in the same way that σ>0 is value-dependent for Normal. I think we could just have a multivariate normal type that is parameterized on Cholesky-like factor S which could be either a matrix or factorization as long as S'S=Σ.

mschauer commented 5 years ago

Yes, one has to be a bit careful, because one would either like to give S or a factorisation object for Σ, which of course informs about S, but is not S.

andreasnoack commented 5 years ago

Maybe #823 can help us here.

matbesancon commented 5 years ago

to dispatch on covar_mat=... vs factorization = ...?

andreasnoack commented 5 years ago

No, as a way to specify that you pass in a covariance matrix instead of the "square root" of the covariance.

simonbyrne commented 5 years ago

I had planned to tackle this after #823: my general plan was to get rid of PDMats and just work with factorizations (or Diagonals), and make use of FillArrays.jl, but i'm pretty short on time at the moment...

tpapp commented 2 years ago

Revisiting this issue, reflecting on the experience with LogExpFunctions.jl. A year after the suggestion (https://github.com/JuliaStats/StatsFuns.jl/issues/46) to move it out of StatsFuns.jl 4 years ago, the code was transferred into a separate package.

Despite its small size, the package got quite a few PRs. For the authors, a smaller package is easier to contribute to. For the maintainers (especially @devmotion, who did a lot of work), a smaller codebase allows PRs to get reviewed and merged relatively quickly. I think this is the key factor; dependencies are kind of secondary, but of course it is always nice to have fewer of them.

Practically, I would recommend addressing #1139 first, then splitting into uni- and multivariate distributions.