JuliaStats / Distributions.jl

A Julia package for probability distributions and associated functions.
Other
1.1k stars 414 forks source link

RFC: DistributionsBase.jl for packages defining custom distributions #1139

Open tpapp opened 4 years ago

tpapp commented 4 years ago

Distributions.jl is a high-quality implementation of many commonly used distributions, benefiting from continuous contributions and peer review from the members of the Julia community.

While it is natural to contribute commonly used distributions to this package, some distributions may not be generic enough to warrant this (eg distributions used in a very narrow subfield, invented for a specific application and not in widespread use yet, etc). We should nevertheless make it easy for distributions living in other packages to share the API without incurring the cost of dependencies that are used by Distributions.jl. Of course, distributions from such packages could be migrated to this one later on if necessary.

I am proposing that a minimal API is extracted to a small, lightweight package, which could be called DistributionsBase.jl, defining the

  1. functions cdf, pdf, ... that operate on Distributions (from the current export list,

  2. a @reexport_DistributionsBase macro that reexports these, for use by packages to make it easy to maintain a consistent common exported interface.

I am not sure that I would export the type hierarchy in the first pass, as I don't think would be used commonly by packages defining custom distributions. In any case, I would start with the bare minimum, more can be added on demand later.

(related: #525)

nickrobinson251 commented 4 years ago

What's the benefit to the interface being in a separate package? Is the idea that a MyNicheDistribution.jl package would not want to depend on Distributions.jl because it's a large dependency? (Is this a large dependency? does it take a long time to compile?) Or is there another advantage to a separate minimal "base"/"interface" package?

tpapp commented 4 years ago

Yes, Distributions.jl is a somewhat large dependency by necessity (eg ultimately a lot of special functions come from https://github.com/JuliaStats/Rmath.jl, which is a binary dependency, too). A lot of distributions could be much more lightweight.

tpapp commented 4 years ago

An additional benefit for doing something similar in Tables.jl was clarifying all details of the API.

oxinabox commented 4 years ago

A thing is its often hard to define more niche Distributions without depending on Distributions anyway. E.g. I defined PertBeta in terms of the Beta distribution. I could do it without Distributions.jl, but in that case i would end up needing to depend on SpecialFunctions.jl and/or Rmath.jl anyway.

Another thing to bear in mind is it is desirable for Distributions.jl to be a one-stop-shop for all of your distributions. So really, it would be good if niche Distribitions were just PR'd into Distributions.jl

ViralBShah commented 3 years ago

Just curious - what's the latest here? I recently heard a couple of people echoing the sentiment for this sort of refactoring.

tpapp commented 3 years ago

My understanding is that for this particular package, this does not have the support of maintainers so I didn't start any work on it.

Generally I still think that factoring out a generic API for packages like this at some point in their lifecycle is the best approach.

ViralBShah commented 3 years ago

Wondering if @simonbyrne and @andreasnoack can chime in.

cscherrer commented 3 years ago

Some design choices in Distributions can cause problems, like type constraints on constructors (making it unusable with Symbolics) and inconsistent argument checking in the form of the check_args keyword argument (making things difficult for PPL). At the same time, it grabs lots of names, and has a huge number of reverse dependencies.

At this point it's difficult to change the library itself; I've seen perfectly reasonable discussions devolve into bikeshedding. That's not meant to accuse anyone of any bad behavior; it's a natural consequence of a large codebase with a wide range of use-cases.

It's important for the Julia ecosystem to find a balance between standards and agility, and Distributions.jl is very heavily on the side of the former. As a result, there's a strong tendency to hack around its limitations. It has become so heavy-weight that in many cases people start a new package rather than try to update it. This seems to be the case with DistributionsAD.jl, and is certainly the case for MeasureTheory.jl.

I can see a great value in factoring on a DistributionsBase. Ideally, there could even be a way to allow parameterized distributions like we have in MeasureTheory, and make it more natural to build libraries like this as extensions instead of independent projects.

oschulz commented 1 year ago

We could call it DistributionsInterface, to stress that's ist mostly about function definitions, not concrete implementations. But whatever the name, I think having definitions of pdf-, cdf- and similar functions in a lightweight package would be extremely helpful.

ChrisRackauckas commented 1 year ago

Can we at least start by moving some of the abstract types to a package? Things like SciML generally don't need all of the distributions defined but just need to dispatch on AbstractSampler and call logpdf. There's a few interface pieces that could be pulled out and large swaths of the ecosystem will get a pretty big dependency reduction.

oschulz commented 1 year ago

This could also remove some function name redundancy between MeasureBase/MeasureTheory and Distributions, since MeasureBase could depend on DistributionsInterface.

devmotion commented 1 year ago

I'm in favor of extracting an interface (see https://github.com/JuliaStats/Distributions.jl/pull/1641#pullrequestreview-1186761495) but I think one should make sure it is clearly defined and consider that there are multiple things that it seems would be good to change in a next breaking release of Distributions.

Things like SciML generally don't need all of the distributions defined but just need to dispatch on AbstractSampler and call logpdf.

I looked around a bit but it seems e.g. DifferenceEquations.jl, DiffEqFlux.jl, and DiffEqBayes.jl all use concrete distributions such as MvNormal as well, so logpdf wouldn't be sufficient. Is that a general pattern or is there actually a package that could depend only on such an interface package? (There is no type of the name AbstractSampler defined in Distributions.)

oscardssmith commented 1 year ago

The point isn't DiffEqFlux and DiffEqBayes. The problem is that currentlly DiffEqBase relies on Distributions only for https://github.com/SciML/DiffEqBase.jl/blob/3fe9a5b9676caccc1c0ba2184bc1aca9d353a8a2/src/solve.jl#L1085. This one line of code add .8 seconds to the using time of DiffEqBase (a package that has 4s using time in total). Removing this dependency would save a noticable amount of time from people using OrdinaryDiffEq without Distributions (which is fairly common).

ChrisRackauckas commented 1 year ago

DiffEqBase and QuasiMonteCarlo are the two I'm thinking about. They want to dispatch on "if I see a Distribution", but don't need any distributions themselves. And as Oscar says, it's a pretty big hit to support that right now, and that 0.8 seconds can be approximately 0.

devmotion commented 1 year ago

Oscar's example doesn't use logpdf though, hence it did not show up in my search.

This gets a bit off-topic: I didn't know this functionality exists and honestly I am a bit surprised that basically all problem types support distributions as initial values - I wouldn't have thought, e.g., that's supported for e.g. ODEs (and it's not documented in https://docs.sciml.ai/DiffEqDocs/stable/types/ode_types/#SciMLBase.ODEProblem it seems, according to the docstring u0 should be an array or number). handle_distribution_u0 seems also a bit prone to surprising non-reproducibility issues since there is no way to specify an RNG and it will always use the global one, even if one specifies e.g. a seed in solve (https://github.com/SciML/StochasticDiffEq.jl/blob/454bf4ea6e21a0024d2ccbb8f05c72555884501d/src/solve.jl#L67) or possibly some other RNG (if that's supported somewhere). It seems the main reason for handle_distribution_u0 is to support random initial conditions for which a concrete value can be generated with rand. So putting the issue with the RNG aside, maybe one could just call rand on anything that's not a Number or AbstractArray (functions are handled before it seems)? And maybe it's not even needed since ::Functions are supported anyway and one could just use (_, _) -> rand(Normal()) (which would even support custom RNGs and non-Distributions distributions)?

oschulz commented 1 year ago

They want to dispatch on "if I see a Distribution", but don't need any distributions themselves.

This use case may be (partially) solvable using the trait-based DensityInterface.jl API, which both Distributions.jl and MeasureBase.jl/MeasureTheory.jl support already. DensityKind(d) tells if d has or is a density or neither, and logdensityof(d, x) can be used instead of logpdf. DensityInterface.jl has no notion of sampling, though. We could add a "supports rand" trait to DensityInterface.jl though, that might be useful anyway.

Independent of that, I still think and AbstractDistributions.jl would be very useful.

(Note: In DensityInterface.jl, DensityKind(obj) === HasDensity() is essentially equivalent to "obj is a (probability or non-normalized) measure", whereas DensityKind(obj) === IsDensity() is equivalent to "obj is the density of some measure in respect to some reference measure". Likelihoods also fall under IsDensity(). The default is DensityKind(::Any) === NoDensity().)

ChrisRackauckas commented 1 year ago

This gets a bit off-topic: I didn't know this functionality exists and honestly I am a bit surprised that basically all problem types support distributions as initial values - I wouldn't have thought, e.g., that's supported for e.g. ODEs (and it's not documented in https://docs.sciml.ai/DiffEqDocs/stable/types/ode_types/#SciMLBase.ODEProblem it seems, according to the docstring u0 should be an array or number). handle_distribution_u0 seems also a bit prone to surprising non-reproducibility issues since there is no way to specify an RNG and it will always use the global one, even if one specifies e.g. a seed in solve (https://github.com/SciML/StochasticDiffEq.jl/blob/454bf4ea6e21a0024d2ccbb8f05c72555884501d/src/solve.jl#L67) or possibly some other RNG (if that's supported somewhere). It seems the main reason for handle_distributionu0 is to support random initial conditions for which a concrete value can be generated with rand. So putting the issue with the RNG aside, maybe one could just call rand on anything that's not a Number or AbstractArray (functions are handled before it seems)? And maybe it's not even needed since ::Functions are supported anyway and one could just use (, _) -> rand(Normal()) (which would even support custom RNGs and non-Distributions distributions)?

It came before those featurse, but indeed we could deprecate it. It is off topic though and we should discuss this in DiffEqBase.