Open tpapp opened 4 years ago
What's the benefit to the interface being in a separate package? Is the idea that a MyNicheDistribution.jl package would not want to depend on Distributions.jl because it's a large dependency? (Is this a large dependency? does it take a long time to compile?) Or is there another advantage to a separate minimal "base"/"interface" package?
Yes, Distributions.jl is a somewhat large dependency by necessity (eg ultimately a lot of special functions come from https://github.com/JuliaStats/Rmath.jl, which is a binary dependency, too). A lot of distributions could be much more lightweight.
An additional benefit for doing something similar in Tables.jl was clarifying all details of the API.
A thing is its often hard to define more niche Distributions without depending on Distributions anyway.
E.g. I defined PertBeta in terms of the Beta
distribution.
I could do it without Distributions.jl, but in that case i would end up needing to depend on SpecialFunctions.jl and/or Rmath.jl anyway.
Another thing to bear in mind is it is desirable for Distributions.jl to be a one-stop-shop for all of your distributions. So really, it would be good if niche Distribitions were just PR'd into Distributions.jl
Just curious - what's the latest here? I recently heard a couple of people echoing the sentiment for this sort of refactoring.
My understanding is that for this particular package, this does not have the support of maintainers so I didn't start any work on it.
Generally I still think that factoring out a generic API for packages like this at some point in their lifecycle is the best approach.
Wondering if @simonbyrne and @andreasnoack can chime in.
Some design choices in Distributions can cause problems, like type constraints on constructors (making it unusable with Symbolics) and inconsistent argument checking in the form of the check_args
keyword argument (making things difficult for PPL). At the same time, it grabs lots of names, and has a huge number of reverse dependencies.
At this point it's difficult to change the library itself; I've seen perfectly reasonable discussions devolve into bikeshedding. That's not meant to accuse anyone of any bad behavior; it's a natural consequence of a large codebase with a wide range of use-cases.
It's important for the Julia ecosystem to find a balance between standards and agility, and Distributions.jl is very heavily on the side of the former. As a result, there's a strong tendency to hack around its limitations. It has become so heavy-weight that in many cases people start a new package rather than try to update it. This seems to be the case with DistributionsAD.jl, and is certainly the case for MeasureTheory.jl.
I can see a great value in factoring on a DistributionsBase. Ideally, there could even be a way to allow parameterized distributions like we have in MeasureTheory, and make it more natural to build libraries like this as extensions instead of independent projects.
We could call it DistributionsInterface, to stress that's ist mostly about function definitions, not concrete implementations. But whatever the name, I think having definitions of pdf-, cdf- and similar functions in a lightweight package would be extremely helpful.
Can we at least start by moving some of the abstract types to a package? Things like SciML generally don't need all of the distributions defined but just need to dispatch on AbstractSampler
and call logpdf
. There's a few interface pieces that could be pulled out and large swaths of the ecosystem will get a pretty big dependency reduction.
This could also remove some function name redundancy between MeasureBase/MeasureTheory and Distributions, since MeasureBase could depend on DistributionsInterface.
I'm in favor of extracting an interface (see https://github.com/JuliaStats/Distributions.jl/pull/1641#pullrequestreview-1186761495) but I think one should make sure it is clearly defined and consider that there are multiple things that it seems would be good to change in a next breaking release of Distributions.
Things like SciML generally don't need all of the distributions defined but just need to dispatch on
AbstractSampler
and calllogpdf
.
I looked around a bit but it seems e.g. DifferenceEquations.jl, DiffEqFlux.jl, and DiffEqBayes.jl all use concrete distributions such as MvNormal
as well, so logpdf
wouldn't be sufficient. Is that a general pattern or is there actually a package that could depend only on such an interface package? (There is no type of the name AbstractSampler
defined in Distributions.)
The point isn't DiffEqFlux and DiffEqBayes. The problem is that currentlly DiffEqBase
relies on Distributions
only for https://github.com/SciML/DiffEqBase.jl/blob/3fe9a5b9676caccc1c0ba2184bc1aca9d353a8a2/src/solve.jl#L1085. This one line of code add .8 seconds to the using time of DiffEqBase
(a package that has 4s using time in total). Removing this dependency would save a noticable amount of time from people using OrdinaryDiffEq
without Distributions
(which is fairly common).
DiffEqBase and QuasiMonteCarlo are the two I'm thinking about. They want to dispatch on "if I see a Distribution", but don't need any distributions themselves. And as Oscar says, it's a pretty big hit to support that right now, and that 0.8 seconds can be approximately 0.
Oscar's example doesn't use logpdf
though, hence it did not show up in my search.
This gets a bit off-topic: I didn't know this functionality exists and honestly I am a bit surprised that basically all problem types support distributions as initial values - I wouldn't have thought, e.g., that's supported for e.g. ODEs (and it's not documented in https://docs.sciml.ai/DiffEqDocs/stable/types/ode_types/#SciMLBase.ODEProblem it seems, according to the docstring u0
should be an array or number). handle_distribution_u0
seems also a bit prone to surprising non-reproducibility issues since there is no way to specify an RNG and it will always use the global one, even if one specifies e.g. a seed in solve
(https://github.com/SciML/StochasticDiffEq.jl/blob/454bf4ea6e21a0024d2ccbb8f05c72555884501d/src/solve.jl#L67) or possibly some other RNG (if that's supported somewhere). It seems the main reason for handle_distribution_u0
is to support random initial conditions for which a concrete value can be generated with rand
. So putting the issue with the RNG aside, maybe one could just call rand
on anything that's not a Number
or AbstractArray
(functions are handled before it seems)? And maybe it's not even needed since ::Function
s are supported anyway and one could just use (_, _) -> rand(Normal())
(which would even support custom RNGs and non-Distributions distributions)?
They want to dispatch on "if I see a Distribution", but don't need any distributions themselves.
This use case may be (partially) solvable using the trait-based DensityInterface.jl API, which both Distributions.jl and MeasureBase.jl/MeasureTheory.jl support already. DensityKind(d)
tells if d
has or is a density or neither, and logdensityof(d, x)
can be used instead of logpdf
. DensityInterface.jl has no notion of sampling, though. We could add a "supports rand" trait to DensityInterface.jl though, that might be useful anyway.
Independent of that, I still think and AbstractDistributions.jl would be very useful.
(Note: In DensityInterface.jl, DensityKind(obj) === HasDensity()
is essentially equivalent to "obj
is a (probability or non-normalized) measure", whereas DensityKind(obj) === IsDensity()
is equivalent to "obj
is the density of some measure in respect to some reference measure". Likelihoods also fall under IsDensity()
. The default is DensityKind(::Any) === NoDensity()
.)
This gets a bit off-topic: I didn't know this functionality exists and honestly I am a bit surprised that basically all problem types support distributions as initial values - I wouldn't have thought, e.g., that's supported for e.g. ODEs (and it's not documented in https://docs.sciml.ai/DiffEqDocs/stable/types/ode_types/#SciMLBase.ODEProblem it seems, according to the docstring u0 should be an array or number). handle_distribution_u0 seems also a bit prone to surprising non-reproducibility issues since there is no way to specify an RNG and it will always use the global one, even if one specifies e.g. a seed in solve (https://github.com/SciML/StochasticDiffEq.jl/blob/454bf4ea6e21a0024d2ccbb8f05c72555884501d/src/solve.jl#L67) or possibly some other RNG (if that's supported somewhere). It seems the main reason for handle_distributionu0 is to support random initial conditions for which a concrete value can be generated with rand. So putting the issue with the RNG aside, maybe one could just call rand on anything that's not a Number or AbstractArray (functions are handled before it seems)? And maybe it's not even needed since ::Functions are supported anyway and one could just use (, _) -> rand(Normal()) (which would even support custom RNGs and non-Distributions distributions)?
It came before those featurse, but indeed we could deprecate it. It is off topic though and we should discuss this in DiffEqBase.
Distributions.jl is a high-quality implementation of many commonly used distributions, benefiting from continuous contributions and peer review from the members of the Julia community.
While it is natural to contribute commonly used distributions to this package, some distributions may not be generic enough to warrant this (eg distributions used in a very narrow subfield, invented for a specific application and not in widespread use yet, etc). We should nevertheless make it easy for distributions living in other packages to share the API without incurring the cost of dependencies that are used by Distributions.jl. Of course, distributions from such packages could be migrated to this one later on if necessary.
I am proposing that a minimal API is extracted to a small, lightweight package, which could be called DistributionsBase.jl, defining the
functions
cdf
,pdf
, ... that operate on Distributions (from the current export list,a
@reexport_DistributionsBase
macro that reexports these, for use by packages to make it easy to maintain a consistent common exported interface.I am not sure that I would export the type hierarchy in the first pass, as I don't think would be used commonly by packages defining custom distributions. In any case, I would start with the bare minimum, more can be added on demand later.
(related: #525)