JuliaStats / Distributions.jl

A Julia package for probability distributions and associated functions.
Other
1.1k stars 414 forks source link

Unitful support? #1413

Open NAThompson opened 2 years ago

NAThompson commented 2 years ago

Using Unitful v1.9.1 and Distributions v0.25.23, it appears Unitful numbers are not supported:

julia> using Unitful
julia> using Distributions
julia> μ = 1.0u"s"
julia> σ = 0.01u"s"
julia> Normal(μ, σ)
ERROR: MethodError: no method matching Normal(::Quantity{Float64, 𝐓, Unitful.FreeUnits{(s,), 𝐓, nothing}}, ::Quantity{Float64, 𝐓, Unitful.FreeUnits{(s,), 𝐓, nothing}})

Could this be supported without undue pain?

tfiers commented 2 years ago

I am wondering the same.

Replacing some Reals by Numbers, and 0s by zero(x)s, already makes a surprising number of things work:

> using Distributions
> using Unitful: mV, Hz, s

> n = Normal(0mV, 1mV)
Normal{Unitful.Quantity{…}}(μ=0 mV, σ=1 mV)

> pdf(n, 3mV)
0.0044318484119380075 mV^-1

> λ = 4Hz;
> e = Exponential(1/λ)
Exponential{Unitful.Quantity{Float64, 𝐓, Unitful.FreeUnits{(Hz^-1,), 𝐓, nothing}}}  (θ=0.25 Hz^-1)

> rand(e) |> s
0.6119753615525725 s

How fun!

Of course more patches are needed. And some support from Unitful's side (randn(::AbstractRNG, ::Type{Quantity{T,D,U}) eg). And some distributions will need an extra "scale" or "units" parameter: μ and σ of LogNormal's default parametrisation are unitless, and do not give enough information to implement cdf and rand etc with units.

devmotion commented 2 years ago

Just replacing Real with Number is not sufficient for Normal as the example above shows - pdf, logpdf, cdf, logcdf, etc. should all return unitless numbers. Thus unfortunately I assume one would have to modify each distribution very carefully and hence probably it's easier to change each one separately.

tfiers commented 2 years ago

pdf does have units, namely the reciprocal of the independent variable's. (As the pdf is a density, we have eg 0.24 probability mass per 1 mV, at some voltage).

You can see this by doing dimensional analysis on pdf expressions. Eg for Exponential: pdf(x) = 1/θ exp(–x/θ). The scale parameter θ has the same units as x, so the pdf has one over these units. Or for Normal: pdf(x) =

where both μ and σ are in data units.

cdf is unitless (it's a probability, and also the integral of pdf(x) * dx, where dx has data units; another way to see what the units of pdf should be). logcdf is then of course also unitless. logpdf is unitless as well (as is every logarithm), but a conversion factor is needed (to get eg log(0.24/mV * mV)).

Samples (rand) and quantiles of course have units, as do various summary statistics.

Indeed many patches besides Real -> Number are needed. (Although this simple change already goes a long way, thanks to Julia's generic algorithm culture. But Distributions still has a bunch of non generic code, like 0 instead of zero, or eltype(::Sampleable) being hardcoded as Int/Float64). And of course an extensive test suite is a must. I also agree that each distribution needs to be manually examined.

devmotion commented 2 years ago

pdf does have units, namely the reciprocal of the independent variable's.

Not in general but only for distributions without atoms - ie. eg. not for DiscreteDistribution, censored distributions (open PR), or the Normal(0, 0) limit case (which is supported).

Thus it seems supporting the limit case Normal(0, 0) or censored distributions created from ContinuousDistributions would lead to type instabilities or inconsistencies.

(Clearly, logpdf is always unitless but one has to deal with a possibly arbitrary unit constant.)

sethaxen commented 2 years ago

It's unfortunate we use pdf to refer to the PDF and the PMF. For a discrete distribution, it's returning probabilities so it should be unit less, but I think for mixtures of continuous and discrete distributions, where it's a density function, the PMF of the atomic component should be implicitly multiplied by a Dirac delta function so that expectations could be taken via integrals. And by it's definition of integrating to 1, Dirac delta has units of the inverse of the variable.

devmotion commented 2 years ago

Another - but probably non-standard - approach would be to choose different base measures (they're all implicit anyway currently) that ensure that pdf is always unitless. That shifts the unit constant choice from logpdf to pdf and would work with atoms and Diracs without unit. This might seem a bit strange but in the case of Normal it would be similar to a Lebesgue measure that is scaled by 1/sqrt(2pi*sigma^2) as base measure.

devmotion commented 2 years ago

It's unfortunate we use pdf to refer to the PDF and the PMF. For a discrete distribution, it's returning probabilities so it should be unit less

It's a density in the discrete case as well, just with respect to the counting measure. So the name pdf makes sense. Nevertheless it would be helpful for eg. Truncated and Censored if one could query the (possibly zero) probability of a specific value.

KronosTheLate commented 2 years ago

+1 that unitful support would be cool ^_^

tfiers commented 1 year ago

I gave this a go a while back: https://github.com/JuliaStats/Distributions.jl/compare/master...tfiers:Distributions.jl:master

I implemented unit support for Exponential and LogNormal, and wrote tests and documentation.

Changes done to make this work:

I made sure that there was no change in current usage of Distributions.jl. Most unit operations are done in the type-domain, so the runtime impact should be minimal or zero (though I have not benchmarked that).

I started a docs page with a list of "Supported distributions" (just the two), and a guide for how to add unit support to a new distribution.

tfiers commented 1 year ago

supporting the limit case Normal(0, 0)

pdf(Normal(0mV, σ)) for σ → 0mV should be ∞ / mV no? I.e. same units (mV⁻¹) as for other values, so type stable.

tfiers commented 1 year ago

using Unitful currently takes 1 to 3 seconds (1.0 seconds on replit, 1.6 to 2.6 seconds on my laptop).

Therefore, to minimize the impact of unit support in Distributions.jl on the 99% of users that don't need it, there first needs to be created a lightweight package defining just the interface/API for working with units (UnitfulCore or UnitsInterface or sth like that). StaticArrays with StaticArraysCore is an example of this principle.

Instead of importing the full Unitful package, Distributions will then instead import just the interface package, with minimal impact on load time.

tfiers commented 1 year ago

An update, from the discussion at

with Julia 1.9's 'extension packages' (https://pkgdocs.julialang.org/dev/creating-packages/#Conditional-loading-of-code-in-packages-(Extensions)), a unitful core or interface package is no longer needed, and the full Unitful would be added as a [weakdeps], and there would presumably be a UnitfulExt module and entry under [extensions] in Distributions.jl's Project.toml.

Though, given that is 1.9+ only, I'm not sure how this interacts with Distributions.jl's current support for Julias down to 1.0


UPDATE: See this helpful summary by mbauman and chrisr that anwers these questions: https://discourse.julialang.org/t/package-extensions-for-julia-1-9/93397/5?u=tfiers