JuliaStats / Distributions.jl

A Julia package for probability distributions and associated functions.
Other
1.1k stars 414 forks source link

What `Distributions.pdf` returns, and when? #1820

Open aplavin opened 9 months ago

aplavin commented 9 months ago

The docstring simply states:

  pdf(d::UnivariateDistribution, x::Real)

  Evaluate the probability density (mass) at x.

This surely isn't enough to know when it returns density, and when mass. I thought I kinda knew the answer to this question, but recently turned out I was wrong.

I've always thought it returns density for Continous distributions, and mass for Discrete. But actually there are distributions for which pdf(d, x) sometimes returns mass and sometimes density – for the same d. One example is censored(...) distribution.

Would be nice to add some clarification to the pdf docstring to make writing generic code feasible. Currently, pdf doesn't guarantee anything useful unless one only considers specific individual distributions. Code like f(d) = ... pdf(d, x) ... without explicit constrains on d is impossible to reason about.

devmotion commented 9 months ago

It always returns "the" probability density, the main question is just with respect to which base measure. Typically for discrete discrete distributions the base measure is the counting measure, which implies that the density function coincides with the probability mass function (see eg https://en.wikipedia.org/wiki/Probability_mass_function#Measure_theoretic_formulation).

aplavin commented 9 months ago

That's what I thought originally: regular reals measure for Continous, counting measure for Discrete. But what the measure is for censored() then?

sethaxen commented 9 months ago

Censored is an example of a probability measure of mixed type (see e.g. https://www.randomservices.org/random/dist/Mixed.html). While a density function is generally a Radon-Nikodym derivative of a probability measure wrt another measure (which is the measure wrt which it is absolutely continuous), I don't think this concept extends easily to mixed type probability measures. However, they have a clear notion of partial density, which depending on the base measure would give either the density of the continuous component or the discrete component.

For mixture distributions in general it might be more sensible to define the density as the probability-weighted sum of densities of the components wrt their respective measures instead of defining as absolutely continuous wrt some base measure. This definition would still support inference using censored data.

aplavin commented 9 months ago

That's a bit over my head... Let me just share my practical concern and motivation for opening this issue.

It was nice to see that Distributions defines the censored distribution with an interface seemingly consistent with all other distributions. But then I noticed that it's basically impossible to use this distribution with a generic function that does some computations based of pdf/logpdf. For example, plugging censored(...) as part of the prior or likelihood into a bayesian analysis procedure would clearly lead to wrong inferences. pdf(d, x) of d = censored(Normal(0, 1), 0, Inf) is effectively indistinguishable from truncated(Normal(0, 1), 0, Inf) aside from 2x different normalization, and inference results will be the same for both.

Do you think it is possible at all to use Distributions.pdf generically? If so, how to avoid this kind of silently wrong results?

mschauer commented 8 months ago
julia> d = censored(Normal(0, 1), 0, Inf)
Censored(Normal{Float64}(μ=0.0, σ=1.0); lower=0.0, upper=Inf)

julia> pdf(d, 0.0001)
0.3989422784067213

julia> pdf(d, 0)
0.5

julia> pdf(d, -0.0001)
0.0

I think a digestible statement is: pdf(..., x) returns the probability density function or zero except when the probability of getting x exactly is larger than 0 (that is, x is an atom)

aplavin commented 8 months ago

Yeah, that's what it seems to return. Then I believe it's indeed impossible to write a generic function that uses Distributions.pdf() and computes something nontrivial...

mschauer commented 8 months ago

What do you mean with the last statement? Is this good or bad?

aplavin commented 8 months ago

I mean that generic functions that use pdf, like f(dist) = ... pdf(dist, x) ... or f(dist::ContinuousUnivariateDistribution) = ... pdf(dist, x) ..., actually cannot be written. Assuming one wants the results to be correct, of course :) In particular, integration or MCMC sampling cannot be made to work correctly without dispatching on specific distributions manually. Integral of pdf for a ContinuousUnivariateDistribution isn't always 1 as could be expected by some.

Is this good or bad?

I don't think this is good, but such statements are always subjective.

mschauer commented 8 months ago

If you are interested, can you read https://github.com/JuliaStats/Distributions.jl/issues/1468 which also refers to this discussion: https://github.com/JuliaMath/DensityInterface.jl/pull/4#issuecomment-959970063 and see if that somehow relates to your problem?

aplavin commented 8 months ago

Seems like it is related, but doesn't really help much. If that's indeed the case, that pdf() cannot be used correctly in generic functions for now, should this be stated/warned directly in pdf() docstring? Together with this definition by @mschauer:

pdf(..., x) returns the probability density function or zero except when the probability of getting x exactly is larger than 0 (that is, x is an atom)

(if this is correct)

mschauer commented 8 months ago

It would help if you explain on a concrete example what cannot be done.

aplavin commented 8 months ago

Basically any function that uses Distributions.pdf is incorrect for these "mixed-type" distributions. As a more concrete example, take MCMC with the simplest Metropolis sampling: it won't be able to distinguish between eg censored and truncated distributions, while the correct results are clearly different for them.

Actually, I don't know if it is possible to write any function f(dist) or f(dist::ContinuousUnivariateDistribution) that uses pdf() and computes some nontrivial quantity correctly for all dists...

mschauer commented 8 months ago

So this is wrong:

using Distributions
function mh(target, proposal; init=0.0, samples=100_000)
    x = init
    xs = [x]
    for iter in 2:samples
        xᵒ = rand(proposal(x))
        l = logpdf(target, xᵒ) - logpdf(target, x) +  logpdf(proposal(xᵒ), x) -  logpdf(proposal(x), xᵒ)
        if -rand(Exponential()) < l
            x = xᵒ
        end
        push!(xs, x)
    end
    return xs
end

xs = mh(Normal(3, 1.2), x->Normal(x, 0.1), init = 3.0) # correct
xs = mh(censored(Normal(3, 1.2), 2.0, Inf), x->Normal(x, 0.1), init = 3.0) # incorrect, samples a truncated normal.

How can we prevent this? @devmotion @sethaxen