JuliaStats / Distributions.jl

A Julia package for probability distributions and associated functions.
Other
1.1k stars 414 forks source link

Add Generalized chi-squared distribution #1806

Open heliosdrm opened 10 months ago

heliosdrm commented 10 months ago

This extends the collection of univariate continuous distributions with the Generalized chi-squared. All required methods are added, and some of the recommended ones, as indicated in https://juliastats.org/Distributions.jl/stable/extends/#Univariate-Distribution

Some notes:

  1. I started considering Abhranil Das' code for Matlab (MIT-licensed) as reference, although progressively departed from it, so that eventually there is little in the Julia code that can be attributed to him. I have kept a comment where I mention the inspiration to define the cdf function, using Davies' algorithm to integrate the characteristic function. But I don't know if a more explicit attribution in the code or in the license should be made.
  2. There are several auxiliary functions that should not be part of the API, and I have encapsulated all of them in the module GChisqComputations, after the example of the Chernoff distribution.
  3. That module includes two functions to calculate the characteristic function: cf_explicit that implements the explicit formula, and cf_inherit that composes the results of cf(::Normal) and cf(::NoncentralChisq). Both are equivalent, and the code of the second is clearer, but cf(::GeneralizedChisq) calls the first one, because in a few tests that I have done, it is slightly (but not much) faster.
  4. cdf and pdf are calculated by integration, using the Gil-Pelaez theorem (the algorithm for the CDF is given in Davies' paper, and the algorithm for the PDF has been indirectly derived from it). The integration is done using QuadGK.jl, which was already in the dependencies of Distributions.jl. The default tolerances of QuadGK.quadgk made the calculation of cdf at the median very slow, because there the integral is zero. Since in the ends of the distribution the integral is ±π/2, which is in the order of magnitude of the unit, I have used a fixed absolute tolerance of eps(one(T)) for the integration in the calculation of the CDF. I have not, however, changed the defaults for the calculation of the PDF.
  5. quantile is calculated using Distributions.quantile_newton. However, there is no analytic solution to find the mode, which is the recommended starting point, so this is searched using the following strategy: (1) start at the mean of the distribution, and use it if it meets the convergence criteria; (2) otherwise define a bracket between that point and another at one standard deviation from the mean, in the direction towards the target probability (and if the bracket does not contain the target probability, extend it until it does); (3) bisect the bracket until one of the ends meets the convergence criteria, and then use that as the starting point. This seems to work fine, but it's a self-made algorithm, and perhaps there is a more efficient one that might be used instead.
codecov-commenter commented 10 months ago

Codecov Report

Attention: Patch coverage is 96.71053% with 5 lines in your changes missing coverage. Please review.

Project coverage is 86.19%. Comparing base (7e232ca) to head (3ab5b06). Report is 34 commits behind head on master.

Files Patch % Lines
src/univariate/continuous/generalizedchisq.jl 96.71% 5 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #1806 +/- ## ========================================== + Coverage 86.00% 86.19% +0.18% ========================================== Files 144 145 +1 Lines 8646 8798 +152 ========================================== + Hits 7436 7583 +147 - Misses 1210 1215 +5 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

mahaa2 commented 6 months ago

When this will be added to the main branch ?

mahaa2 commented 6 months ago

Hi,

I have played around with this extension. As a Statistician, I strongly believe it is a very useful and very important distribution to have in main branch of the package Distribution.jl.

In general it seems to work very well and I will use for future research.

However, I notice that while doing my own tests the quadgk.jl package/function used inside the file generalizedchisq.jl appears to fail accusing the error (line 260 of that file)

" DomainError with 0.9999999999999964: integrand produced NaN in the interval (0.9999999999999929, 1.0) " ,

in the following explained cases.

Lets say that X ~ N_(µ, Σ) with particular vector dimension D = 2, 3, 4. Then we now know that Y = q0 + q1'X + X'QX ~GChisq(w, ν, λ, m, s) (the notation in the GChisq package is different, but also irrelevant here).

The parameter values (w, ν, λ, m, s) = T(q0, q1, Q, µ, Σ) are obtained accordingly with the transformation, say T, following the paper (I present below my own code to replicate the experiments in Julia).

One can just run the below code and check the results. The error also seems to be exactly the same one appearing in here.


mutable struct parGChisq
    w::Vector{<: Real}
    ν::Vector{<: Real}
    λ::Vector{<: Real}
    m::Real
    s::Real
end

function mapGChisqpar(
    q0::Real,
    q1::Vector{<: Real},
    Q ::AbstractMatrix,
    µ ::Vector{<: Real},
    Σ ::AbstractMatrix,
    ch::Bool = false)::parGChisq

    """
     if   X ~ N(µ, Σ)
     then Υ = q0 + q1' X + X'QX ~ GChisq(w, ν, λ, m, s)

     see https://arxiv.org/pdf/2012.14331.pdf

     the option with ch = true
     does not work. why ? 
    """ 

    # check dimensionality match
    dimQ = size(Q, 1)
    dimq = length(q1)
    dimΣ = size(Σ, 1)
    dimµ = length(µ)

    dimQ !== dimq ? @error("dims mismatch Q, q1") : nothing
    dimΣ !== dimq ? @error("dims mismatch Σ, q1") : nothing
    dimΣ !== dimµ ? @error("dims mismatch Σ, µ") : nothing

    # check matrix conditions
    !issymmetric(Q) ? @error("Q is not symmetric") : nothing
    !isposdef(Σ)    ? @error("Σ is not PD") : nothing

    # take the square root matrix 
    S = !ch ? sqrt(Σ) : cholesky(Σ).L

    # standartize the space
    t_Q² = !ch ? Symmetric(S * Q * S) : Symmetric(S * Q * S')
    t_q1 = S * (2.0 * Q * µ + q1)
    t_q0 = q0 + dot(q1, µ) + dot(µ, Q, µ) 

    # get the generalized Chi-squared parameters
    d, R = eigen(t_Q²)
    d .= trunc.(d, digits = 9)
    # d[abs.(d) .< eps()] .= 0.0

    # compute b terms
    b = R' * t_q1

    # unique nonzeros eigenvalues
    # the unique function returns an ordered set
    iz = d .!= 0.0
    w  = unique(d[iz])
    ew = !isempty(w) 

    # warn if there are only zero eigenvalues
    ew ? nothing : @warn("Y will follow a Gaussian distribution") 

    # total dof of each nonzero eigenvalue
    ν = ew ? [sum(d .== i) for i in w] : Float64[]

    # total non-centrality for each nonzero eigenvalue
    λ = ew ? [sum(b[d .== x].^2.0) for x in w] ./ (4.0.*w.^2.0) : Float64[]

    # Gaussian parameters
    m = t_q0 - dot(w, λ)
    s = norm(b[.!iz]) 

    return parGChisq(w, ν, λ, m, s)
end

using Distributions, 
      LinearAlgebra,
      Plots

# tests
# dimension
D = 3 # 2 or 4

# set parameter of the const + linear + quadratic form
q0 = randn()
q1 = rand(D)
Q  = Diagonal(randn(D))
µ  = randn(D)
Σ  = rand(Wishart(D, Matrix(I, D, D)), 1)[];

# set the correspondent parameters of the GChisq
th = mapGChisqpar(q0, q1, Q, µ, Σ)

# set the GChisq distribution
π = GeneralizedChisq(th.w, th.ν, th.λ, th.m, th.s)

# test
@time pdf(π, mean(π))