dcjones opened this issue 3 years ago (Open)
Ok, I understand this a bit more and was able to get correct `rand(::Gamma)` gradients using the Figurnov et al. technique by adding a custom Zygote adjoint and writing a version of `gamma_inc` that works with AD. Then `rand(::Beta)` comes for free. No new types required!

I'll make a PR if this sounds like something useful for Turing.
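For context, the trick in the Figurnov et al. implicit reparameterization approach is to differentiate the sampler through its CDF: holding the underlying uniform draw `u = F(z; α)` fixed, the implicit function theorem gives

```math
\frac{\partial F}{\partial z}\frac{\partial z}{\partial \alpha} + \frac{\partial F}{\partial \alpha} = 0
\quad\Longrightarrow\quad
\frac{\partial z}{\partial \alpha} = -\frac{\partial F/\partial \alpha}{\partial F/\partial z}
```

For `Gamma(α, θ)` the CDF is `P(α, z/θ)` with `P` the regularized lower incomplete gamma function, which is why a differentiable `gamma_inc` is the missing piece.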
Great! This definitely sounds useful. It would be even better to add an adjoint via ChainRules instead of Zygote (ChainRules is the new way of defining forward- and reverse-mode rules for different AD backends and is already used by Zygote).
I think a more general ChainRules adjoint may be blocked by https://github.com/JuliaDiff/ChainRulesCore.jl/issues/68. The adjoint is peculiar and relies on running AD on the incomplete gamma function, and it looks like there's not currently a way of doing that without assuming a specific AD system.
So I think it has to be for a specific package, then it can be generalized once ChainRules supports it.
Yeah, I've run into this issue before. But if it is only because the implementation relies on running AD on the incomplete gamma function, wouldn't it be even better to add the adjoint for the incomplete gamma function to https://github.com/JuliaDiff/ChainRules.jl/blob/master/src/rulesets/packages/SpecialFunctions.jl instead of relying on a specific AD backend?
Well, `rand(::Gamma)` doesn't come automatically from `gamma_inc`. The trick is pretty simple (code below). The hard part is that `SpecialFunctions.gamma_inc` mutates arrays and doesn't work with AD, so I implemented a (probably somewhat inferior) algorithm in `_gamma_inc_lower` that does.

I'm just learning this stuff, so I'm very open to a better way of handling this.
```julia
ZygoteRules.@adjoint function Distributions.rand(rng::AbstractRNG, d::Gamma{T}) where {T<:Real}
    z = rand(rng, d)
    function rand_gamma_pullback(c)
        y = z / d.θ
        ∂α, ∂y = gradient(_gamma_inc_lower, d.α, y)
        return (DoesNotExist(),
                (α = (-d.θ * ∂α / ∂y) * c,
                 θ = y * c))
    end
    return z, rand_gamma_pullback
end
```
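For reference, a non-mutating `_gamma_inc_lower` along these lines could be built from the standard power series for the regularized lower incomplete gamma function. This is a hedged sketch of the idea, not the implementation from the comment above (the series form converges well for roughly `x ≲ a + 1`):

```julia
using SpecialFunctions: loggamma  # only used for the Γ(a) normalization

# Regularized lower incomplete gamma P(a, x) via the series
#   P(a, x) = x^a e^{-x} / Γ(a) * Σ_{n≥0} x^n / (a (a+1) ⋯ (a+n)),
# written without mutation so AD systems can trace through it.
function _gamma_inc_lower(a, x; rtol=1e-12, maxiter=10_000)
    x <= 0 && return zero(float(x))
    term = one(x) / a          # n = 0 term of the sum
    total = term
    for n in 1:maxiter
        term *= x / (a + n)    # extends the product in the denominator
        total += term
        abs(term) < rtol * abs(total) && break
    end
    return total * exp(a * log(x) - x - loggamma(a))
end
```

As a sanity check, `_gamma_inc_lower(1.0, 1.0)` should match `1 - exp(-1)`, since `P(1, x) = 1 - e^{-x}`.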
I don't know if this is exactly the same issue; I was trying to use autodiff in an optimizer with an objective function that uses the Gamma distribution, but it chokes at `gamma_inc`:
```
ERROR: MethodError: no method matching _gamma_inc(::ForwardDiff.Dual{ForwardDiff.Tag{var"#11#12", Float64}, Float64, …}, ::ForwardDiff.Dual{ForwardDiff.Tag{var"#11#12", Float64}, Float64, …}, ::Int64)
Stacktrace:
 [1] gamma_inc(a::ForwardDiff.Dual{ForwardDiff.Tag{var"#11#12", Float64}, Float64, 7}, x::ForwardDiff.Dual{ForwardDiff.Tag{var"#11#12", Float64}, Float64, 7}, ind::Int64) (repeats 2 times)
   @ SpecialFunctions C:\Users\nicho\.julia\packages\SpecialFunctions\CQMHW\src\gamma_inc.jl:858
 [2] gammacdf(k::ForwardDiff.Dual{ForwardDiff.Tag{var"#11#12", Float64}, Float64, 7}, θ::ForwardDiff.Dual{ForwardDiff.Tag{var"#11#12", Float64}, Float64, 7}, x::ForwardDiff.Dual{ForwardDiff.Tag{var"#11#12", Float64}, Float64, 7})
   @ StatsFuns C:\Users\nicho\.julia\packages\StatsFuns\6HmgG\src\distrs\gamma.jl:34
```
Or is this not expected to work at all?
It seems this is caused by a call of `cdf(Gamma(...), ...)` or something similar? Such calls are forwarded to `gammacdf` in StatsFuns. In StatsFuns >= 1.0.0 we use Julia implementations instead of Rmath implementations there, which call `SpecialFunctions.gamma_inc`. However, there's no method implemented for `ForwardDiff.Dual` numbers yet; it would require fixing https://github.com/JuliaDiff/ForwardDiff.jl/issues/424, as outlined in https://github.com/JuliaDiff/ForwardDiff.jl/issues/424#issuecomment-558627378 (similar to https://github.com/JuliaDiff/ForwardDiff.jl/pull/585).
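For reference, a `ForwardDiff.Dual` method for `gamma_inc` would need the partial derivatives of `P(a, x)`. The `x`-partial has a simple closed form (it is just the Gamma density), while the `a`-partial does not, which is part of what makes the linked issue nontrivial:

```math
\frac{\partial P(a, x)}{\partial x} = \frac{x^{a-1} e^{-x}}{\Gamma(a)},
\qquad
\frac{\partial P(a, x)}{\partial a} \text{ has no elementary closed form.}
```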
Trying to compute gradients of the `rand` function with respect to parameters for certain distributions will produce incorrect results, because some of these functions use branching or iterative algorithms and AD can't take into account how the parameters affect control flow. A simple demonstration of this is trying to estimate d/dθ E[x] by estimating E[d/dθ x]:

- `Normal` of course works: d/dμ E[x] = d/dμ μ = 1 (for any values of μ, σ).
- `Gamma` will not return a gradient for some values, and returns incorrect results for others: e.g. d/dα E[x] = d/dα αβ = β, yet the estimate through `rand` disagrees.
- `Beta` similarly: d/dα E[x] = d/dα α/(α+β) = β/(α+β)², yet the estimate through `rand` disagrees.

It's well known that some distributions (e.g. Gamma, Beta, Dirichlet) don't lend themselves easily to this kind of pathwise gradient, which makes them infrequently used as surrogate posteriors for VI, but there have been some papers on trying to work around this using numerical approximations and other techniques (see, for example, the Figurnov et al. technique mentioned earlier in this thread).
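The implicit reparameterization workaround (the Figurnov et al. technique from earlier in this thread) can be sanity-checked numerically against the quantile function; a minimal sketch, using finite differences in place of AD:

```julia
using Distributions

# Implicit reparameterization for Gamma(α, 1): holding u = F(z; α) fixed,
#   dz/dα = -(∂F/∂α) / (∂F/∂z),   where ∂F/∂z is just the pdf.
α, u, h = 2.0, 0.3, 1e-6

zq(a) = quantile(Gamma(a, 1.0), u)           # z as a function of α at fixed u
dz_direct = (zq(α + h) - zq(α - h)) / (2h)   # finite-difference reference

z0 = zq(α)
dF_dα = (cdf(Gamma(α + h, 1.0), z0) - cdf(Gamma(α - h, 1.0), z0)) / (2h)
dF_dz = pdf(Gamma(α, 1.0), z0)
dz_implicit = -dF_dα / dF_dz                 # the implicit-gradient formula

# dz_direct and dz_implicit agree to finite-difference accuracy.
```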
I'd love to help improve the `rand` situation, but I'm still getting my bearings with this code, so I was hoping for some pointers. My vague thought was that there might be a `TuringGamma`, `TuringBeta`, etc. that implement alternative `rand` functions that are correctly differentiated. Is there a nicer approach, or is this the best option?

Second, for distributions where there is no viable way to AD `rand`, is there something better that can be done than report incorrect numbers? Should the remedy be in Distributions?

(Related issue is #113)