TuringLang / Turing.jl

Bayesian inference with probabilistic programming.
https://turinglang.org
MIT License

Performance regression for BernoulliLogit #1934

Open · torfjelde opened this issue 1 year ago

torfjelde commented 1 year ago

I was just playing around a bit with https://github.com/torfjelde/TuringBenchmarking.jl and noticed a sudden change in the runtimes described in the README (the example model is suddenly 16x slower for gradient evaluation with ReverseDiff in compiled mode).

I eventually narrowed it down to #1892 being the cause, i.e. the performance of the following model:

@model function irt(y, i, p; I = maximum(i), P = maximum(p))
    theta ~ filldist(Normal(), P)
    beta ~ filldist(Normal(), I)
    Turing.@addlogprob! sum(logpdf.(BernoulliLogit.(theta[p] - beta[i]), y))

    return (; theta, beta)
end

absolutely tanks for ReverseDiff when we use the implementation of BernoulliLogit from Distributions.jl :confused:
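
For reference, a minimal sketch of how such a suite can be set up with TuringBenchmarking.jl; the synthetic data, the names nitems/npersons, and the adbackends keyword are assumptions for illustration rather than the exact setup used for the numbers below:

using Turing, TuringBenchmarking, BenchmarkTools
using ReverseDiff
using Random

# Hypothetical synthetic IRT data: every person answers every item once.
Random.seed!(42)
nitems, npersons = 20, 100
i = repeat(1:nitems, inner = npersons)   # item index per observation
p = repeat(1:npersons, outer = nitems)   # person index per observation
y = rand(0:1, length(i))                 # binary responses

model = irt(y, i, p)

# Benchmark plain evaluation plus the two AD backends of interest, in both the
# linked (transformed) and not-linked parameter spaces.
suite = TuringBenchmarking.make_turing_suite(
    model;
    adbackends = [
        Turing.Essential.ReverseDiffAD{true}(),      # ReverseDiff with a compiled tape
        Turing.Essential.ForwardDiffAD{40, true}(),  # ForwardDiff, chunk size 40
    ],
)
run(suite)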

On Turing@0.21.12:

┌ Info: Turing.jl
│   run(suite) =
│    2-element BenchmarkTools.BenchmarkGroup:
│      tags: []
│      "linked" => 3-element BenchmarkTools.BenchmarkGroup:
│         tags: []
│         "evaluation" => Trial(1.333 ms)
│         "Turing.Essential.ReverseDiffAD{true}()" => Trial(1.752 ms)
│         "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(174.759 ms)
│      "not_linked" => 3-element BenchmarkTools.BenchmarkGroup:
│         tags: []
│         "evaluation" => Trial(1.339 ms)
│         "Turing.Essential.ReverseDiffAD{true}()" => Trial(1.796 ms)
└         "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(169.376 ms)

while on Turing@0.21.13:

┌ Info: Turing.jl
│   run(suite) =
│    2-element BenchmarkTools.BenchmarkGroup:
│      tags: []
│      "linked" => 3-element BenchmarkTools.BenchmarkGroup:
│         tags: []
│         "evaluation" => Trial(554.568 μs)
│         "Turing.Essential.ReverseDiffAD{true}()" => Trial(16.418 ms)
│         "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(140.508 ms)
│      "not_linked" => 3-element BenchmarkTools.BenchmarkGroup:
│         tags: []
│         "evaluation" => Trial(554.415 μs)
│         "Turing.Essential.ReverseDiffAD{true}()" => Trial(16.445 ms)
└         "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(139.849 ms)

Given that evaluation and ForwardDiff are faster in the latter case, it's clearly an "issue" with ReverseDiff, but at the same time this is such a significant performance hit that it makes me a bit uncomfortable to just "leave it in" there :confused:
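
The slowdown can presumably also be reproduced outside of Turing; a rough sketch (the data sizes and the loglik helper are made up for illustration):

using Distributions, ReverseDiff, BenchmarkTools

# Stand-in for the model's likelihood term: vectorized BernoulliLogit logpdf.
loglik(x, y) = sum(logpdf.(BernoulliLogit.(x), y))

y = rand(0:1, 10_000)
x0 = randn(length(y))

# Record and compile a ReverseDiff tape for the likelihood as a function of x only.
tape = ReverseDiff.GradientTape(x -> loglik(x, y), x0)
ctape = ReverseDiff.compile(tape)

grad = similar(x0)
@btime ReverseDiff.gradient!($grad, $ctape, $x0)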

Thoughts? @devmotion

tansongchen commented 1 year ago

Thanks for providing this PR and the suggestions. It seems that handling generic inner types for forward-mode AD (and similarly for reverse mode) more or less involves some SCT (at least some tweaks with Cassette). I will probably first do something with arrays before getting more general...