Closed: JHonaker closed this issue 2 years ago.
I just recently started trying to get an LDA model on Turing for a tutorial, and at one point the sampler was quoting me a two day sampling time for a fairly small corpus. @xukai92 is more familiar with this kind of stuff, though --- he might be able to speak better to speed.
Could you provide your sampler call as well? It may help us to diagnose whether there's some sampler settings that can be adjusted.
Sorry, should have added that. It's edited.
In my previous experiments, an uncollapsed LDA implementation is indeed slow. There are a few tricks you can use to make it faster:

1. Use the vectorised syntax for `phi` and `theta` inside the model:

```julia
theta = Matrix{Real}(undef, K, M)
theta ~ [Dirichlet(alpha)]
phi = Matrix{Real}(undef, V, K)
phi ~ [Dirichlet(beta)]
```

2. Marginalize `z` out and increase the log-joint with some matrix operations. I'm pasting my previous code below:

```julia
phi_dot_theta = log.(phi * theta)
logjoint += mapreduce(n -> phi_dot_theta[w[n], doc[n]], +, 1:N)
```

3. Let users get access to `vi`, or provide an interface to increase the log-joint in the model, because there are many tricks users can do to manipulate the log-joint to get a huge performance gain.

Other side notes: the inference time scales with `K`, `D` and `V`, but not `N`, as native Julia code is very fast at evaluating the likelihood. @mohamed82008 do you think the performance gain we obtained before is because ReverseDiff somehow compiles the code and makes it type stable?
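Trick 2 above can be sanity-checked with plain arrays: the matrix product plus a gather gives the same log-joint as explicitly summing over topics for each word. A minimal sketch (toy dimensions invented here, not from the thread):

```julia
# Toy dimensions, chosen only for illustration
K, M, V, N = 2, 3, 5, 10
phi = rand(V, K); phi ./= sum(phi, dims=1)       # V×K word-given-topic probabilities
theta = rand(K, M); theta ./= sum(theta, dims=1) # K×M topic-given-doc probabilities
w = rand(1:V, N)    # word ids
doc = rand(1:M, N)  # doc ids

# Trick 2: one matrix product, then gather one entry per word
phi_dot_theta = log.(phi * theta)
logjoint_fast = mapreduce(n -> phi_dot_theta[w[n], doc[n]], +, 1:N)

# Explicit marginalization of z, word by word
logjoint_slow = sum(log(sum(phi[w[n], k] * theta[k, doc[n]] for k in 1:K)) for n in 1:N)

logjoint_fast ≈ logjoint_slow  # true
```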
Last I checked, ReverseDiff and Zygote were both type stable, Flux AD was not. But I did not benchmark them against each other.
@mohamed82008 I think we should either let users get access to `vi` or provide an interface to increase the log-joint in the model.

I am OK with the first. There is nothing wrong with reserved words; we can make `__vi__` one and use that for the `VarInfo`. @yebai and @willtebbutt, any objections?
So for a collapsed version, if `theta` is D×K and `phi` is K×V, wouldn't this be the same as doing the log-likelihood trick:

```julia
word_dist = theta * phi
for i in 1:N
    w[i] ~ Categorical(word_dist[doc_idx[i], :])
end
```
Yes, this trick certainly works with a bit of tweaking, due to the fact that `Categorical` in Distributions doesn't support AD yet. See a complete example below:
```julia
using Turing
using Distributions

struct CategoricalReal <: Distributions.DiscreteUnivariateDistribution
    p
end

function Distributions.logpdf(d::CategoricalReal, k::Int64)
    return log(d.p[k])
end

@model lda_collapsed(K, V, M, N, w, doc, beta, alpha) = begin
    theta = Matrix{Real}(undef, K, M)
    for m = 1:M
        theta[:, m] ~ Dirichlet(alpha)
    end

    phi = Matrix{Real}(undef, V, K)
    for k = 1:K
        phi[:, k] ~ Dirichlet(beta)
    end

    word_dist = phi * theta
    for n = 1:N
        w[n] ~ CategoricalReal(word_dist[:, doc[n]])
    end
end
```
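A hedged usage sketch for the model above (the toy data sizes here are invented, not from the thread). Note the argument order `K, V, M, N, w, doc, beta, alpha`: `phi`'s columns are `Dirichlet(beta)` draws of length `V`, and `theta`'s columns are `Dirichlet(alpha)` draws of length `K`, so `beta` must have length `V` and `alpha` length `K`:

```julia
# Hypothetical toy inputs, only to show how the model is called
K, V, M = 2, 20, 3
N = 30
w = rand(1:V, N)    # word ids
doc = rand(1:M, N)  # which doc each word belongs to

model = lda_collapsed(K, V, M, N, w, doc, ones(V), ones(K))
chain = sample(model, HMC(0.01, 3), 200)  # the HMC settings mentioned in this thread
```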
NOTE: 200 samples from `HMC(200, 0.01, 3)` with ForwardDiff.jl took ~90s; with the new AdvancedHMC.jl it's ~50s. We'd need to fix Flux.Tracker first so that we can use reverse mode, which I believe will give most of the performance gain.
Maybe re-run the model on the master branch?
cc @trappmartin @htsadiq
Just checked the inference time with the same hyper-parameters as in the original question (`D = 2, K = 2, V = 160, N = 290`), and the current Turing (v0.21.5) is much faster than before. I'm putting a minimal example below to track the performance.
```julia
using Distributions, Turing

Turing.setadbackend(:tracker)

function make_data(D, K, V, N, α, η)
    β = Matrix{Float64}(undef, V, K)
    for k in 1:K
        β[:, k] .= rand(Dirichlet(η))
    end

    θ = Matrix{Float64}(undef, K, D)
    z = Vector{Int}(undef, D * N)
    w = Vector{Int}(undef, D * N)
    doc = Vector{Int}(undef, D * N)
    i = 0
    for d in 1:D
        θ[:, d] .= rand(Dirichlet(α))
        for n in 1:N
            i += 1
            z[i] = rand(Categorical(θ[:, d]))
            w[i] = rand(Categorical(β[:, z[i]]))
            doc[i] = d
        end
    end
    return (D=D, K=K, V=V, N=N, α=α, η=η, z=z, w=w, doc=doc, θ=θ, β=β)
end

data = let D = 2, K = 2, V = 160, N = 290
    make_data(D, K, V, N, ones(K), ones(V))
end

# LDA with vectorization and manual log-density accumulation
@model function LatentDirichletAllocationVectorizedCollapsedManual(
    D, K, V, α, η, w, doc
)
    β ~ filldist(Dirichlet(η), K)
    θ ~ filldist(Dirichlet(α), D)

    log_product = log.(β * θ)
    Turing.@addlogprob! sum(log_product[CartesianIndex.(w, doc)])

    # The above is equivalent to:
    # product = β * θ
    # dist = [Categorical(product[:, i]) for i in 1:D]
    # w ~ arraydist([dist[doc[i]] for i in 1:length(doc)])
end

lda = LatentDirichletAllocationVectorizedCollapsedManual(data.D, data.K, data.V, data.α, data.η, data.w, data.doc)

@time chain = sample(lda, HMC(0.05, 10), 2_000) # => 57s on my laptop
```
which finishes within a minute on my laptop.
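The `CartesianIndex` gather used with `Turing.@addlogprob!` picks out exactly one matrix entry per word. With plain arrays (toy numbers invented here) it behaves like this:

```julia
A = [1.0 2.0;
     3.0 4.0;
     5.0 6.0]   # 3×2 matrix of log-probabilities, say
w = [2, 3, 1]   # row (word) indices
doc = [1, 2, 2] # column (doc) indices

A[CartesianIndex.(w, doc)]       # == [A[2,1], A[3,2], A[1,2]] == [3.0, 6.0, 2.0]
sum(A[CartesianIndex.(w, doc)])  # == 11.0
```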
We may want to track this somewhere (and somehow automatically) but I think this issue can be closed.
Maybe create a benchmarking page on the new website? LDA is a very good case to include I think.
I'm very impressed by the speed increase that's happened over the last three years. I'd forgotten about this, but the fact that the un-collapsed LDA sampler works so well now is a testament to all the great work you all are doing!
I've been trying to write out an LDA model using Turing, but sampling is incredibly slow. I was only using one group from the 20 Newsgroups dataset, and I pared it down to 2 topics and 2 documents just to see if it would sample at a reasonable rate, but it's still incredibly slow.
Can anyone take a look at this, and point me at something I'm doing incorrectly?
A document is transformed from a list of tokens into two vectors, `w` and `d`, like so:

```julia
d1 = [1, 2, 3]
d2 = [2, 1, 1]

w = [1, 2, 3, 2, 1, 1]
d = [1, 1, 1, 2, 2, 2]
```
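That flattening, per-document token lists into parallel `w`/`d` vectors, can be sketched with a hypothetical `flatten_docs` helper (not a function from the thread):

```julia
# Hypothetical helper: flatten tokenized documents into parallel (w, d) vectors
function flatten_docs(docs::Vector{Vector{Int}})
    w = Int[]  # token ids, concatenated across documents
    d = Int[]  # for each token, the index of the document it came from
    for (i, tokens) in enumerate(docs)
        append!(w, tokens)
        append!(d, fill(i, length(tokens)))
    end
    return w, d
end

w, d = flatten_docs([[1, 2, 3], [2, 1, 1]])
# w == [1, 2, 3, 2, 1, 1]
# d == [1, 1, 1, 2, 2, 2]
```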