Closed: JHonaker closed this issue 2 years ago.
I just recently started trying to get an LDA model on Turing for a tutorial, and at one point the sampler was quoting me a two day sampling time for a fairly small corpus. @xukai92 is more familiar with this kind of stuff, though --- he might be able to speak better to speed.
Could you provide your sampler call as well? It may help us to diagnose whether there's some sampler settings that can be adjusted.
Sorry, should have added that. It's edited.
In my previous experiments, an uncollapsed LDA implementation is indeed slow. There are a few tricks you can use to make it faster:

1. Use the vectorised syntax for `phi` and `theta` inside the model:

```julia
theta = Matrix{Real}(undef, K, M)
theta ~ [Dirichlet(alpha)]
phi = Matrix{Real}(undef, V, K)
phi ~ [Dirichlet(beta)]
```

2. Marginalize `z` out and increase the log-joint with some matrix operations. I'm pasting my previous code below:

```julia
phi_dot_theta = log.(phi * theta)
logjoint += mapreduce(n -> phi_dot_theta[w[n], doc[n]], +, 1:N)
```

3. Let users get access to `vi`, or provide an interface to increase the log-joint in the model, because there are many tricks users can do to manipulate the log-joint to get a huge performance gain.

Other side notes: the inference time scales with `K`, `D` and `V`, but not `N`, as native Julia code is very fast at evaluating the likelihood. @mohamed82008 do you think the performance gain we obtained before is because ReverseDiff somehow compiles the code and makes it type stable?
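Trick 2 above can be sanity-checked with plain arrays: the matrix product plus a gather gives the same log-joint as explicitly summing over topics for each word. A minimal sketch (toy dimensions invented here, not from the thread):

```julia
# Toy dimensions, chosen only for illustration
K, M, V, N = 2, 3, 5, 10
phi = rand(V, K); phi ./= sum(phi, dims=1)       # V×K word-given-topic probabilities
theta = rand(K, M); theta ./= sum(theta, dims=1) # K×M topic-given-doc probabilities
w = rand(1:V, N)    # word ids
doc = rand(1:M, N)  # doc ids

# Trick 2: one matrix product, then gather one entry per word
phi_dot_theta = log.(phi * theta)
logjoint_fast = mapreduce(n -> phi_dot_theta[w[n], doc[n]], +, 1:N)

# Explicit marginalization of z, word by word
logjoint_slow = sum(log(sum(phi[w[n], k] * theta[k, doc[n]] for k in 1:K)) for n in 1:N)

logjoint_fast ≈ logjoint_slow  # true
```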
Last I checked, ReverseDiff and Zygote were both type stable, Flux AD was not. But I did not benchmark them against each other.
@mohamed82008 I think we should either let users get access to `vi` or provide an interface to increase the log-joint in the model.

I am OK with the first. There is nothing wrong with reserved words; we can make `__vi__` one and use that for the `VarInfo`. @yebai and @willtebbutt, any objections?
So for a collapsed version, if `theta` is D×K and `phi` is K×V, wouldn't this be the same as doing the log-likelihood trick:

```julia
word_dist = theta * phi
for i in 1:N
    w[i] ~ Categorical(word_dist[doc_idx[i], :])
end
```
Yes, this trick certainly works with a bit of tweaking, due to the fact that `Categorical` in Distributions doesn't support AD yet. See a complete example below:
```julia
using Turing
using Distributions

struct CategoricalReal <: Distributions.DiscreteUnivariateDistribution
    p
end

function Distributions.logpdf(d::CategoricalReal, k::Int64)
    return log(d.p[k])
end

@model lda_collapsed(K, V, M, N, w, doc, beta, alpha) = begin
    theta = Matrix{Real}(undef, K, M)
    for m = 1:M
        theta[:, m] ~ Dirichlet(alpha)
    end

    phi = Matrix{Real}(undef, V, K)
    for k = 1:K
        phi[:, k] ~ Dirichlet(beta)
    end

    word_dist = phi * theta
    for n = 1:N
        w[n] ~ CategoricalReal(word_dist[:, doc[n]])
    end
end
```
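A hedged usage sketch for the model above (the toy data sizes here are invented, not from the thread). Note the argument order `K, V, M, N, w, doc, beta, alpha`: `phi`'s columns are `Dirichlet(beta)` draws of length `V`, and `theta`'s columns are `Dirichlet(alpha)` draws of length `K`, so `beta` must have length `V` and `alpha` length `K`:

```julia
# Hypothetical toy inputs, only to show how the model is called
K, V, M = 2, 20, 3
N = 30
w = rand(1:V, N)    # word ids
doc = rand(1:M, N)  # which doc each word belongs to

model = lda_collapsed(K, V, M, N, w, doc, ones(V), ones(K))
chain = sample(model, HMC(0.01, 3), 200)  # the HMC settings mentioned in this thread
```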
NOTE: 200 samples from `HMC(200, 0.01, 3)` with ForwardDiff.jl took ~90s; with the new AdvancedHMC.jl it's ~50s. We'd need to fix Flux.Tracker first so that we can use reverse mode, which I believe will give most of the performance gain.
Maybe re-run the model on the master branch?
cc @trappmartin @htsadiq
Just checked the inference time with the same hyper-parameters as in the original question (`D = 2, K = 2, V = 160, N = 290`), and the current Turing (v0.21.5) is much faster than before. I'm putting a minimal example below to track the performance.
```julia
using Distributions, Turing

Turing.setadbackend(:tracker)

function make_data(D, K, V, N, α, η)
    β = Matrix{Float64}(undef, V, K)
    for k in 1:K
        β[:, k] .= rand(Dirichlet(η))
    end

    θ = Matrix{Float64}(undef, K, D)
    z = Vector{Int}(undef, D * N)
    w = Vector{Int}(undef, D * N)
    doc = Vector{Int}(undef, D * N)
    i = 0
    for d in 1:D
        θ[:, d] .= rand(Dirichlet(α))
        for n in 1:N
            i += 1
            z[i] = rand(Categorical(θ[:, d]))
            w[i] = rand(Categorical(β[:, z[i]]))
            doc[i] = d
        end
    end
    return (D=D, K=K, V=V, N=N, α=α, η=η, z=z, w=w, doc=doc, θ=θ, β=β)
end

data = let D = 2, K = 2, V = 160, N = 290
    make_data(D, K, V, N, ones(K), ones(V))
end

# LDA with vectorization and manual log-density accumulation
@model function LatentDirichletAllocationVectorizedCollapsedManual(
    D, K, V, α, η, w, doc
)
    β ~ filldist(Dirichlet(η), K)
    θ ~ filldist(Dirichlet(α), D)

    log_product = log.(β * θ)
    Turing.@addlogprob! sum(log_product[CartesianIndex.(w, doc)])

    # The above is equivalent to:
    # product = β * θ
    # dist = [Categorical(product[:, i]) for i in 1:D]
    # w ~ arraydist([dist[doc[i]] for i in 1:length(doc)])
end

lda = LatentDirichletAllocationVectorizedCollapsedManual(data.D, data.K, data.V, data.α, data.η, data.w, data.doc)

@time chain = sample(lda, HMC(0.05, 10), 2_000) # => 57s on my laptop
```
which finishes within a minute on my laptop.
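The `CartesianIndex` gather used with `Turing.@addlogprob!` picks out exactly one matrix entry per word. With plain arrays (toy numbers invented here) it behaves like this:

```julia
A = [1.0 2.0;
     3.0 4.0;
     5.0 6.0]   # 3×2 matrix of log-probabilities, say
w = [2, 3, 1]   # row (word) indices
doc = [1, 2, 2] # column (doc) indices

A[CartesianIndex.(w, doc)]       # == [A[2,1], A[3,2], A[1,2]] == [3.0, 6.0, 2.0]
sum(A[CartesianIndex.(w, doc)])  # == 11.0
```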
We may want to track this somewhere (and somehow automatically) but I think this issue can be closed.
Maybe create a benchmarking page on the new website? LDA is a very good case to include I think.
I'm very impressed by the speed increase that's happened over the last three years. I'd forgotten about this, but the fact that the un-collapsed LDA sampler works so well now is a testament to all the great work you all are doing!
I've been trying to write out an LDA model using Turing, but sampling is incredibly slow. I was only using one group from the 20 Newsgroups dataset, and I pared it down to 2 topics and 2 documents just to see if it would sample at a reasonable rate, but it's still incredibly slow.
Can anyone take a look at this, and point me at something I'm doing incorrectly?
A document is transformed from a list of tokens into two vectors, `w` and `d`, like so:

```julia
d1 = [1, 2, 3]
d2 = [2, 1, 1]

w = [1, 2, 3, 2, 1, 1]
d = [1, 1, 1, 2, 2, 2]
```
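That flattening, per-document token lists into parallel `w`/`d` vectors, can be sketched with a hypothetical `flatten_docs` helper (not a function from the thread):

```julia
# Hypothetical helper: flatten tokenized documents into parallel (w, d) vectors
function flatten_docs(docs::Vector{Vector{Int}})
    w = Int[]  # token ids, concatenated across documents
    d = Int[]  # for each token, the index of the document it came from
    for (i, tokens) in enumerate(docs)
        append!(w, tokens)
        append!(d, fill(i, length(tokens)))
    end
    return w, d
end

w, d = flatten_docs([[1, 2, 3], [2, 1, 1]])
# w == [1, 2, 3, 2, 1, 1]
# d == [1, 1, 1, 2, 2, 2]
```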