ericproffitt / TopicModelsVB.jl

A Julia package for variational Bayesian topic modeling.

probability of the topics per document #28

Closed ValeriiBaidin closed 4 years ago

ValeriiBaidin commented 4 years ago

I wonder why you don't have a vector of topic probabilities per document?

ericproffitt commented 4 years ago

So I'm not sure I quite follow.

Is the topicdist function not what you're looking for? e.g.

topicdist(model, 1)

will tell you the topic distribution for document 1 in your corpus.
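It also accepts a range of document indices, e.g.

topicdist(model, 1:4)

will return the distributions for documents 1 through 4.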

ValeriiBaidin commented 4 years ago

Would you give a reference for how you compute it? The result is very confusing: the predictions should be close to one or zero in the following example.


c = TopicModelsVB.Corpus([TopicModelsVB.Document([1,2,3]), TopicModelsVB.Document([1,2,3]),
                          TopicModelsVB.Document([4,5,6]), TopicModelsVB.Document([4,5,6])],
                         vocab = split("1 2 3 4 5 6"))

model = LDA(c, 2)
@time train!(model, tol=0)
topicdist(model, 1)

TextAnalysis.jl gives the same beta matrix, and its predictions are reasonable.


using TextAnalysis

@time crps = TextAnalysis.Corpus([StringDocument("one two three"), StringDocument("five six seven"),
                                  StringDocument("one two three"), StringDocument("five six seven")])

update_lexicon!(crps)
update_inverse_index!(crps)
m = DocumentTermMatrix(crps)
m.column_indices
beta, pr = lda(m, 2, 100, 0.05, 0.05)
ericproffitt commented 4 years ago

So for the LDA model, the topic distributions are the normalizations of the variational parameters gamma.

To obtain the topic distribution for document d, you take,

model.gamma[d] / sum(model.gamma[d])

In LDA, the gamma variables are parameters of Dirichlet distributions, one for each document d. Their normalization is therefore by definition the expected value of the Dirichlet distribution with parameter model.gamma[d]: for θ ~ Dirichlet(γ), E[θ_k] = γ_k / Σ_j γ_j.
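As a quick sanity check (a minimal sketch, using only the names above), the normalized gamma vector should agree with the output of topicdist:

d = 1
model.gamma[d] / sum(model.gamma[d]) ≈ topicdist(model, d)  # expected: true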

In your toy example, because you have so few documents, the regularizing influence of the hyperparameter model.alpha is very large.

The LDA model initializes alpha = ones(K) by default. Because alpha is a Dirichlet parameter, if you manually initialize it with smaller starting values, this will reduce its regularizing influence, and you should get the desired result, e.g.

using Random

Random.seed!(1);

c = TopicModelsVB.Corpus([TopicModelsVB.Document([1,2,3]), TopicModelsVB.Document([1,2,3]), TopicModelsVB.Document([4,5,6]), TopicModelsVB.Document([4,5,6])], vocab = split("1 2 3 4 5 6 "))

model = LDA(c, 2)
model.alpha = [0.01, 0.01]

@time train!(model, tol=0, check_elbo=Inf)

topicdist(model, 1:4)
## 4-element Array{Array{Float64,1},1}:
## [0.99856061107569, 0.0014393889243100439]
## [0.99856061107569, 0.0014393889243100439]
## [0.0014393889243100439, 0.99856061107569]
## [0.0014393889243100439, 0.99856061107569]
ValeriiBaidin commented 4 years ago

> Dirichlet distributions

Thank you so much. I didn't see previously that I could change alpha.

Thank you!!!