Closed · ValeriiBaidin closed this issue 4 years ago
So I'm not sure I quite follow. Is the topicdist function not what you're looking for? e.g.
topicdist(model, 1)
will tell you the topic distribution for document 1 in your corpus.
Would you give a reference for how you compute it? The result is very confusing: the predictions should be close to one or zero in the following example.
c = TopicModelsVB.Corpus([TopicModelsVB.Document([1,2,3]), TopicModelsVB.Document([1,2,3]),
                          TopicModelsVB.Document([4,5,6]), TopicModelsVB.Document([4,5,6])],
                         vocab = split("1 2 3 4 5 6"))
model = LDA(c, 2)
@time train!(model, tol=0)
topicdist(model, 1)
TextAnalysis.jl gives the same beta matrix, and its predictions are reasonable:
using TextAnalysis
@time crps = TextAnalysis.Corpus([StringDocument("one two three"), StringDocument("five six seven"),
                                  StringDocument("one two three"), StringDocument("five six seven")])
update_lexicon!(crps)
update_inverse_index!(crps)
m = DocumentTermMatrix(crps)
m.column_indices
beta, pr = lda(m, 2, 100, 0.05, 0.05)
So for the LDA model, the topic distributions are the normalizations of the variational parameters gamma. To obtain the topic distribution for document d, you take
model.gamma[d] / sum(model.gamma[d])
In LDA, the gamma variables are parameters for Dirichlet distributions, one gamma for each document d. Thus their normalization is by definition the expected value of the Dirichlet distribution with parameter model.gamma[d].
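As a language-agnostic illustration of that normalization (sketched here in Python, with made-up gamma values — not the package's actual API), the expected value of a Dirichlet(gamma) distribution is just gamma divided by its total:

```python
# Hypothetical variational parameters gamma for one document with K = 2 topics.
gamma_d = [3.2, 0.8]

# The mean of a Dirichlet(gamma) distribution is gamma / sum(gamma),
# which is exactly the per-document topic distribution described above.
total = sum(gamma_d)
topic_dist = [g / total for g in gamma_d]

print(topic_dist)  # [0.8, 0.2] -- proportions over the K topics, summing to 1
```

The same one-liner is what model.gamma[d] / sum(model.gamma[d]) computes in Julia.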
In your toy example, because you have so few documents, the regularizing influence of the hyperparameter model.alpha is very large. The LDA model initializes alpha = ones(K) by default. Because alpha is a Dirichlet parameter, manually initializing it with smaller starting values will reduce its regularizing influence, and you should get the desired result, e.g.
using Random
Random.seed!(1);
c = TopicModelsVB.Corpus([TopicModelsVB.Document([1,2,3]), TopicModelsVB.Document([1,2,3]), TopicModelsVB.Document([4,5,6]), TopicModelsVB.Document([4,5,6])], vocab = split("1 2 3 4 5 6 "))
model = LDA(c, 2)
model.alpha = [0.01, 0.01]
@time train!(model, tol=0, check_elbo=Inf)
topicdist(model, 1:4)
## 4-element Array{Array{Float64,1},1}:
## [0.99856061107569, 0.0014393889243100439]
## [0.99856061107569, 0.0014393889243100439]
## [0.0014393889243100439, 0.99856061107569]
## [0.0014393889243100439, 0.99856061107569]
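A rough way to see why the smaller alpha sharpens the result (a Python sketch of the idea, not the package's actual update equations): in variational LDA, gamma for a document is approximately alpha plus that document's expected topic counts, so alpha acts as a pseudocount added to every topic. With only a 3-word document, alpha = 1 per topic is a large fraction of the total mass; alpha = 0.01 is negligible:

```python
# Sketch: gamma ~= alpha + expected topic counts, so the expected topic
# distribution is (alpha + counts) normalized. Values below are illustrative.
def expected_topic_dist(alpha, counts):
    gamma = [a + n for a, n in zip(alpha, counts)]
    total = sum(gamma)
    return [g / total for g in gamma]

counts = [3.0, 0.0]  # a 3-word document whose words all favor topic 1

print(expected_topic_dist([1.0, 1.0], counts))    # default alpha: [0.8, 0.2]
print(expected_topic_dist([0.01, 0.01], counts))  # small alpha: roughly [0.997, 0.003]
```

This matches the behavior above: shrinking alpha pushes topicdist toward the near-0/1 distributions you expected.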
Thank you so much. I hadn't realized that I could change alpha.
Thank you!!!
I wonder, why don't you have a vector of topic probabilities per document?