ericproffitt / TopicModelsVB.jl

A Julia package for variational Bayesian topic modeling.

Very strange result #22

Closed ValeriiBaidin closed 4 years ago

ValeriiBaidin commented 4 years ago

I am sorry to bother you, I've just checked a very simple example. The result seems strange.

c= Corpus([Document([1,2,3]),Document([1,2,3]),Document([1,2,3]),Document([1,2,3]),
            Document([4,5,6]),Document([4,5,6]),Document([4,5,6]),Document([4,5,6]),
            Document([7,7,7]),Document([7,7,7]),Document([7,7,7]),Document([7,7,7])],
            vocab = split("1 2 3 4 5 6 7"))
model = LDA(c, 2)
@time train!(model, tol=0)
showtopics(model, cols=2, 3)

The Result is

topic 1    topic 2
3          7
2          6
1          5

It is strange that topic 2 doesn't contain 4 but contains 7.

Would you check whether this is correct?

Thank you in advance.

P.S. Have you compared your results with other implementations?

P.P.S. Thank you so much for your code.

ericproffitt commented 4 years ago

Hi Valerii,

So if I understand your question correctly, you're asking why the topics don't contain all the numbers.

Each topic contains a ranking of all the terms in the vocabulary. The call showtopics(model, cols=2, 3) shows only the top three terms for each topic. If you would like to view the full term ranking for each topic, you can write,

showtopics(model, cols=2, 7)

As for comparisons with other topic modeling packages, I have not made any; however, the implementations are standard coordinate-ascent variational inference. The original algorithms may be found in the bibliography.

ValeriiBaidin commented 4 years ago

From the data, there are two topics: (1,2,3) and (4,5,6). I don't understand why topic 2 is (7,6,5).

ericproffitt commented 4 years ago

Ah, I see what you're asking. Strictly speaking, your corpus has three topics, not two: (1,2,3), (4,5,6), and (7).

So topic 2 ends up having to merge the (4,5,6) and (7) topics, and 7 ranks above 4, 5, and 6, probably because 7 occurs most frequently in the corpus.

If you try setting model = LDA(c, 3), you may obtain more sensible results. However, even with three topics, the algorithm may get trapped in a poor local optimum, depending on how the topic weights are randomly initialized.
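Putting that suggestion together with the corpus from the original post, a minimal sketch (assuming TopicModelsVB.jl is installed, and that it draws its random initialization from Julia's global RNG, so `Random.seed!` makes a run repeatable; only the calls already shown in this thread are used):

```julia
using Random
using TopicModelsVB

Random.seed!(1)  # assumed to fix the random topic-weight initialization

c = Corpus([Document([1,2,3]), Document([1,2,3]), Document([1,2,3]), Document([1,2,3]),
            Document([4,5,6]), Document([4,5,6]), Document([4,5,6]), Document([4,5,6]),
            Document([7,7,7]), Document([7,7,7]), Document([7,7,7]), Document([7,7,7])],
           vocab=split("1 2 3 4 5 6 7"))

# Three topics matches the block structure of the corpus: (1,2,3), (4,5,6), (7).
model = LDA(c, 3)
train!(model, tol=0)

# Show the full ranking of all seven vocabulary terms for each topic.
showtopics(model, cols=3, 7)
```

Since the fit can still land in a poor local optimum, re-running the sketch with a few different seeds and comparing the resulting topics is a cheap way to check whether a given run is representative.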