bmabey / pyLDAvis

Python library for interactive topic model visualization. Port of the R LDAvis package.
BSD 3-Clause "New" or "Revised" License

Marginal topic distributions as table or print #110

Open renswilderom opened 7 years ago

renswilderom commented 7 years ago

I'm a very satisfied user of pyLDAvis, but I have two questions about it. First, how can I get the marginal topic distribution as a table or as printed output, rather than only plotted in the visualization? I also opened a Stack Overflow question about this: https://stackoverflow.com/questions/47584139/scikit-learn-ldavis-retrieve-marginal-topic-distribution-or-compute-comparable

Second, is it possible to associate the marginal topic distribution with the topics as they are ordered in the conventional scikit-learn output, for example when printed with the code below? Or can I order these topics so that they match the topic order in LDAvis?

n_top_words = 30

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

# On scikit-learn >= 1.2 use get_feature_names_out(); get_feature_names() was removed.
tf_feature_names = tf_vectorizer_original.get_feature_names()
print_top_words(lda_tf_original, tf_feature_names, n_top_words)

I hope the issue section of this GitHub page is the right place to post this. Many thanks in advance for your reply.

abhiramnarla commented 6 years ago

Have you figured it out?

vikasFid commented 6 years ago

Has anyone figured this out?

renswilderom commented 6 years ago

No, unfortunately not. Coincidentally, I found that from the document-topic matrix you can compute the mean proportion of each topic across all documents (see code below). The resulting table shows how prevalent each topic is. I also compared the LDAvis ranking with this 'mean proportion' table, and they are roughly similar, though not identical.

Here is the code:

# df: document-topic matrix as a pandas DataFrame (rows = documents, columns = topics)
df_1 = df.describe().loc[['mean']]
df_2 = df_1.transpose()

And here is the table (where topic 0 is the largest topic, topic 8 the second largest, and so on):

[Image: table of mean topic proportions]

Let me know if you have problems computing the document-topic matrix from the original LDA model (I can share the code for that too).
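One possible reason the two rankings are close but not identical: the mean from describe() weights every document equally, whereas, as far as I can tell from the pyLDAvis prepare step, the marginal topic distribution is weighted by document length. The sketch below compares the two on plain Python lists; `doc_topic` and `doc_lengths` are invented illustration values, not output from any real model.

```python
# Toy document-topic matrix: one row per document, one column per topic.
# Each row sums to 1. These values are invented for illustration.
doc_topic = [
    [0.9, 0.1],   # long document, mostly topic 0
    [0.2, 0.8],   # short document, mostly topic 1
    [0.1, 0.9],   # short document, mostly topic 1
]
doc_lengths = [1000, 50, 50]  # token counts per document (also invented)

n_topics = len(doc_topic[0])

# Unweighted mean: every document counts equally (what df.describe() reports).
mean_proportion = [
    sum(row[k] for row in doc_topic) / len(doc_topic)
    for k in range(n_topics)
]

# Length-weighted marginal: each document counts in proportion to its size.
topic_freq = [
    sum(row[k] * n for row, n in zip(doc_topic, doc_lengths))
    for k in range(n_topics)
]
total = sum(topic_freq)
marginal = [f / total for f in topic_freq]

for k in range(n_topics):
    print(f"Topic {k}: mean={mean_proportion[k]:.3f}  weighted={marginal[k]:.3f}")
```

In this toy case the two measures even rank the topics differently, because the single long document dominates the weighted version.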

vikasFid commented 6 years ago

Thank you for this! I have a follow-up question about the term-topic distribution. In the pyLDAvis visualization, hovering over a term shows the topics in which that term is significant. I want to use this list of topics for further analysis, but I am not sure how to get it. Please help!

renswilderom commented 6 years ago

I'm not sure I understand you correctly, but the code in my original question gives you the list of topics as shown in the LDAvis model. The content of the topics is the same; only the order of the topics (the topic numbers) does not correspond, but this doesn't cause major problems.

From this list of topics you could also extract which terms are present in which topics.

vikasFid commented 6 years ago

I was able to resolve the topic order issue from your original question using new.order = RJSONIO::fromJSON(json)$topic.order, and then reordered the LDA model's Beta and Gamma values according to the new topic order. With this sorting I got the exact order of terms for a given lambda as displayed in pyLDAvis. As a next step, I want to know the dominant topics for a given term. In the pyLDAvis visualization, hovering over a term highlights some topics (the bubble areas change according to some metric). I would like to know how that is calculated and how I can get that list of topics for a given term.
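The same reordering can be sketched in Python. This assumes the topic order is a list of 1-based original topic numbers, as R's indexing implies; the prepared pyLDAvis object appears to expose a similar `topic_order` field, but check that against your installed version. `reorder_topics` is a hypothetical helper, and the example data is invented.

```python
def reorder_topics(rows, topic_order):
    """Return rows rearranged so that position i holds the topic LDAvis shows as i + 1.

    rows        -- per-topic data in the model's own order (e.g. rows of components_)
    topic_order -- 1-based original topic numbers in LDAvis display order (assumed)
    """
    return [rows[original - 1] for original in topic_order]


# Invented example: three topics in model order, each with a few top words.
model_topics = [["cat", "dog"], ["tax", "bank"], ["goal", "match"]]
topic_order = [3, 1, 2]  # LDAvis topic 1 is model topic 3, and so on

for shown, words in enumerate(reorder_topics(model_topics, topic_order), start=1):
    print(f"LDAvis topic {shown}: {' '.join(words)}")
```

Passing sort_topics=False to prepare, where available, should avoid the need for this step by keeping the model's own order.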

renswilderom commented 6 years ago

Yeah, I see what you want. I don't know how this calculation is made either. I assume there is a table somewhere showing how much each term contributes to each topic. Perhaps you can find this in the scikit-learn documentation.

Thanks btw for the topic order solution! I will check if it also works for me
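One way to get the topics that matter for a given term, without knowing the exact LDAvis calculation, is Bayes' rule: P(topic | term) is proportional to P(term | topic) * P(topic). The sketch below is that generic computation, not a confirmed copy of what pyLDAvis draws on hover. `topics_given_term` is a hypothetical helper, and `topic_term` and `topic_marginal` hold invented values.

```python
def topics_given_term(topic_term, topic_marginal, term_index):
    """P(topic | term) via Bayes' rule, returned as a list over topics."""
    joint = [
        row[term_index] * p_topic
        for row, p_topic in zip(topic_term, topic_marginal)
    ]
    total = sum(joint)
    return [j / total for j in joint]


# Invented topic-term probabilities: each row is P(term | topic) and sums to 1.
vocab = ["bank", "river", "loan"]
topic_term = [
    [0.5, 0.1, 0.4],   # topic 0: finance
    [0.3, 0.6, 0.1],   # topic 1: geography
]
topic_marginal = [0.5, 0.5]  # P(topic), e.g. the marginal topic distribution

shares = topics_given_term(topic_term, topic_marginal, vocab.index("bank"))
for k, share in sorted(enumerate(shares), key=lambda pair: -pair[1]):
    print(f"Topic {k}: {share:.3f}")
```

With scikit-learn, the rows of components_ are unnormalised pseudo-counts, so divide each row by its sum before using it as P(term | topic).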

venky2k11 commented 3 years ago

I tried the code below, and it seems to give the frequencies that match the topic bubble sizes:

pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=False, mds='mmds').to_dict()['mdsDat']['Freq']
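To turn that output into the table asked for in the original question, something like the following might work. `prepared_dict` is a hand-written stand-in for the result of `.to_dict()`; I am assuming `mdsDat` holds parallel lists under `topics` and `Freq`, and that `Freq` is expressed in percent. The numbers are invented.

```python
# Stand-in for prepare(...).to_dict(); structure assumed, numbers invented.
prepared_dict = {
    "mdsDat": {
        "topics": [1, 2, 3],
        "Freq": [52.5, 30.0, 17.5],
    }
}

mds = prepared_dict["mdsDat"]

# Pair each topic number with its frequency and sort from largest to smallest.
table = sorted(zip(mds["topics"], mds["Freq"]), key=lambda row: -row[1])

print(f"{'Topic':>5}  {'Freq (%)':>8}")
for topic, freq in table:
    print(f"{topic:>5}  {freq:>8.2f}")
```

With sort_topics=False, as in the call above, the topic numbers should follow the model's own order, which makes it easier to match the frequencies to the printed top words.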

renswilderom commented 3 years ago

@venky2k11 Thanks for this! I'm going to check it out as soon as I run a topic model again, and I'll let you know how it works for me.