more, better, and interactive(?) data viz

bdewilde commented 8 years ago

textacy currently has two visualizations: draw_semantic_network() for visualizing documents as networks of terms with edges given by, say, term co-occurrence; and draw_termite_plot() for visualizing the relationship between topics and terms in a topic model. Both of these could be improved!
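To make "edges given by term co-occurrence" concrete, here is a minimal standard-library sketch of how such edge weights might be counted. It is not textacy's actual implementation: the `cooccurrence_edges` helper, its `window_size` parameter, and the toy term list are all hypothetical and used only for illustration.

```python
from collections import Counter


def cooccurrence_edges(terms, window_size=2):
    """Count how often two distinct terms fall inside the same sliding window."""
    edges = Counter()
    # Slide a fixed-size window over the term sequence (at least one window).
    for start in range(max(len(terms) - window_size + 1, 1)):
        window = terms[start:start + window_size]
        for i, first in enumerate(window):
            for second in window[i + 1:]:
                if first != second:
                    # Sort the pair so that (a, b) and (b, a) share one undirected edge.
                    edges[tuple(sorted((first, second)))] += 1
    return edges


# Hypothetical toy document, already tokenized into terms.
terms = ["topic", "model", "topic", "term"]
edges = cooccurrence_edges(terms, window_size=2)
```

Each key of `edges` is an undirected term pair and each value is a candidate edge weight, which could then be handed to a graph library for drawing.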

There are also tons of other visualizations that textacy users could benefit from:

- pyLDAvis for visualizing various aspects of topic models interactively
- word clouds to show word (or, generically, term) counts
- word trees to show word sequences
- parallel tag clouds to show differences in key terms over time or between groups
- stream graphs for showing trends over time in, say, topic prevalence or word usage
- dependency parsing viz à la displaCy
- compareclouds for visualizing media frames

I should stop listing these out and just point people to this site, which contains tons of possibilities.

implementation in `textacy`

- Python-only, without a bunch of extra dependencies (preferred)
- easy interoperability with relevant classes / functions
- what else...?

paul-english commented 7 years ago

pyLDAvis is pretty simple w.r.t. its input. I came up with the following for the `prepare` method:

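import pyLDAvis
import textacy

# Assumes `doc_term_matrix`, `id2term`, and `documents` (tokenized docs)
# already exist from an earlier vectorization step.
model = textacy.tm.TopicModel('lda', n_topics=30)
model.fit(doc_term_matrix)
doc_topic_matrix = model.transform(doc_term_matrix)

# Topic-term weights from the underlying scikit-learn estimator
top_term_matrix = model.model.components_
doc_lengths = [len(d) for d in documents]
# Order the vocabulary by term id so it lines up with the matrix columns
# (assumes the ids run from 0 to len(id2term) - 1)
vocab = [id2term[i] for i in range(len(id2term))]
term_frequency = textacy.vsm.get_term_freqs(doc_term_matrix)

vis_data = pyLDAvis.prepare(
    top_term_matrix,
    doc_topic_matrix,
    doc_lengths,
    vocab,
    term_frequency,
)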

One caveat: pyLDAvis asserts that every row of the document-topic matrix sums to one. That holds for LDA, but I noticed that NMF doesn't normalize its output this way; I don't know about LSA.
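One possible workaround for the NMF case is to rescale each row of the document-topic matrix before handing it to pyLDAvis. The following is only a sketch under assumptions: `normalize_rows` is a hypothetical helper (not part of textacy or pyLDAvis), the toy matrix stands in for real NMF output, and it assumes non-negative weights, so it would not directly apply to LSA output, which can contain negative values.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale each row of a non-negative matrix so that it sums to one."""
    matrix = np.asarray(matrix, dtype=float)
    row_sums = matrix.sum(axis=1, keepdims=True)
    # Guard against all-zero rows: dividing by 1 leaves them as zeros
    # instead of producing NaNs.
    row_sums[row_sums == 0] = 1.0
    return matrix / row_sums


# Toy stand-in for an NMF doc-topic matrix (3 documents x 2 topics).
doc_topic_matrix = np.array([[0.2, 0.6], [1.5, 0.5], [0.0, 3.0]])
doc_topic_dists = normalize_rows(doc_topic_matrix)
```

Note that an all-zero row still would not sum to one after this step, so documents with no weight on any topic would need to be dropped or handled separately.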

rebeccabilbro commented 7 years ago

Hello @bdewilde - we've been working on a machine learning visualization library called Yellowbrick, to provide custom Matplotlib visualizers for Scikit-Learn estimators. The project is still young, but is growing, and we've recently added a few new features for visualization to support modeling on text. We're big fans of your work and we think the list of ideas in this issue is very interesting. Not sure if you're still interested in pursuing the text viz stuff or have moved on to other things, but let us know if you have any additional thoughts or suggestions!
