chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Termite spectral sort #295

Closed rtbs-dev closed 4 years ago

rtbs-dev commented 4 years ago

Adds a data-frame-based function to create a "Termite Plot" using spectral seriation as outlined in the original paper.

Description

Spectral seriation added as a term-ordering technique, matching the paper by Chuang et al. Doing this required some aggregation and filtering logic that was much more succinct via Pandas.

The new function termite_df_plot is essentially a wrapper around the original draw_termite_plot. It takes a dataframe as input and offloads most of the aggregation and sorting logic to Pandas.
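As a rough illustration of the kind of filtering Pandas makes succinct, here is a minimal sketch that keeps only the top-ranked terms of a term-topic weight table. The helper `top_terms`, its `rank_by` values, and the toy data are assumptions made for this example; they are not the actual termite_df_plot signature or textacy API.

```python
import pandas as pd

# Hypothetical term-topic weights: rows are terms, columns are topics.
components = pd.DataFrame(
    [
        [0.9, 0.1, 0.0],
        [0.2, 0.7, 0.1],
        [0.0, 0.1, 0.8],
        [0.1, 0.1, 0.1],
        [0.6, 0.3, 0.0],
    ],
    index=["engine", "valve", "sensor", "the", "pump"],
    columns=["topic_0", "topic_1", "topic_2"],
)


def top_terms(df: pd.DataFrame, n_terms: int, rank_by: str = "max") -> pd.DataFrame:
    """Keep the n_terms highest-scoring rows (illustrative helper, not textacy API)."""
    if rank_by == "max":
        scores = df.max(axis=1)  # strongest single-topic weight per term
    elif rank_by == "sum":
        scores = df.sum(axis=1)  # total weight across all topics
    else:
        raise ValueError(f"unknown rank_by option: {rank_by!r}")
    # nlargest returns the scores in descending order; reuse its index to filter
    return df.loc[scores.nlargest(n_terms).index]


component_filter = top_terms(components, n_terms=3)
print(component_filter)
```

The filtered frame keeps term names as its index, which is what lets a later sorting step reorder rows by label.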

Seriation

Assuming "seriation" is passed as the term-sorting option, the "magic" is to use the Fiedler vector (the eigenvector belonging to the second-smallest eigenvalue of the graph Laplacian) as an ordering, directly on the top-ranked terms (as determined by the rank_terms_by option). Given a filtered term-topic component_filter dataframe, this looks like:

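import numpy as np

# calculate the term-term similarity matrix, shifted so every entry is non-negative
# (the product is parenthesized so that .pipe applies to it, not to the transpose)
similarity = (
    (component_filter @ component_filter.T)
    .pipe(lambda df: df - df.min().min())
    .values
)
# compute the Laplacian matrix and its eigenvectors
laplacian = np.diag(similarity.sum(axis=1)) - similarity
eigvals, eigvecs = np.linalg.eigh(laplacian)
# eigh already returns eigenvalues in ascending order; compute the
# permutation once so values and vectors cannot get out of sync
order = np.argsort(eigvals)
eigvecs = eigvecs[:, order]
# the Fiedler vector is the 2nd eigenvector
fiedler = eigvecs[:, 1]

# reorder the terms by the permutation that sorts the 2nd eigenvector
component_filter = component_filter.reindex(
    index=component_filter.index[np.argsort(fiedler)]
)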

This is an excerpt from the new function.
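To make the effect concrete, here is a small self-contained sketch of the same idea on toy data. The function name `fiedler_order` and the four-term table are assumptions for illustration only. Two groups of terms are interleaved in the input, and sorting by the Fiedler vector should place each group's terms next to each other. The sign of an eigenvector is arbitrary, so the ordering may come out reversed.

```python
import numpy as np
import pandas as pd


def fiedler_order(weights: pd.DataFrame) -> pd.DataFrame:
    """Reorder rows by the Fiedler vector of their similarity graph (illustrative)."""
    # term-term similarity, shifted so that all entries are non-negative
    sim = (weights @ weights.T).to_numpy()
    sim = sim - sim.min()
    # graph Laplacian: degree matrix minus similarity matrix
    laplacian = np.diag(sim.sum(axis=1)) - sim
    # eigh returns eigenvalues in ascending order for symmetric matrices
    _, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]
    return weights.iloc[np.argsort(fiedler)]


# Hypothetical weights: "engine"/"motor" load on topic_0, "sensor"/"probe" on topic_1,
# deliberately interleaved in the input order.
weights = pd.DataFrame(
    [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]],
    index=["engine", "sensor", "motor", "probe"],
    columns=["topic_0", "topic_1"],
)

ordered = fiedler_order(weights)
print(list(ordered.index))
```

Because related terms end up adjacent, the circles in the resulting termite plot form visible blocks rather than being scattered across rows.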

Aesthetics

Minor changes to the original draw_termite_plot set defaults that are amenable to two-column academic papers (e.g. avoiding overhang with column labels leaning left instead of right).

Motivation and Context

This was written as part of a paper to provide topic model visualizations to an engineering community. Textacy seemed to be one of the only modern libraries to provide easy access to termite plots (which have proven incredibly useful to explain topic models), but the current implementation did not include the key seriation technique that really makes them powerful.

Was waiting to finish the manuscript acceptance process before submitting this code.

How Has This Been Tested?

While producing the figures for this paper, the code was exercised across multiple iterations.

I do not see a test_viz.py, so the pytest suite was skipped. Almost no original functionality was altered.

Screenshots (if appropriate):

Example output using seriation.

termite.pdf


bdewilde commented 4 years ago

Hi @tbsexton , thank you tons for the PR! This is an old part of the code base that I've not given thought to in years; as such, it's going to take me a bit to dig back into it. Apologies in advance for the belated review. 😌

rtbs-dev commented 4 years ago

No worries! I noticed some of the data interfaces were a bit out of sync with the other structures in the package. I'm not tied to dataframes, but they seemed to match well with standard viz practice elsewhere (e.g. seaborn, holoviews) and honestly made the data easier to work with.

Happy to discuss other ways we might implement the spectral sort (i.e. not a wrapper function)...this was just the first pass at something that wasn't going to cause backward-incompatibilities for you!

bdewilde commented 4 years ago

Oh, one other question: How much work would it be to write a couple basic tests? It would be great to know if I accidentally break something when I start mucking around in this part of the code base again...

rtbs-dev commented 4 years ago

@bdewilde re:tests I should be able to. We might have to iterate a bit to scope the testing correctly, and I don't have loads of free time atm. But will get around to an initial pass!

bdewilde commented 4 years ago

Hi @tbsexton , I'm going to merge this in. I've been neglecting textacy while waiting on some advances in thinc and spacy, but I should probably stop dragging my feet. :) Thanks again for the PR!

rtbs-dev commented 4 years ago

My pleasure! I'll think a bit about possible ways to clean up some of the viz code. Honestly, given the scope of textacy, integrating more with something model-based like holoviews could make more advanced, modular viz easier to design around in the long term.