davidmcclure / textplot

(Mental) maps of texts with kernel density estimation and force-directed networks.
MIT License

Textplot

[Figure: War and Peace]

Textplot is a little program that converts a document into a network of terms, with the goal of teasing out information about the high-level topic structure of the text. For each term:

  1. Get the set of offsets in the document where the term appears.

  2. Using kernel density estimation, compute a probability density function (PDF) that represents the word's distribution across the document. Eg, from War and Peace:

    [Figure: War and Peace]

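  3. Compute a Bray-Curtis dissimilarity between the term's PDF and the PDFs of all other terms in the document, and turn it into a similarity score. Judging by the listing in step 4, where "napoleon" scores 1.0 against itself, the score is 1 minus the dissimilarity, so higher means more overlap. This measures the extent to which two words appear in the same locations.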

  4. Sort this list in descending order to get a custom "topic" for the term. Skim off the top N words (usually 10-20) to get the strongest links. Here's "napoleon":

    [('napoleon', 1.0),
    ('war', 0.65319871313854128),
    ('military', 0.64782349297012154),
    ('men', 0.63958189887106576),
    ('order', 0.63636730075877446),
    ('general', 0.62621616907584432),
    ('russia', 0.62233286026418089),
    ('king', 0.61854160459241103),
    ('single', 0.61630514751638699),
    ('killed', 0.61262010905310182),
    ('peace', 0.60775702746632576),
    ('contrary', 0.60750138486684579),
    ('number', 0.59936009740377516),
    ('accompanied', 0.59748552019874168),
    ('clear', 0.59661288775164523),
    ('force', 0.59657370362505935),
    ('army', 0.59584331507492383),
    ('authority', 0.59523854206807647),
    ('troops', 0.59293965397478188),
    ('russian', 0.59077308177196441)]

  5. Shovel all of these links into a network and export a GML file.
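Steps 1-4 can be sketched in a few lines of Python. This is a minimal illustration, not Textplot's actual implementation: the function names, the regex tokenizer, the bandwidth, and the number of sample points are all assumptions made for the example, and the toy text stands in for a real document. It uses scikit-learn's KernelDensity and scipy's braycurtis.

```python
import re
from collections import defaultdict

import numpy as np
from scipy.spatial.distance import braycurtis
from sklearn.neighbors import KernelDensity


def term_offsets(text):
    """Step 1: map each term to the token offsets where it appears."""
    tokens = re.findall(r"[a-z]+", text.lower())  # naive tokenizer (assumption)
    offsets = defaultdict(list)
    for i, token in enumerate(tokens):
        offsets[token].append(i)
    return offsets, len(tokens)


def term_pdf(offsets, n_tokens, bandwidth=5.0, samples=200):
    """Step 2: estimate the term's density across the document with a Gaussian KDE."""
    points = np.array(offsets, dtype=float).reshape(-1, 1)
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(points)
    grid = np.linspace(0, n_tokens, samples).reshape(-1, 1)
    pdf = np.exp(kde.score_samples(grid))  # score_samples returns log-density
    return pdf / pdf.sum()  # normalize so the sampled values sum to 1


def similarity(pdf_a, pdf_b):
    """Step 3: 1 - Bray-Curtis dissimilarity, so 1.0 means identical distributions."""
    return 1.0 - braycurtis(pdf_a, pdf_b)


def topic(term, pdfs, n=10):
    """Step 4: rank every term by similarity to `term` and keep the top n."""
    scores = [(other, similarity(pdfs[term], pdf)) for other, pdf in pdfs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


# Toy document: military words in the first part, ballroom words in the second.
text = "war army troops general " * 20 + "ball dance music " * 20
offsets, n_tokens = term_offsets(text)
pdfs = {term: term_pdf(positions, n_tokens) for term, positions in offsets.items()}

for term, score in topic("war", pdfs, n=3):
    print(term, round(float(score), 3))
```

Words that cluster in the same stretch of the text end up with nearly overlapping densities and score close to 1, while words from a different stretch score much lower.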

Generating graphs

There are two ways to create graphs - you can use the textplot executable from the command line, or, if you want to tinker around with the underlying NetworkX graph instance, you can fire up a Python shell and use the build_graph() helper directly.

Either way, first install Textplot. From PyPI:

pip install textplot

Or, clone the repo and install the package manually:

python3 -m venv env
. env/bin/activate
pip install -r requirements.txt
python setup.py install

From the command line

Then, from the command line, generate graphs with:

textplot generate [IN_PATH] [OUT_PATH] [OPTIONS]

Where the input is a regular .txt file, and the output is a .gml file. So, if you're working with War and Peace:

textplot generate war-and-peace.txt war-and-peace.gml
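Once a .gml file exists, it can be loaded back into Python with NetworkX for inspection. The sketch below writes a tiny stand-in graph to a temporary file first so that it runs on its own; the node names and the "weight" edge attribute are assumptions for the example, not a description of what Textplot writes.

```python
import os
import tempfile

import networkx as nx

# Stand-in for a file produced by the generate command (hypothetical contents).
demo = nx.Graph()
demo.add_edge("napoleon", "war", weight=0.65)
demo.add_edge("napoleon", "military", weight=0.64)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.gml")
    nx.write_gml(demo, path)
    # With real output, pass the path of your own .gml file here instead.
    graph = nx.read_gml(path)

print(graph.number_of_nodes(), graph.number_of_edges())
print(sorted(graph.neighbors("napoleon")))
```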

The generate command takes these options:

From a Python shell

Or, fire up a Python shell and import build_graph() directly:

In [1]: from textplot.helpers import build_graph

In [2]: g = build_graph('war-and-peace.txt')

Tokenizing text...
Extracted 573064 tokens

Indexing terms:
[################################] 124750/124750 - 00:00:06

Generating graph:
[################################] 500/500 - 00:00:03

build_graph() returns an instance of textplot.graphs.Skimmer, which gives access to an instance of networkx.Graph. Eg, to get degree centralities:

In [3]: import networkx as nx
In [4]: nx.degree_centrality(g.graph)
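nx.degree_centrality() returns a dict mapping each node to its score, so a common next step is to sort it and look at the best-connected terms. A minimal sketch, using a small hand-built graph as a stand-in for g.graph (the edges here are invented for illustration):

```python
import networkx as nx

# Stand-in for g.graph; with a real Skimmer instance, use g.graph instead.
graph = nx.Graph()
graph.add_edges_from([
    ("napoleon", "war"),
    ("napoleon", "army"),
    ("napoleon", "russia"),
    ("war", "army"),
    ("ball", "dance"),
])

# Degree centrality is a node's degree divided by (number of nodes - 1).
centrality = nx.degree_centrality(graph)

# Keep the three terms with the highest centrality.
top = sorted(centrality.items(), key=lambda pair: pair[1], reverse=True)[:3]
for term, score in top:
    print(term, round(score, 2))
```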

Textplot uses numpy, scipy, scikit-learn, matplotlib, networkx, and clint.