data-8 / literature-connector

Literature and Data - Spring 2016 Data Science Connector Course

Raw Notes From Teddy - Need to Convert into individual work items #5

Closed anthonysuen closed 8 years ago

anthonysuen commented 8 years ago

Dear all,

It was great meeting with you a little over a week back. This is a follow-up email on a few items we spoke about.

1) Visualizations

There were two kinds of visualizations of textual similarity that I mentioned when we spoke. The first is a dendrogram, in which similarity is determined pairwise so that clusters of similar texts are progressively brought together in branches of a tree. The second is really just a scatter plot. I mentioned PCA in the meeting (referred to as a “biplot”), which is a handy method, but a simpler one might be a vanilla multidimensional scaling (MDS) algorithm, which takes points (i.e. texts) out of the high-dimensional word space and down to a 2D space for visualization. There is an example of R code to do this here: https://github.com/rochelleterman/text-analysis-dhbsi/blob/master/4-clustering.Rmd
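That example is in R, but a rough Python equivalent of both plots might look like the sketch below; the toy texts and labels are just placeholders for the real corpus, and the cosine distance metric is my assumption.

```python
# A rough Python sketch of both visualizations; 'texts' and 'titles'
# are toy stand-ins for the real corpus.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import MDS

texts = ['to be or not to be', 'all the world is a stage', 'et tu brute']
titles = ['Hamlet', 'As You Like It', 'Julius Caesar']

# Document-term matrix, then pairwise cosine distances between texts
dtm = CountVectorizer().fit_transform(texts).toarray()
dists = pdist(dtm, metric='cosine')

# Dendrogram: pairwise similarity merges texts into branches of a tree
dendrogram(linkage(dists, method='average'), labels=titles)
plt.show()

# MDS: project texts from word space down to 2D for a scatter plot
coords = MDS(n_components=2, dissimilarity='precomputed').fit_transform(
    squareform(dists))
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), label in zip(coords, titles):
    plt.annotate(label, (x, y))
plt.show()
```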

This link has examples of the dendrogram (Fig 9) and the biplot (Fig 1): http://winedarksea.org/?p=816

Ideally we’d get some pretty clusters in which the comedies, etc. are clearly grouped. If it’s too messy, I may not use it in the lesson, or I may select a different set of texts (e.g. Jane Austen and Charlotte Brontë).

2) Lesson Ideas

I have taught two workshops recently with versions of the lessons I’d like to use for the connector.

Lesson 1: Stylometry in Poetry https://github.com/teddyroland/NLTK-Workshop

This lesson and all of its materials are in the GitHub repo, including an IPython notebook. The goal of the lesson is to reproduce a particular graph (“Thornbury 170, Fig 4.5.png”). Basically, I used a list comprehension to get the first letter of each word in a text (‘christ-and-satan.txt’), counted the frequency of each letter, and then graphed their cumulative frequencies.
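For reference, the core of that pipeline fits in a few lines. This is just a minimal sketch, not the notebook itself, and matplotlib stands in for the NLTK plotting tool:

```python
# Minimal sketch of the lesson's core steps, using the filename from the repo.
from collections import Counter
import matplotlib.pyplot as plt

with open('christ-and-satan.txt') as f:
    words = f.read().split()

# List comprehension: first letter of each word
first_letters = [w[0].lower() for w in words if w]
counts = Counter(first_letters)

# Cumulative frequencies, from most to least common letter
freqs = sorted(counts.values(), reverse=True)
cumulative = [sum(freqs[:i + 1]) for i in range(len(freqs))]

plt.plot(cumulative)
plt.xlabel('Letter rank')
plt.ylabel('Cumulative count')
plt.show()
```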

There were two issues with this lesson. First, Python 2.7 doesn’t handle non-ASCII characters well, so the list comprehension filters them out. Would it be possible to count the non-ASCII characters as well? Second, the last graph was produced using an out-of-the-box tool from NLTK, which made coding easy but meant that I could only graph one arc at a time, whereas Thornbury’s graph has one arc for each “fitt”, or chapter.
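For the first issue, something like this might work: decode the file explicitly so the comprehension keeps non-ASCII letters even under Python 2 (I’m assuming the file is UTF-8):

```python
# Decode the file explicitly so non-ASCII first letters survive.
# io.open works the same in Python 2 and 3; UTF-8 is an assumption
# about the file's actual encoding.
import io
from collections import Counter

with io.open('christ-and-satan.txt', encoding='utf-8') as f:
    words = f.read().split()

first_letters = [w[0].lower() for w in words if w]
counts = Counter(first_letters)  # now includes letters like 'æ' and 'þ'
```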

On a related note, the y-axis of the NLTK graph is an absolute number of counts (e.g. “h appeared 12 times”), whereas if you compare several texts simultaneously, you want relative frequencies (e.g. “h was 12% of alliteration”), since the texts may be different lengths. Also, Thornbury wasn’t interested in the specific letters that alliterated, only in “the most common alliterative letter in Fitt X,” “the second most common letter in Fitt X,” and so on, since she’s measuring the distribution of alliteration over different sounds. A sketch combining both points follows below.
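Something like this (with toy per-fitt counts standing in for the real data) would plot one arc per fitt, ranked and normalized so fitts of different lengths are comparable:

```python
# One arc per fitt: letters ranked by frequency, counts converted to shares.
# The per-fitt counts here are toy data.
from collections import Counter
import matplotlib.pyplot as plt

fitt_counts = {
    'Fitt I': Counter({'h': 12, 's': 9, 'w': 5, 'g': 3}),
    'Fitt II': Counter({'s': 14, 'h': 6, 'f': 4, 'w': 2}),
}

for fitt, counts in fitt_counts.items():
    total = float(sum(counts.values()))
    # Rank the letters by frequency, then convert counts to shares
    shares = [n / total for n in sorted(counts.values(), reverse=True)]
    cumulative = [sum(shares[:i + 1]) for i in range(len(shares))]
    plt.plot(cumulative, label=fitt)

plt.xlabel('Letter rank (most common first)')
plt.ylabel('Cumulative share of alliteration')
plt.legend()
plt.show()
```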

Also, if you open the notebook, you will see that most of the lesson is a general intro to Python. I have a feeling we will not need that. The second part relies on NLTK (http://www.nltk.org/) and NLTK_Data (http://www.nltk.org/data.html). Again, we may not wish to use that package in the lesson, but you’ll need it if you want to run the notebook as-is.

Lesson 2: Literary Metadata https://github.com/teddyroland/Data-Wrangling-Workshop

These workshop materials are also contained (almost) entirely in the GitHub repo. The goal of the workshop is to produce a new version of a particular graph (“Paige-Graph 1.png”). We went through a bibliography of British novels (“1788 G-R.pdf”) and “crowdsourced” data entry from the PDF into a Google spreadsheet. Students entered textual metadata (e.g. title, author, publisher, year) and then went to Eighteenth Century Collections Online (ECCO) through the library website to check each book’s genre and whether it was narrated in the first or third person. Once we had collected all of that metadata and turned it into a pandas dataframe, we used some of its methods to produce the visualization.
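For reference, one way to pull the finished sheet into pandas is the CSV export that Google Sheets provides; the sheet ID in the URL below is a placeholder:

```python
# Read the crowdsourced metadata straight from a Google Sheets CSV export;
# <SHEET_ID> is a placeholder for the real sheet's ID.
import pandas as pd

sheet_csv = 'https://docs.google.com/spreadsheets/d/<SHEET_ID>/export?format=csv'
metadata = pd.read_csv(sheet_csv)
metadata.head()
```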

For the connector, I’d like to do a different version of this (because Professor Paige has not yet published his findings). There are two projects I’m interested in having students explore: 1) Franco Moretti’s work on titles (http://www.jstor.org/stable/10.1086/606125), and 2) Ted Underwood’s work on the rise of the third-person novel (http://tedunderwood.com/2013/09/22/genre-gender-and-point-of-view/).

To make a script for the Underwood project, the process is the same as described for the pandas workshop: “crowdsource” the metadata collection and each novel’s point of view, then graph the percentage of novels in first person by year.
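The graphing step might look like this sketch, with toy rows standing in for the crowdsourced metadata (the column names are my guesses):

```python
# Percentage of first-person novels by year; toy rows, assumed column names.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'year':   [1788, 1788, 1789, 1789, 1789],
    'person': ['first', 'third', 'first', 'first', 'third'],
})

pct_first = (df['person'] == 'first').groupby(df['year']).mean() * 100
ax = pct_first.plot(marker='o')
ax.set_ylabel('% novels in first person')
plt.show()
```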

Moretti’s script will require a couple of extra steps. He graphs average title length (in words) by year. Will you write a list comprehension to get the length of each title in the crowdsourced metadata and add it as a column to the data frame? I’d like to reproduce Moretti’s graph, ideally with a smoothed trend line. He also graphs the percentage of long titles (15+ words) for each year on the same plot as the annual percentage of short titles (1-3 words), so that would be useful to have as well.
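Here is a sketch of those steps, with toy rows, assumed column names, and a centered rolling mean standing in for whatever smoother we end up using:

```python
# Title length by year, Moretti-style; toy data and an assumed smoother.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'year':  [1788, 1789, 1789, 1790],
    'title': [
        'Emmeline, the Orphan of the Castle',
        'Zeluco',
        'The School for Widows',
        'Julia, a Novel',
    ],
})

# List comprehension: length of each title in words, added as a column
df['title_len'] = [len(t.split()) for t in df['title']]

# Average title length by year, plus a smoothed trend line
mean_len = df.groupby('year')['title_len'].mean()
ax = mean_len.plot(style='o', label='annual mean')
mean_len.rolling(window=2, center=True).mean().plot(ax=ax, label='smoothed')
ax.set_ylabel('mean title length (words)')
ax.legend()
plt.show()

# Long (15+ words) vs. short (1-3 words) titles on the same plot
pct_long = df.groupby('year')['title_len'].apply(lambda s: (s >= 15).mean() * 100)
pct_short = df.groupby('year')['title_len'].apply(lambda s: (s <= 3).mean() * 100)
ax2 = pct_long.plot(label='15+ words')
pct_short.plot(ax=ax2, label='1-3 words')
ax2.set_ylabel('% of titles')
ax2.legend()
plt.show()
```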

It might also be interesting to do something like sentiment analysis and see whether titles get happier/sadder or more abstract/concrete over time. NLTK has some sentiment tools.
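For instance, NLTK’s VADER scorer (shipped with recent NLTK versions, and needing the 'vader_lexicon' data download) could put a sentiment number on each title; the titles below are toys:

```python
# Speculative sketch: score title sentiment with NLTK's VADER analyzer.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time data download
sia = SentimentIntensityAnalyzer()

for title in ['The Happy Orphan', 'The Castle of Horror']:
    print(title, sia.polarity_scores(title)['compound'])
```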


It was a pleasure meeting you last week, and I’m looking forward to collaborating next semester. Let me know if you have any questions about these materials or adapting them for the connector.

Warmly, Teddy

teddyroland commented 8 years ago

No longer using the first toy example, although the suggested Lesson 2 was eventually implemented as Lesson 6.