data-8 / literature-connector

Literature and Data - Spring 2016 Data Science Connector Course

Demo Exercise #1 #1

Closed (anthonysuen closed 8 years ago)

anthonysuen commented 8 years ago

Attached is a zip file containing the texts of Shakespeare’s plays. All non-dialogue text has been removed from them. Also attached is a csv with the full title and genre of each play.

I would like to do an exercise in which we use word frequencies to measure textual similarity among these plays. There have been several prominent demonstrations recently in which unsupervised methods have “discovered” the genre of Shakespearean plays: comedies, histories, and tragedies each tend to cluster together on the basis of word frequencies. Ultimately, I would like to produce some kind of visualization of those clusters. (NOTE: In these demonstrations, the clusters are not perfect — which can be meaningful too.)
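
As a rough illustration of the word-frequency step, here is a minimal sketch that turns each text into a vector of relative word frequencies over a shared vocabulary. The `plays` dictionary and the `relative_frequencies` helper are hypothetical stand-ins for illustration, not the contents of the attached zip file, and the tokenization rule (lowercase runs of letters and apostrophes) is an assumption.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Map each lowercase word in `text` to its share of all words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Hypothetical stand-ins for the play texts; the real input would be
# the dialogue-only files from the attached zip.
plays = {
    "play_a": "the king and the crown",
    "play_b": "love and the fool",
}

freqs = {title: relative_frequencies(text) for title, text in plays.items()}

# Shared, sorted vocabulary so every play gets a vector of the same length.
vocabulary = sorted(set().union(*freqs.values()))

# One frequency vector per play; words a play never uses get 0.0.
vectors = {
    title: [f.get(word, 0.0) for word in vocabulary]
    for title, f in freqs.items()
}
```

Using relative rather than raw counts keeps long and short plays comparable, which matters once the vectors are fed to a similarity measure.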

I’m not married to any particular measurement of similarity (correlation, cosine, etc.), although it might be useful to calculate similarity a couple of different ways to show how the findings differ. Similarly, I am open to different modes of visualization (dendrogram, multidimensional scaling, etc.). Whatever methods are most compatible with the content of the main course should be fine.
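
To make the point about measures disagreeing concrete, here is a small sketch of cosine similarity and Pearson correlation computed on the same pair of frequency vectors. The two vectors are made-up numbers chosen for illustration, not values from the plays, and the function names are my own.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_correlation(u, v):
    """Pearson correlation: cosine similarity after mean-centering."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centered_u = [a - mean_u for a in u]
    centered_v = [b - mean_v for b in v]
    return cosine_similarity(centered_u, centered_v)


# Hypothetical relative-frequency vectors over a five-word vocabulary.
play_a = [0.4, 0.2, 0.2, 0.2, 0.0]
play_b = [0.25, 0.25, 0.0, 0.25, 0.25]

cos = cosine_similarity(play_a, play_b)
corr = pearson_correlation(play_a, play_b)
```

For these made-up vectors the cosine similarity is about 0.76 while the correlation is essentially zero, because frequency vectors are non-negative and cosine rewards any shared vocabulary, whereas correlation only rewards words that are above or below average in both texts together.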

Thank you very much for this. I look forward to seeing the notebook. If you have any questions, please send them my way.

teddyroland commented 8 years ago

No longer implementing this toy example.