NaNoGenMo / 2017

National Novel Generation Month, 2017 edition.
https://nanogenmo.github.io

The Average Novel #22

aparrish opened this issue 6 years ago

aparrish commented 6 years ago

Hi everyone, I'm going to try to make something this year! I haven't planned out anything yet but it will probably have to do with word embeddings somehow. Also the WikiPlots corpus.

aparrish commented 6 years ago

Some progress! I present: The Average Novel.

I'm still working with Project Gutenberg files from the April 2010 DVD ISO (downloadable [here](http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project)) and Leonard Richardson's [47000_metadata.json](https://twitter.com/leonardr/status/667049187918356480). Steps:

1. Fetch every text in PG labelled as fiction, parse the texts into sentences, and use gensim's Word2Vec module to calculate 100-dimensional word embeddings from the resulting sentences.
2. Create an array of word embeddings for every text (by looking up each word in the embedding) and normalize the length of these arrays to 50,000, leaving ~11k arrays of shape (50000, 100).
3. Sum the arrays for all the length-normalized texts and divide by the number of texts.
4. For each vector in the resulting average array, find the word with the closest embedding. (A rough sketch of these steps in code follows below.)
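Here's a minimal sketch of those four steps in gensim/NumPy. This isn't the exact code: `sentences` and `texts` are hypothetical stand-ins for the parsed Gutenberg data, and linear interpolation is just one plausible way to do the length normalization.

```python
import numpy as np
from gensim.models import Word2Vec

# Step 1: train 100-dimensional embeddings on all sentences.
# `sentences` is a hypothetical iterable of token lists drawn from the
# fiction texts (gensim < 4.0 spells this parameter `size`, not `vector_size`).
model = Word2Vec(sentences=sentences, vector_size=100, min_count=5, workers=4)

# Step 2: turn each text into an (m, 100) array, then resample to (50000, 100).
def text_to_array(tokens, model):
    """Stack the embedding of every in-vocabulary token."""
    return np.array([model.wv[t] for t in tokens if t in model.wv])

def normalize_length(arr, n=50000):
    """Resample an (m, 100) array to (n, 100); linear interpolation along
    the text's "time" axis is one plausible way to do this."""
    old_x = np.linspace(0.0, 1.0, num=arr.shape[0])
    new_x = np.linspace(0.0, 1.0, num=n)
    return np.stack([np.interp(new_x, old_x, arr[:, d])
                     for d in range(arr.shape[1])], axis=1)

# Step 3: running mean over all ~11k texts (`texts` is a hypothetical
# iterable of token lists, one per novel).
total, count = np.zeros((50000, 100)), 0
for tokens in texts:
    arr = text_to_array(tokens, model)
    if arr.shape[0] < 2:
        continue
    total += normalize_length(arr)
    count += 1
average = total / count

# Step 4: map each averaged vector back to its nearest word.
words = [model.wv.similar_by_vector(vec, topn=1)[0][0] for vec in average]
print(" ".join(words))
```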

You can see the results [here](https://gist.github.com/aparrish/86daccdfa4f338b1d33e98d1624029d7).

I guess I secretly hoped that this technique would reveal, [average face](https://pmsol3.wordpress.com/)-like, the Narrative Ur-text underlying all storytelling. But the result is pretty much what I actually expected: all of the structural variation gets lost in the wash. (The `Produced` and `Proofreaders` tokens at the top are obviously remnants of PG credits and boilerplate that weren't caught by the filtering tools I'm using; the `,` token just happens to have been the vector most central to the average, which I guess kinda makes sense given how Word2Vec works. Not sure what all those pachyderms are doing in there, though.)

I'm planning to continue experimenting with this technique, but wanted to share this progress in case further experiments extend past the deadline.

swizzard commented 6 years ago

If nothing else, pachyderms.camp would make for a great Mastodon instance domain.


aparrish commented 6 years ago

going to post the source code for this soon, stay tuned!

moonmilk commented 6 years ago

This part is so beautiful. [image]

aparrish commented 6 years ago

I'm a day late, but I posted the source code and a new version of the output. For the new version, I decided to ignore punctuation tokens when calculating the vectors for each novel. The result has fewer commas, and the variation is a bit more interesting, IMO!
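For a sense of what that filtering could look like, here's a small sketch; the `is_punct` predicate is a stand-in, not necessarily the check the posted code uses.

```python
import string

def is_punct(token):
    """True if a token consists entirely of punctuation characters."""
    return all(ch in string.punctuation for ch in token)

# Drop punctuation tokens before building each novel's embedding array.
tokens = [t for t in tokens if not is_punct(t)]
```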