dariusk / NaNoGenMo

National Novel Generation Month. Because.

RESOURCES! #11

Open dariusk opened 10 years ago

dariusk commented 10 years ago

This is an open issue where you can comment and add resources that might come in handy for NaNoGenMo.

NOTE: at some point I will turn this into a more organized document, probably on the wiki for this repo.

dariusk commented 10 years ago

A submission from @scottmadin:

Python Markov chains: https://pypi.python.org/pypi/PyMarkovChain/
Python Internet Archive API: https://pypi.python.org/pypi/internetarchive/0.4.4

Also, similar things in NodeJS:

https://npmjs.org/package/archive.org
https://npmjs.org/package/markov
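
For anyone who'd rather see the idea than a library API, here's a minimal hand-rolled sketch of a word-level, order-1 Markov chain in Python (this is not PyMarkovChain's interface, just the underlying technique):

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it (an order-1 chain)."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)

def generate(chain, length=50, rng=random):
    """Random-walk the chain to produce `length` words."""
    word = rng.choice(list(chain))
    out = [word]
    while len(out) < length:
        followers = chain.get(word)
        # Dead end (a word that only appears last): restart from a random word.
        word = rng.choice(followers) if followers else rng.choice(list(chain))
        out.append(word)
    return " ".join(out)
```

Train it on any plain-text corpus (Project Gutenberg dumps work well) and keep calling `generate` until you hit 50,000 words.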

darkliquid commented 10 years ago

http://stackoverflow.com/questions/353274/story-telling-building-algorithms

willf commented 10 years ago

I wrote a "Samsa bot" that uses Bing's Ngram database to generate text. You might find it and the associated libraries useful (all Ruby).

https://github.com/willf/microsoft_ngram/blob/master/examples/samsabot.rb

General library:

https://github.com/willf/microsoft_ngram

dariusk commented 10 years ago

Since @willf is too humble to plug it, Wordnik is an indispensable resource for all things text-related: definitions, parts of speech, random words, rhymes, hypernyms, etc:

http://developer.wordnik.com/docs.html#!/word

vitorio commented 10 years ago

Here's a dump of my notes about generating stories:

@rfreebern researched this problem a few years back for this game project of his:

Curses! is a single-player open-ended adventure game with the basic premise that the player is a fairy tale villain bent on wrecking many potential fairy tales as completely as possible. Fairy tale plots would be generated on-the-fly based on a basic generator template that attempts to intelligently combine dozens or hundreds of very basic fairy tale elements to create situations that are both unique and familiar. The PC's goal is not to just thwart the happy ending but to do it thoroughly: not just kill the handsome prince, but cripple and disfigure him while making the princess hate him and get exiled from her kingdom, for example.

Fairy tales are really well-explored variants of the standard storytelling archetypes described by people like Joseph Campbell. There are a couple of ways that fairy tales are organized, which include their plot outlines (although not their cultural or moral implications): Aarne-Thompson, and Propp. http://en.wikipedia.org/wiki/Aarne-Thompson_classification_system

Propp's classification system has been used as the basis for a number of generators and is still the most-used mechanism in the academic literature for such things: http://en.wikipedia.org/wiki/Vladimir_Propp

Propp generators are things like: http://www.fdi.ucm.es/profesor/fpeinado/projects/kiids/apps/protopropp/

Clicking through to their later Bard system shows examples at the bottom, and that whole KIIDS thing is for interactive narrative and computational narratology, which are the academic terms for this sort of thing (I call my work in this area automated storytelling with post-hoc computational narratives, since my use and implementation aren't interactive).
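
Because Propp's morphology lists his narrative functions in a fixed canonical order, a crude plot-outline generator only needs to sample an ordered subset of them. A sketch, using an abridged and hand-picked subset of the function names (the real systems above do far more than this):

```python
import random

# An abridged, ordered subset of Propp's 31 narrative functions.
PROPP_FUNCTIONS = [
    "Absentation", "Interdiction", "Violation", "Villainy",
    "Departure", "Donor Test", "Acquisition of a Magical Agent",
    "Struggle", "Victory", "Return", "Pursuit", "Rescue",
    "Recognition", "Punishment of the Villain", "Wedding",
]

def generate_plot(n=6, rng=random):
    """Pick n functions at random, preserving Propp's canonical ordering."""
    indices = sorted(rng.sample(range(len(PROPP_FUNCTIONS)), n))
    return [PROPP_FUNCTIONS[i] for i in indices]
```

Each function in the resulting outline would then be expanded into actual prose by whatever surface-text generator you have on hand.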

Mark Finlayson's work out of MIT is a little more recent: http://www.mit.edu/~markaf/research.html

Plugging any of that research into Google Scholar and looking at recent citations of those papers are a good way to catch up.

The massively-multiplayer video game Star Wars Galaxies tried something along these lines with their Dynamic Points of Interest, but they weren't really well executed from a design and technical implementation perspective. They had a lot of potential, but Raph Koster describes their problems here: http://www.raphkoster.com/2010/04/30/dynamic-pois/

Outside of fairy tales, there are works like Plotto, which provide narrative guides to plot generation, and the monomyth-related works by Campbell, etc.: http://www.brainpickings.org/index.php/2012/01/06/plotto/

Plotto is actually in the public domain, and can be found in the Internet Archive here: https://archive.org/details/plottonewmethodo00cook

And journalism is getting into it, too. A program at Northwestern that turns sports stats into sports articles worked out so well that the group published little research and went straight into a startup. The Wired article is here: http://www.wired.com/gadgetlab/2012/04/can-an-algorithm-write-a-better-news-story-than-a-human-reporter/all/1

The one paper I found by the Northwestern group cites one major paper from 1977 about "Tale-spin." You can look for citations from the Tale-spin article, and that brings up some interesting recent work from elsewhere: http://scholar.google.com/scholar?cites=8316499405683938909&as_sdt=5,44&sciodt=0,44&hl=en

Finally, there's this failed Kickstarter: http://www.kickstarter.com/projects/storybricks/storybricks-the-mmorpg-storytelling-toolset

Even more finally, I also found this PDF in a second set of notes: https://research.cc.gatech.edu/inc/content/sequential-recommendation-approach-interactive-personalized-story-generation

darrentorpey commented 10 years ago

Thanks, @vitorio! That looks helpful.

smadin commented 10 years ago

(OK, I made a github account.) https://pypi.python.org/pypi/wikipedia/1.0.3 is a python interface to wikipedia, which may also be helpful for the quick-and-dirty Markov-chain approach. It was very easy to hack together a script to fetch random Wikipedia tables for source text and churn out a "novel" of a given word-count.

nickheer commented 10 years ago

SC Chen's Simple HTML DOM Parser for PHP.

dariusk commented 10 years ago

While in-browser DOM manipulation is obviously ruled by jQuery, my favorite NodeJS DOM parser/manipulator is Cheerio, which uses jQuery-style selectors.

Also if you're in Ruby and need to do HTML/XML parsing, Nokogiri rules the roost.

rfreebern commented 10 years ago

I'm hanging out in #nanogenmo on FreeNode if anyone wants to join. We can toss ideas around on a casual basis there.

dariusk commented 10 years ago

For those who aren't super IRC-literate, or just don't want to install an IRC client, you can go here, pick a username, and visit #nanogenmo from your web browser:

http://webchat.freenode.net/?channels=#nanogenmo

jiko commented 10 years ago

The Bard project looks awesome. Thanks @vitorio!

jiko commented 10 years ago

Some Python resources:

agladysh commented 10 years ago

An article about generator of Recursive Fairy Tales in Haskell (in Russian): http://habrahabr.ru/post/136007/

Google Translate: http://translate.google.com/translate?hl=en&sl=ru&tl=en&u=http%3A%2F%2Fhabrahabr.ru%2Fpost%2F136007%2F

darkliquid commented 10 years ago

Not strictly related, but there are several story-based/narrative-focused roleplaying games that could be used/formalised into a system for generating overall plot structures. I'm currently looking at Microscope, Fiasco and FATE Core as potential systems for having characters 'play' through a game and recording what they do and what actions they take to generate stories.

jiko commented 10 years ago

Here's some of my Python code for generating sentences based on supplied text. None of the Twitter-related code has been tested with v1.1 of the Twitter API, but worked fine on v1.

jiko commented 10 years ago

The Dada Engine, which powers the infamous Postmodernism Generator, might come in handy. There's an online manual and a clone on GitHub.
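
The Dada Engine's core trick, recursively expanding a grammar of weighted alternatives, is easy to sketch. Here's a toy Python version in that spirit (the real engine has its own rule language and many more features; this grammar is invented for illustration):

```python
import random

# A tiny recursive grammar: each nonterminal (in angle brackets) maps to a
# list of alternative productions, given as token lists.
GRAMMAR = {
    "<sentence>": [["<np>", "<vp>"]],
    "<np>": [["the", "<adj>", "<noun>"], ["a", "<noun>"]],
    "<vp>": [["<verb>", "<np>"], ["<verb>"]],
    "<adj>": [["postmodern"], ["recursive"], ["infamous"]],
    "<noun>": [["text"], ["engine"], ["novel"]],
    "<verb>": [["deconstructs"], ["generates"], ["subverts"]],
}

def expand(symbol, grammar=GRAMMAR, rng=random):
    """Recursively expand a symbol; terminals are returned as-is."""
    if symbol not in grammar:
        return symbol
    tokens = rng.choice(grammar[symbol])
    return " ".join(expand(tok, grammar, rng) for tok in tokens)
```

Calling `expand("<sentence>")` yields things like "the recursive engine subverts a novel"; scale the grammar up and you're most of the way to a Postmodernism Generator.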

erkyrath commented 10 years ago

Not a resource, but a suggestion: when you complete a novel, change the title of your issue to "$NovelTitle by $Author", so that we can easily browse them.

(Yeah, someone is now going to actually title their novel "$NovelTitle".)

If I were an over-organizational nerd, I would suggest setting up appropriate issue tags ("In Progress", "Complete", "Stupid Ideas", etc). But I leave that up to whether Darius is an over-organizational nerd.

dariusk commented 10 years ago

I agree with you @erkyrath -- I'll try and prod people to do that when they're done. Issue tags... I might start labeling things myself!

dariusk commented 10 years ago

Okay, I opened a new Issue ( #42 ) for general discussion. This thread remains the place for technical resources; the other thread is open to everything else.

vitorio commented 10 years ago

Ficly ( http://ficly.com/stories and its predecessor Ficlets http://ficlets.ficly.com/ ) is a very-short-story writing community, where you have a 1024 character limit. There are lots of tiny stories on the site, but also, you can fork any story and write prequels and sequels to it. Some stories have multiple prequels and sequels, like an unintentional choose-your-own-adventure.

All of the Ficly and Ficlets content is licensed CC-BY-SA.

In late May 2013, I scraped all of Ficly and dumped 13,144 stories, all of which had at least one prequel or sequel, into a matching number of JSON files (there should be no standalone 1k-character stories). Each JSON file records the ID, URL and title of the story; the author's avatar, name and URL; the IDs and URLs of prequels and sequels; and the story content in Markdown.

The scraper (in Python) is probably a little prickly, as it's mostly uncommented, but the .zip of 13k JSON files could be dumped straight into a JSON document store and worked with directly. Perhaps someone wants to generate 50k words of choose-your-own-adventure stories or something.

https://github.com/vitorio/NaNoGenMo2013
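
Given that per-story schema, assembling one branch of the unintentional choose-your-own-adventure is a short walk through the sequel links. A sketch; the key names (`id`, `content`, `sequels`) are guesses at the dump's actual field names, so check them against the real JSON files:

```python
import json
import random

def load_stories(paths):
    """Load the per-story JSON files into a dict keyed by story ID.
    Field names here are assumptions about the dump's schema."""
    stories = {}
    for path in paths:
        with open(path) as f:
            story = json.load(f)
        stories[story["id"]] = story
    return stories

def random_thread(stories, start_id, rng=random):
    """Follow random sequel links from a starting story until a leaf."""
    thread, current = [], stories[start_id]
    while True:
        thread.append(current["content"])
        sequel_ids = [s for s in current.get("sequels", []) if s in stories]
        if not sequel_ids:
            return thread
        current = stories[rng.choice(sequel_ids)]
```

Concatenating enough random threads (or emitting the branch points as numbered "turn to story N" choices) would get you to 50k words quickly.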

darkliquid commented 10 years ago

While researching, I've gathered info from a few sources to generate a bunch of sentence structures using parts-of-speech tagging. Others might find these useful, so you can find them here: https://github.com/darkliquid/NaNoGenMo/tree/master/data

The data is basically one sentence to a line, each line containing a stream of space separated parts-of-speech tags. There are likely to be mistakes in the set as I've hacked this together without any real understanding of what it is I'm doing or what I yet hope to achieve from it, but have at it and good luck!
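
One way to use a file in that format is to treat each tag line as a template and fill each tag from a word list of that class. A sketch with a tiny hand-made lexicon (the tag names are Penn Treebank tags; the word lists here are invented examples, not darkliquid's data):

```python
import random

# A toy lexicon keyed by Penn Treebank POS tags.
LEXICON = {
    "DT": ["the", "a"],
    "JJ": ["strange", "quiet", "endless"],
    "NN": ["novel", "machine", "night"],
    "VBD": ["wrote", "hummed", "vanished"],
    "IN": ["through", "beneath"],
}

def fill_template(tag_line, rng=random):
    """Replace each POS tag in a line with a random word of that class.
    Tags missing from the lexicon are passed through unchanged."""
    words = []
    for tag in tag_line.split():
        choices = LEXICON.get(tag)
        words.append(rng.choice(choices) if choices else tag)
    return " ".join(words)
```

Feeding it a line like `DT JJ NN VBD IN DT NN` yields sentences such as "the quiet machine vanished beneath a night": grammatical in shape, if not in sense.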

dariusk commented 10 years ago

To be clear, @darkliquid's output can be interpreted by looking at this list of part of speech tags.

aparrish commented 10 years ago

this might be inspiring for some folks http://en.wikipedia.org/wiki/Postmodern_literature#Common_themes_and_techniques

catseye commented 10 years ago

It would be very difficult to use in an automated way (and I realize it may be unpopular with some participants), but if you haven't heard of it, there's this site called TVTropes. It contains a vast array of, well, tropes (from fiction in general: mostly mass media, but not exclusively television), pre-deconstructed for your convenience. For example, Applied Phlebotinum.

lazerwalker commented 10 years ago

Speaking of parts-of-speech tagging (cc @darkliquid), if you're literate in Objective-C Apple's NSLinguisticTagger API is fantastic. (http://nshipster.com/nslinguistictagger/)

darkliquid commented 10 years ago

Wow, that is nice. Sadly it's of no use to me in the Linux world, but it looks like a much richer source of data for the kinds of analysis I'm looking to do.

On another note, I've started annotating the parts-of-speech tag definitions with example words and some extra rules for their use in sentences where applicable (which hopefully I can then use to scan my sentence structure list to bin structures that are grammatically incorrect). https://github.com/darkliquid/NaNoGenMo/blob/master/data/tag_types.txt

enkiv2 commented 10 years ago

WordNet can be coaxed into doing part of speech tagging (in addition to providing synonyms, antonyms, and other related words), although part of speech tagging requires a hack (iterate over parts of speech until the word has a synonym in that group, then guess which part of speech the word is actually being used as). I'd recommend using that on *nix, since it has other (more useful) functions.

Tangentially, I have a resource to contribute. https://github.com/enkiv2/synonym-warp will take a text document and randomly replace some words with synonyms (which slightly warps the semantics since the synonyms it uses aren't necessarily appropriate to the context). It expects to run on a unix under zsh, with wordnet in the path. I'm planning to run input texts through it before training a markov model, to add a little noise.
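
The synonym-replacement idea is simple enough to sketch in a few lines of Python. Note this uses a tiny hand-made synonym table as a stand-in: the real synonym-warp shells out to WordNet, and the word lists here are invented for illustration:

```python
import random

# Toy synonym table standing in for WordNet lookups.
SYNONYMS = {
    "big": ["large", "huge"],
    "sad": ["unhappy", "mournful"],
    "walk": ["stroll", "amble"],
}

def warp(text, probability=0.5, rng=random):
    """Randomly swap words for a synonym, slightly warping the semantics,
    since the chosen synonym isn't necessarily right for the context."""
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < probability:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)
```

Running source text through something like this before training a Markov model adds exactly the kind of noise described above.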

jiko commented 10 years ago

@darkliquid Nice work! Part of speech tagging seems like a fruitful avenue.

I've played with this Javascript PoS tagger in the last few days. I found it through The node.js Natural Language Story blog post by the maintainer of a package of general natural language facilities for node. I found another interesting Node package to generate random sentences from BNF grammars, along the lines of the Dada Engine mentioned above.

jiko commented 10 years ago

Codewalk: Generating arbitrary text: a Markov chain algorithm in Go.

vitorio commented 10 years ago

Creating biographies of people using just their Twitter stream: http://www.fastcolabs.com/3021091/this-algorithm-can-tell-your-life-story-through-twitter?partner=rss&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+fastcompany%2Fheadlines+%28Fast+Company%29

darkliquid commented 10 years ago

Some lists of names, places, occupations, etc for generating character details.

Names http://stackoverflow.com/questions/1803628/raw-list-of-person-names

Titles http://www.gutenberg.org/dirs/GUTINDEX.ALL

US Cities http://wiki.skullsecurity.org/images/5/54/US_Cities.txt

Job Titles http://www.bls.gov/soc/soc_2010_direct_match_title_file.xls

Adjectives http://www.enchantedlearning.com/wordlist/adjectives.shtml

Nouns http://www.momswhothink.com/reading/list-of-nouns.html
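
Combining lists like these into character sheets is a one-liner per field. A sketch with tiny invented samples standing in for the full lists linked above:

```python
import random

# Tiny invented samples; swap in the full lists linked above.
NAMES = ["Ada", "Brendan", "Celia", "Dmitri"]
CITIES = ["Austin", "Boise", "Chicago"]
JOBS = ["archivist", "beekeeper", "cartographer"]
ADJECTIVES = ["anxious", "bold", "curious"]

def random_character(rng=random):
    """Draw one field from each list to produce a character sheet."""
    return {
        "name": rng.choice(NAMES),
        "home": rng.choice(CITIES),
        "job": rng.choice(JOBS),
        "trait": rng.choice(ADJECTIVES),
    }
```

A cast generated this way can then be threaded into whatever plot or sentence generator you're using.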

enkiv2 commented 10 years ago

For anybody rolling their own grammars, I found a constraint solver in python: https://github.com/switham/constrainer

elib commented 10 years ago

I don't know if anyone has referenced this crucial resource. https://www.youtube.com/watch?v=FUa7oBsSDk8

darkliquid commented 10 years ago

I've been running a term extraction job for the last couple of days, and it just finished. It contains various 'terms', i.e. the key noun or noun phrase/topic each sentence is about, extracted from around half a million sentences across a wide range of sources (Gutenberg novels, news articles, etc). I'm not sure I'll even use it now, but it might be of use to people looking to seed their stories with random topics.

https://github.com/darkliquid/NaNoGenMo/blob/master/data/terms_cleaned.txt.gz

enkiv2 commented 10 years ago

I was inspired by somebody's example of dialogue generation, and so I wrote some code to parse an ontology and create some question/answer pairs based on categories: https://github.com/enkiv2/NaNoGenMo2013

At some point, I'll need to hack it to generate other kinds of dialogue.

warnaars commented 10 years ago

You might find this an interesting take on 'automated content authorship' http://youtu.be/SkS5PkHQphY

MichaelPaulukonis commented 10 years ago

@warnaars Philip M. Parker! I would love to see some of his novelistic output.... I'd really love to see some of his code. I've got some more links on him at http://www.xradiograph.com/WordSalad/AutomaticForThePeople

lilinx commented 10 years ago

"If the atoms have by chance formed so many sorts of figures, why did it never fall out that they made a house or a shoe? Why at the same rate should we not believe that an infinite number of Greek letters, strewed all over a certain place, might fall into the contexture of the Iliad?" Michel de Montaigne (1533-1592), Essais

ikarth commented 10 years ago

For that matter, how about a Library of Babel generator? (Not mine) http://dicelog.com/babel
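
A Library of Babel page generator fits in a few lines: every page is just uniform random symbols. This sketch simplifies Borges's character set to the lowercase Latin alphabet plus space, comma, and period:

```python
import random
import string

# A simplification of Borges's symbol set (his library used 22 letters
# plus space, comma, and period).
SYMBOLS = string.ascii_lowercase + " ,."

def babel_page(lines=40, width=80, rng=random):
    """One page in the spirit of the Library of Babel: uniform random symbols."""
    return "\n".join(
        "".join(rng.choice(SYMBOLS) for _ in range(width))
        for _ in range(lines)
    )
```

At 40 lines of 80 characters per page, a few hundred pages of this gets you past 50k "words", for a very generous definition of word.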

notio commented 10 years ago

Not open source, but still! The Fiction Idea Generator is interesting: http://figapps.net/fig.html

It's free this month (iTunes): https://itunes.apple.com/app/fiction-idea-generator-ef/id507536455?mt=8

lilinx commented 10 years ago

Also, you might be interested in the works of Jean-Pierre Balpe. This man has been doing generative literature experiments for a while, and he has countless bot-blogs generating the weirdest things. Unfortunately he seems to do everything in French: it's very difficult to find anything about him in English (there isn't even an English Wikipedia article). But there is this short article: http://www.digitalarti.com/blog/digitalarti_mag/portrait_jean_pierre_balpe_inventor_of_literature

catseye commented 10 years ago

In one issue here somewhere I obliquely suggested generating a graphic novel -- that is to say, a comic book. While I would love to try, I definitely won't have the time to do this in what remains of November, but here are some resources I found while researching it:

http://openclipart.org is a collection of SVG images, all in the public domain. It can also render them as PNGs for you, at the scale you choose. It has a JSON API: http://openclipart.org/developers

If you wanted to use that JSON API on your own web page (perhaps to display these images on an HTML5 canvas element) you could use this generic JSONP proxy to make a mockery of the same-origin policy: http://jsonp.jit.su/

Here is a library of onomatopoeic sound-effects: http://www.writtensound.com/index.php Not sure how easy it would be to scrape, but probably wouldn't be hard to pick a random item from a desired category, like: http://www.writtensound.com/index.php?term=movement

Here is a list of catchphrases: https://en.wikipedia.org/wiki/List_of_catchphrases

And, just for that extra dadaist touch & in no way limited to graphic novels, here is a list of various abuses of the statistical meaning of p-value, collected from various academic papers: http://mchankins.wordpress.com/2013/04/21/still-not-significant-2/

I imagine the result of using these resources would be something like:

a sombrero with a word balloon saying "Cowabunga" next to Tux (the Linux penguin) with a thought bubble saying "did not quite reach conventional levels of statistical significance (p=0.079)"... with the word SCHHWAFF at a slight angle and in a large-point font, in the background

MichaelPaulukonis commented 10 years ago

@catseye check out blotcomics and the graphic novel harsh noise.

I can't shake the feeling that the end result of your automation, however, will end up looking like ELER. ep064 source

ikarth commented 10 years ago

If we're going graphical I should probably mention the billion-year archives of the webcomic mezzacotta: http://www.mezzacotta.net/

bredfern commented 8 years ago

You can take a look at the text of my Automated Lovecraft project here: https://github.com/bredfern/automated-lovecraft/blob/master/automated_lovecraft.md

bredfern commented 8 years ago

The interesting thing I learned is that more firepower doesn't produce a better result: there's a sweet spot between the size of the data set and the number of layers, so to train on all of Lovecraft's text I got the best results using Torch with just 4 layers. Since I was running off char-rnn, most of the code I wrote was actually just bash scripts to run the Torch processes. I want to get deeper into this stuff so I can go further with it, but it's exciting to see the training results, never having done this before.

hugovk commented 8 years ago

@bredfern Wrong repo! This is the 2013 one, here's this year's: https://github.com/dariusk/NaNoGenMo-2015/issues/1