dariusk / NaNoGenMo-2014

National Novel Generation Month, 2014 edition.

God is a TJ #57

Open y-a-v-a opened 9 years ago

y-a-v-a commented 9 years ago

TJ is short for Text Jockey. http://www.god-is-a-tj.com/ If you let it run for a month in an open browser window, you will have a novel written by the web, using only snippets from the Bible as found on Project Gutenberg. Code will follow on https://github.com/y-a-v-a

y-a-v-a commented 9 years ago

So "God is a TJ" was an idea a friend and I had in 2010: retrieve random snippets from the Bible and string them together to generate a new text (not to generate a new Bible). Every sentence is made up of three parts: a beginning part that starts with a capital, a middle part containing neither a capital nor a dot, and a finishing part that ends with a dot. That way, you get a new sentence every 3 requests. One of the biggest surprises IMHO is that the new sentences read quite naturally, even though they should be bogus. The plan for now is to rewrite part of the code to generate a novel for #NaNoGenMo-2014. I'll keep posting here. The repo is https://github.com/y-a-v-a/tj
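A minimal sketch of that three-part idea, not the actual code from the repo. The fragment pools and the `assembleSentence` helper are made up for illustration; the real project pulls its fragments from the Bible text.

```php
<?php
// Made-up fragment pools, keyed by position: 0 = beginning, 1 = middle, 2 = end.
$parts = [
    0 => ['And he went down into the city', 'Then they came unto the river'],
    1 => ['and spake unto all the people of the land', 'which dwelt among the hills of the south'],
    2 => ['and they were exceeding glad.', 'and there was none left in the field.'],
];

// Hypothetical helper: pick one random fragment per position and join them.
function assembleSentence(array $parts): string
{
    $chosen = [];
    foreach ([0, 1, 2] as $position) {
        $pool = $parts[$position];
        $chosen[] = $pool[array_rand($pool)];
    }
    return implode(' ', $chosen);
}

echo assembleSentence($parts), PHP_EOL;
```

Because each position has its own pool, every result starts with a capital and ends with a dot, whatever ends up in the middle.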

y-a-v-a commented 9 years ago

As a side note: besides the Bible-based version, we also have one based on articles from a famous Dutch art magazine called Metropolis M, which we called Neurpolis N. That one creates new articles on art from the contents of the original site.

wordsmythe commented 9 years ago

Fascinating! How well does that handle proper-noun capitalization in the middle of sentences? Does your Gutenberg text include capitalizations for "Gospel" and pronouns that refer to God?

y-a-v-a commented 9 years ago

I wrote some regular expressions to chop the text into parts, but that was 4 years ago, so I have to take another look to get the full hang of them again ;-)

y-a-v-a commented 9 years ago

Just added a basic PHP class that's a wrapper around the line retrieval, and a simple generator script to create a given number of lines.
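Roughly, such a generator could look like the sketch below. This is an assumption about the shape of the script, not a copy of it: `generateLines` and `$fakeRetriever` are hypothetical names, and the retriever here returns fixed made-up fragments instead of calling the real line-retrieval class.

```php
<?php
// Hypothetical generator: asks for positions 0, 1, 2, 0, 1, 2, ... so that
// every three lines form one beginning/middle/end sentence.
function generateLines(callable $nextLine, int $lineCount): array
{
    $lines = [];
    for ($i = 0; $i < $lineCount; $i++) {
        $lines[] = $nextLine($i % 3);
    }
    return $lines;
}

// Made-up stand-in for the line-retrieval wrapper.
$fakeRetriever = function (int $position): string {
    $samples = ['And he arose', 'and went into the city', 'and dwelt there.'];
    return $samples[$position];
};

$lines = generateLines($fakeRetriever, 6);
$text = implode(' ', $lines);

echo $text, PHP_EOL;
echo str_word_count($text), ' words', PHP_EOL;
```

Counting words with `str_word_count` is one way to check when a run has reached the NaNoGenMo target of 50,000 words.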

y-a-v-a commented 9 years ago

I actually got a result. After letting it run for a while, I have a text consisting of 50,002 words, coincidentally beginning with the word 'God' and ending with 'etc.', which is quite literary, I thought.

Result can be found here: https://github.com/y-a-v-a/tj/blob/master/bin/text20141103-223023.txt and raw https://raw.githubusercontent.com/y-a-v-a/tj/master/bin/text20141103-223023.txt

y-a-v-a commented 9 years ago

Last week I wrote a new PHP class that handles the Gutenberg bible texts: https://github.com/y-a-v-a/tj/blob/master/www/BibleLine.class.php

        switch ($this->sp) {
            case '0':
                // beginning: a capitalized word followed by 4 to 12 lowercase words
                preg_match_all("/[A-Z]{1}[a-z]*[ ]{1}([a-z0-9:\-,]*[ ]){4,12}/",$cnt, $matches);
                $matches[0] = array_map(function($item) { return trim($item, "\n\r\t;:. ,"); }, $matches[0]);
                break;
            case '1':
                // middle: starts with a lowercase letter, 6 or more words, no dot
                preg_match_all("/[ ]{1}[a-z]{1}([A-Za-z\-,:]*[ ]){6,}/",$cnt, $matches);
                break;
            case '2':
                // end: 6 to 13 words closed by a dot, question mark or exclamation mark
                preg_match_all("/([ ]+?[a-z][A-Za-z\-,:]*){6,13}([\.\?!]){1}/",$cnt, $matches,PREG_PATTERN_ORDER);
                break;
        }

The above lines are the real text processors. The three regular expressions are responsible, in order, for: a set of words starting with a capital letter, a middle part that neither starts with a capital nor ends with a dot, and a set of words ending with a dot (or question mark, etc.). These parts are concatenated one after another, so the result is a long sentence with hiccups, multiple subjects, etc. See https://github.com/y-a-v-a/tj/blob/master/bin/text20141108-231139.txt for an outcome of the code. It's just over 50k words.
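To see what each expression picks up, here is a small standalone sketch that runs the three patterns quoted above against one sample string and glues the first match of each together. The sample text is made up, and the plain `trim()` is a simplification of what the class does; only the patterns themselves come from the code above.

```php
<?php
// The three patterns from the switch above, keyed by sentence position.
$patterns = [
    'start'  => '/[A-Z]{1}[a-z]*[ ]{1}([a-z0-9:\-,]*[ ]){4,12}/',
    'middle' => '/[ ]{1}[a-z]{1}([A-Za-z\-,:]*[ ]){6,}/',
    'end'    => '/([ ]+?[a-z][A-Za-z\-,:]*){6,13}([\.\?!]){1}/',
];

// Made-up sample text, only here to exercise the patterns.
$sample = 'And he went down into the city and spake unto all the people '
        . 'of the land, and they heard him and were glad.';

$fragments = [];
foreach ($patterns as $position => $pattern) {
    preg_match_all($pattern, $sample, $matches);
    // $matches[0] holds the full matches; keep the first one, if any.
    $fragments[$position] = isset($matches[0][0]) ? trim($matches[0][0]) : '';
    echo $position, ': ', $fragments[$position], PHP_EOL;
}

// Concatenate beginning, middle and end into one generated sentence.
echo implode(' ', $fragments), PHP_EOL;
```

In the real setup each part comes from a different random spot in the text, which is where the hiccups and shifting subjects come from.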