dariusk / NaNoGenMo-2014

National Novel Generation Month, 2014 edition.

The Quantum Supposition of Oz #137

Open spc476 opened 9 years ago

spc476 commented 9 years ago

An order-3 Markov chain based on the Oz novels written by L. Frank Baum (14 novels in total). The only unusual thing here is that I treated punctuation marks, as well as the end-of-paragraph marker, as "words", so that you don't get a "wall of text" but something that is a bit more readable (even if the punctuation is separated by spaces when it shouldn't be).
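For readers who want to see the idea in code, here is a minimal sketch of an order-3 chain in which punctuation marks and the end-of-paragraph marker are ordinary tokens. It is not the code from the linked repository; the `PARA` marker, the tokenizing regex, and all function names are assumptions made for illustration.

```python
import random
import re
from collections import defaultdict

PARA = "<P>"  # hypothetical end-of-paragraph token (an assumption, not from the repo)

def tokenize(text):
    """Split text into words, single punctuation marks, and paragraph markers."""
    tokens = []
    for para in re.split(r"\n\s*\n", text.strip()):
        # A run of letters, digits, or apostrophes is a word; any other
        # non-space character becomes a token of its own.
        tokens.extend(re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", para))
        tokens.append(PARA)
    return tokens

def build_chain(tokens, order=3):
    """Map each run of `order` tokens to the list of tokens seen after it."""
    chain = defaultdict(list)
    for i in range(len(tokens) - order):
        chain[tuple(tokens[i:i + order])].append(tokens[i + order])
    return chain

def generate(chain, length, rng=random):
    """Walk the chain from a random state, emitting up to `length` tokens."""
    state = rng.choice(list(chain))
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:
            break  # dead end: this state was only seen at the end of the corpus
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    # Every token is joined with a space, so punctuation ends up detached,
    # which is the spacing quirk mentioned above.
    return " ".join("\n\n" if t == PARA else t for t in out)
```

Because the paragraph marker is just another token, the chain decides where paragraphs end in the same way it decides where commas go.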

The code is on GitHub: https://github.com/spc476/NaNoGenMo-2014 and the sample novel can be read at https://github.com/spc476/NaNoGenMo-2014/blob/master/TheQuantumSuppositionOfOz.txt

And my blog entry that goes into more detail about how it works: http://boston.conman.org/2014/11/29.1

cpressey commented 9 years ago

Nice technique for handling the punctuation -- it does make it more coherent(-seeming) than a run-of-the-mill Markov chain.

It should be possible to clean up the intervening spaces with a postprocessor... I wrote one (here) for my own novel, but admittedly I didn't have quotation marks to deal with.

ikarth commented 9 years ago

I ran into a similar issue with punctuation last year and ended up solving it with a postprocessing step. I'm starting to think that it makes sense to have the generator emit marked-up XML or something and then run clean-up on it as a matter of course.

cpressey commented 9 years ago

Outputting some kind of tree structure (like XML) and then flattening it (sensibly) is a good approach.
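As a rough illustration of the tree-then-flatten idea, the sketch below has the generator emit plain tokens plus nested `Quote` nodes, and a `flatten` function decides the spacing. The `Quote` class, the `CLOSERS` set, and the curly-quote rendering are all assumptions for the example, not anyone's actual tool.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Quote:
    """A quoted span; children are tokens or nested Quote nodes (hypothetical type)."""
    children: List[Union[str, "Quote"]] = field(default_factory=list)

# Punctuation that attaches to the token on its left.
CLOSERS = set(".,;:!?")

def flatten(nodes):
    """Render tokens and Quote nodes as text with conventional spacing."""
    out = ""
    for node in nodes:
        if isinstance(node, Quote):
            # The quotes are balanced by construction, so no matching pass is needed.
            piece = "\u201c" + flatten(node.children) + "\u201d"
        else:
            piece = node
        if out and piece not in CLOSERS:
            out += " "
        out += piece
    return out

print(flatten([Quote(["It's", "just", "nonsense", "!"]), "declared", "Dorothy", "."]))
```

The point of the tree is that an unbalanced quote cannot be represented at all, so the flattening step never has to guess where a quotation ends.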

On the other hand, this level of punctuation/spacing messiness is nothing a few rewriting rules can't clean up.

Given that this seems to be a "problem" that several participants have encountered, I'm working on generalizing the code I wrote into a proper reusable tool of some sort. (nice change to be doing engineering again after all that hackery science, too)

Here's what it does, so far, on an excerpt from The Quantum Supposition of Oz:

“Please tell Ozma, Dorothy, and when I visit Ozma she sometimes allows me to ride upon his back, one seat for each member of the council. The” H. M. “meant Highly Magnified, if you like,” said he.

“I dunno where this tunnel in the mountain he said to himself:

“Do,” said Nikobob, “said the stuffed one, seriously.

“I've forgotten, and I'm surprised that I was not a live thing; you're a dummy.”

“It's just nonsense!” declared Dorothy.

(I love that last line :)

I don't know how long I'll spend on perfectionistically engineering this, but I'm hoping to end up with something like BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/), except for plain text.

If I'm happy with it before 11 more months have passed, I'll announce it on next year's Resources issue :)

enkiv2 commented 9 years ago

I've had good luck in the past treating punctuation as its own token, then normalizing with sed -E 's/ ([.,?!:;]) /\1 /g;s/ ([([]) ([A-Za-z0-9])/\1\2/g;s/([A-Za-z0-9]) ([])]) /\1\2/g' -- in other words, left-aligning all the stops and the right-hand grouping symbols and right-aligning the left-hand grouping symbols. Then, you need another stage for handling quotes -- but without balancing, that's more of a pain.
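The same three rewriting rules can be sketched in Python. This is a looser adaptation rather than an exact translation of the sed command: it drops the requirement for an alphanumeric neighbour and keeps the space on the outer side of brackets. The function name `normalize` is made up for the example, and quotes are deliberately left alone.

```python
import re

def normalize(text):
    """Reattach space-separated punctuation tokens to their neighbouring words."""
    # Stops attach to the word on their left: "Oz , and" -> "Oz, and".
    text = re.sub(r" +([.,?!:;])", r"\1", text)
    # Opening brackets attach to the word on their right: "( the" -> "(the".
    text = re.sub(r"([(\[]) +", r"\1", text)
    # Closing brackets attach to the word on their left: "girl )" -> "girl)".
    text = re.sub(r" +([)\]])", r"\1", text)
    return text

print(normalize("Oz , Dorothy ( the girl ) said : go !"))
```

Quotation marks cannot be handled by rules of this kind, because the same character opens and closes a span and the rule has to know which one it is looking at.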

MichaelPaulukonis commented 9 years ago

A different approach to Markov tokenization - I've worked with punctuation before in different ways, but for text blobs, so I never had to worry about the spacing. I appreciated the links to Racter/PBiHC, since I hadn't seen the template details before.

spc476 commented 9 years ago

You're welcome. It's surprising there's so little information about Racter out there (and according to Google, I appear to be one of the experts about Racter---sigh). The source to Racter is out there, but what is there appears to be the post-processed output from INRAC, a custom language used to write Racter. It's bizarre (http://boston.conman.org/2008/06/18.2).

cpressey commented 9 years ago

That... is actually a pretty nifty control structure. "Find all labels that match this pattern, then pick one of those labels at random and call it."
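A small sketch of that control structure in Python, for anyone who wants to try it. This is not INRAC syntax or semantics: using glob patterns for the match, and the label table itself, are assumptions made purely for illustration.

```python
import fnmatch
import random

def call_matching(labels, pattern, rng=random):
    """Call one randomly chosen routine whose label matches `pattern`."""
    matches = sorted(name for name in labels if fnmatch.fnmatchcase(name, pattern))
    if not matches:
        raise LookupError(f"no label matches {pattern!r}")
    return labels[rng.choice(matches)]()

# Hypothetical label table; the names and outputs are invented for the example.
LABELS = {
    "greet_formal": lambda: "Good day.",
    "greet_casual": lambda: "Hi!",
    "farewell": lambda: "Goodbye.",
}

print(call_matching(LABELS, "greet_*"))
```

Adding variety then only takes a new routine whose label fits the pattern; the call sites do not change.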