NaNoGenMo / 2021

National Novel Generation Month, 2021 edition.

Dissociated Parse #62

cpressey opened this issue 2 years ago

cpressey commented 2 years ago

It's well known that Markov chains don't understand grammar. Any sequences in the output that might look grammatical are only there because grammatical-looking sequences are statistically likely.

Off and on I've been interested in seeing if one can make variations of Markov generators that do retain some of the syntactic structure of the original text.

I had some success with it in 2019 with Anne of Green Garbles, which shows that (roughly speaking) one can combine two Markov models to obtain a third model where the generation can switch between states (like "in narration" and "in dialogue").

This is a second experiment with this goal.

tl;dr -- we can run the Dissociated Press algorithm in a setting that retains the syntactic structure of the text: not on a flat list of words, as usual, but on a forest of parse trees. I call this variation Dissociated Parse.

Obtaining some parse trees from The Wonderful Wizard of Oz using the link-grammar parser, and running Dissociated Parse on them, produces text like:

everything had had it .

she dearly looked the bread were Dorothy carried the rest and filled dried leaves to the Emerald City , not fall into aunt Em , and then shook the tree .

near Dorothy as the country were her been lived as long she he sang briskly . .

to then , the room did not had about it .

the strong pressure at first know Toto until morning dressed than Dorothy that she oiled the house .

So, this is clearly generating nonsense, like a Markov chain would, but it is also retaining some of the source's syntactic structure, like a Markov chain wouldn't. (It still doesn't understand anything about tense and aspect, though; and the link-grammar parser doesn't always parse exactly as you'd expect; so for these reasons it still frequently falls short of "syntactically well-formed".)

Dissociated Press

For background, a description of the Dissociated Press algorithm.

Just as there is more than one algorithm for sorting, there is more than one algorithm for generating a Markov chain.

The usual algorithm involves analyzing the source text and building a weighted transition table, then finding a random (but likely) path through this table.
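For concreteness, here is a minimal sketch of that usual approach in Python -- an order-1 model over words, with names of my own invention, not taken from any entry's code:

    import random
    from collections import defaultdict

    def build_table(words):
        # map each word to the list of words that follow it;
        # repetitions in the list act as the weights
        table = defaultdict(list)
        for current, following in zip(words, words[1:]):
            table[current].append(following)
        return table

    def generate(table, start, length):
        word, output = start, [start]
        while len(output) < length:
            followers = table.get(word)
            if not followers:
                break  # dead end: this word only occurs at the end of the text
            word = random.choice(followers)
            output.append(word)
        return " ".join(output)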

A probably less well-known algorithm, called Dissociated Press, goes like this (a runnable sketch follows the steps):

  1. load all the words into a list in memory
  2. select some word as the starting word
  3. print the current word
  4. find all occurrences of the current word in the text
  5. select one of those occurrences at random and jump to it
  6. move to the next word in the text
  7. repeat from step 3
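Rendered as runnable Python, under the same caveats as the sketch above, those steps might look like this:

    import random

    def dissociated_press(words, length):
        # step 1 is the words list itself; step 2: pick a starting word
        position = random.randrange(len(words))
        output = []
        while len(output) < length:
            # step 3: print (here, collect) the current word
            output.append(words[position])
            # steps 4 and 5: find all occurrences of the current word
            # and jump to one of them at random
            occurrences = [i for i, w in enumerate(words) if w == words[position]]
            position = random.choice(occurrences)
            # step 6: move to the next word in the text
            position += 1
            if position == len(words):
                break  # ran off the end of the text
        return " ".join(output)

Scanning the whole list on every step is slow; precomputing an index from each word to its positions would speed this up without changing the output.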

Even though this works rather differently from the transition table algorithm, it produces statistically identical output. (Exercise for the reader: convince yourself that the two really are equivalent.)

One downside of this algorithm is that it requires the entire text to be kept in memory, rather than just a transition table. But this is also an upside, in the sense that variations on the algorithm can exploit structure in the text that would not have been retained in a transition table.

Dissociated Parse adapts this to work recursively on parse trees. Consider a parse tree to consist of a word, a part-of-speech tag, and zero or more child trees. Here is a sketch of the algorithm:

traverse(tree):
    1. find all trees that have the same part-of-speech tag as this tree
    2. select one of those trees at random and replace this tree with it
    3. print the word of this tree
    4. for each child of this tree, traverse(child)
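Here is that sketch as runnable Python, with a made-up Tree type standing in for whatever the actual repository uses:

    import random
    from dataclasses import dataclass, field

    @dataclass
    class Tree:
        word: str
        tag: str                # part-of-speech tag
        children: list = field(default_factory=list)

    def collect(tree, pool):
        # flatten the forest into a pool of every subtree
        pool.append(tree)
        for child in tree.children:
            collect(child, pool)

    def traverse(tree, pool, output):
        # steps 1 and 2: replace this tree (by rebinding, not mutation)
        # with a random tree from the pool that has the same tag
        candidates = [t for t in pool if t.tag == tree.tag]
        tree = random.choice(candidates)
        # step 3: print (here, collect) the word of this tree
        output.append(tree.word)
        # step 4: recurse into each child of the replaced tree
        for child in tree.children:
            traverse(child, pool, output)

To generate a sentence, build the pool with collect over every tree in the forest, then call traverse on a root and join the collected words.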

That's all it is. One obvious improvement that I haven't tried yet is to have it select among trees that share both the part of speech and the word of the current tree. I hope to try that, and to submit a 50,000-word run too. But in case I don't get around to those, I wanted to at least get this writeup out.

cpressey commented 2 years ago

There is a repository now, here: https://github.com/catseye/Dissociated-Parse

I have also adapted the algorithm as I mentioned, to pick each new tree based on the part of speech plus the first word of the current tree. (The link-grammar parser sometimes bundles a number of words into a single tree.) In terms of the sketch above, only the selection step changes, roughly like so (again, my own rendering, not necessarily the repository's code):
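    # select among trees that match on both the tag and the first word;
    # "first word" because the parser may bundle several words into one tree
    candidates = [t for t in pool
                  if t.tag == tree.tag
                  and t.word.split()[0] == tree.word.split()[0]]

The output is similar -- perhaps a little more structured, or perhaps that is just my imagination: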

Winkies were unharnessed from terror and scampered away times before the Throne Room to not say anything , , for kissing his head near the most , and they who to have she had come any nearer wings on the Yellow Castle until to come by was loaded with sawdust and nuts , by he came , they and many other good things to her dress .

they would lie beside her .

he fell asleep only .

to give the place a nest .

another question came in a hearty rippling brook , and soon after they of ripening grain , with her head her castle by it in a voice of the blocks .

she owned half it in her comrades .

she stopped so furiously that there is as industrious the room that she sat upon the Wicked Witch .

MichaelPaulukonis commented 2 years ago

Now, throw a second text at it (say, the Apocalypse Now script) ......

cpressey commented 2 years ago

> Now, throw a second text at it (say, the Apocalypse Now script) ......

That is indeed a suggestion. If it happens, it won't've been me that made it happen.

Ars longa, November brevis.

cpressey commented 2 years ago

A novel generated using this technique can be found here:

The Lion, the Witches, and the Weird Road

wc -w tells me it is 50,277 words long, so could I get a coveted green 'completed' tag, please?

hugovk commented 2 years ago

Enjoy your well-deserved 'completed' tag!

cpressey commented 2 years ago

A postscript of sorts, for anyone who might still be reading this issue.

When I was in the middle of working on this entry, the output was not bad, lexically speaking, but it was really lacking in terms of correct (or at least plausible) punctuation, spacing, and capitalization.

This is a situation I've encountered before, in a previous NaNoGenMo, and to deal with it back then I wrote a tool called T-Rext that tries to clean up those things in a mechanical way.

So I wanted to use T-Rext in this project too. But when I tried it again, I was disappointed on two fronts: it only ran under Python 2, and its cleanup was not as thorough as I needed (it did nothing about capitalization, for instance).

I ended up using it regardless -- I ran it from a Python 2.7-based Docker image and wrote an ad-hoc cleanup script to run after it, to compensate for its shortcomings.

But after NaNoGenMo ended I had some time to go back and fix it properly. So, there is now a version 0.2 of T-Rext which runs under Python 3 and does more thorough cleanup, including capitalization.

If I had made these improvements before or during NaNoGenMo, I wouldn't have had to write that ad-hoc cleanup script.

T-Rext is in the public domain so please feel free to use it in your own efforts if you feel it would be useful to you.

That's all I wanted to say. Have a nice day, Happy Holidays, and see you again next year.