NaNoGenMo / 2017

National Novel Generation Month, 2017 edition.
https://nanogenmo.github.io

1001 plots #112

Open WhiteFangs opened 6 years ago

WhiteFangs commented 6 years ago

My idea for this year is to generate 1001 plots of around 50 words each, along with their titles, using the WikiPlots dataset and simple Markov chains.

I didn't think I would find time in November to join this year's edition, but I found one available evening and started this. My handicap is that I planned to do it in only a few hours, and in PHP (for a lot of not-very-good reasons).

Anyway, I started a few hours ago and struggled to build the statistical model for my Markov chain generator from a 220 MB text file containing all the plots, but I found a way (basically by cutting it into smaller files). Now I'm stuck with a >200 MB PHP array that I will try to use to generate the small plots. Let's hope it works; pray for my RAM.
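As a rough illustration of the kind of count array involved, here is a minimal PHP sketch (file and variable names are made up, not from the actual repo). It streams the corpus line by line rather than literally splitting it into files, which serves the same memory goal:

```php
<?php
// Minimal sketch, assuming a plain-text corpus at plots.txt (illustrative
// name). Streaming line by line keeps memory bounded instead of loading
// the whole 220 MB file at once.
$model = [];  // word => [nextWord => occurrence count]
$handle = fopen('plots.txt', 'r');
$prev = null;
while (($line = fgets($handle)) !== false) {
    $words = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if ($prev !== null) {
            // count how often $word follows $prev
            $model[$prev][$word] = ($model[$prev][$word] ?? 0) + 1;
        }
        $prev = $word;
    }
}
fclose($handle);
```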

I plan to release the array-generation code as well as the text-generation code (but not the full data, because it's a bit heavy and can be rebuilt from the dataset).

WhiteFangs commented 6 years ago

I ended up using a lighter version of my PHP array; it was still more than 70 MB but reasonably usable. Here's the code: https://github.com/WhiteFangs/1001plots The resulting text is in the 1001-plots.html file, also on my website: http://louphole.com/divers/1001-plots.html

I was hoping to get more readable plots, but I fear Markov chains were not sufficient this time. Anyway, I plan to update the README later and maybe generate another sample with the full array, although I doubt the results will be much better.

WhiteFangs commented 6 years ago

I have an (easy) idea to (maybe) improve my model without making it heavier. I'll let it train on a subpart of the corpus to learn words (as I did for my light array), then run it through the rest of the corpus without adding the new words it encounters (adding new words is what makes the model much heavier after each pass). It will only increment occurrences of already-known words, improving the statistical model without making it bigger. I hope to get more human-readable results thanks to this. I'll keep the thread updated.
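A sketch of how that two-pass idea could look (function and file names are illustrative, not from the actual repo):

```php
<?php
// Pass 1 adds new words from a small subset; pass 2 walks the rest of the
// corpus and only updates counts between words the model already knows,
// so no new keys are created and the array stops growing.
function trainPass(array &$model, string $file, bool $addNewWords): void
{
    $prev = null;
    $handle = fopen($file, 'r');
    while (($line = fgets($handle)) !== false) {
        $words = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if ($addNewWords && !isset($model[$word])) {
                $model[$word] = [];  // register a new vocabulary entry
            }
            if ($prev !== null && isset($model[$prev]) && isset($model[$word])) {
                $model[$prev][$word] = ($model[$prev][$word] ?? 0) + 1;
            }
            $prev = $word;
        }
    }
    fclose($handle);
}

$model = [];
trainPass($model, 'plots-subset.txt', true);   // pass 1: learn the vocabulary
trainPass($model, 'plots-rest.txt', false);    // pass 2: refine counts only
```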

WhiteFangs commented 6 years ago

So I ended up using an even lighter version of the model: 3000 plots to learn the words, and the rest of the corpus only to refine the counts. The result seems better; still not very readable, but sometimes funny. It kind of reads like plots told by a child who has no proper grammar but a good enough vocabulary. Think of it that way and it can actually make some sense =)

I also changed the length of the plots: each is now between 50 and 250 words. The result is here: http://louphole.com/divers/1001-plots.html
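For completeness, a minimal sketch of how generation with that random length could look in PHP (the function names and the weighted pick are my assumptions, not necessarily how the repo does it):

```php
<?php
// Draw the next word with probability proportional to its count, and give
// each plot a random target length between 50 and 250 words as described.
function pickNext(array $choices): string
{
    $roll = mt_rand(1, array_sum($choices));
    foreach ($choices as $word => $count) {
        $roll -= $count;
        if ($roll <= 0) {
            return $word;
        }
    }
    return array_keys($choices)[0];  // unreachable safety fallback
}

function generatePlot(array $model): string
{
    $length = mt_rand(50, 250);
    $word = array_rand($model);      // random starting word
    $plot = [$word];
    while (count($plot) < $length && !empty($model[$word])) {
        $word = pickNext($model[$word]);
        $plot[] = $word;
    }
    return implode(' ', $plot);
}
```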

greg-kennedy commented 6 years ago

Not sure how much you've done with Markov chains before, but grammar quality is basically controlled by the (word) length of each phrase in your lookup hash. This is called the "order" in technical terms, at least according to Wikipedia.

I looked over your code and it seems like your table is "word1" -> pick_random_of("word2","word3","word4"), which is essentially just Order == 1.

To get better results, your seed should be a two- or three-word phrase, so the follow-up word makes more sense in context. That way, instead of picking the next word based only on the single word before it, you pick it based on the previous 2 or 3 words.

"word1 word2" -> array("word3", "word5"), "word2 word3" -> array("word4"), etc

If you're familiar with Perl at all, maybe give this a look over. I wrote a Markov Perl module for an entry a couple of years ago; you can steal ideas from it.

https://github.com/greg-kennedy/MarkovChain

WhiteFangs commented 6 years ago

I was aware of the order parameter for Markov chains, but the WikiPlots corpus contains many (many) proper nouns, and I feared a higher order would bias the model into copying existing sentences. Also, I didn't have time to test whether the results would be better.

I did the same thing as you but in PHP; that's roughly the code I used for my model and text generation: https://github.com/WhiteFangs/WordBasedMarkov

Thanks for your advice though!